September 11, 2018
This article was contributed by Peter Geoghegan
PostgreSQL 11 had its third beta release on August 9; a fourth beta (or possibly a release candidate) is scheduled for mid-September. While the final release of the relational database-management system (currently slated for late September) will have something new for many users, its development cycle was notable for being a period when the community hit its stride in two strategic areas: partitioning and parallelism.
Partitioning and parallelism are touchstones for major relational database systems. Proprietary database vendors manage to extract a premium from a minority of users by upselling features in these areas. While PostgreSQL has had some of these "high-tier" items for many years (e.g., CREATE INDEX CONCURRENTLY, advanced replication functionality), the upcoming release expands the number considerably. I may be biased as a PostgreSQL major contributor and committer, but it seems to me that the belief that community-run database system projects are not competitive with their proprietary cousins when it comes to scaling enterprise workloads has become just about untenable.
Partitioning is the ability to subdivide a parent table into a number of smaller tables that can be managed individually, while preserving the appearance of a single table to application code. Individual partitions can be indexed, migrated to a different filesystem/tablespace, or dropped altogether by the administrator. There may be significant benefits for performance, especially with things like reporting queries that must access most or all of the data in an entire partition. The partition key is often based on a natural way of dividing the data, such as partitioning sales data by country or year.
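A minimal sketch of range partitioning along those lines (table and column names are purely illustrative):

```sql
-- A parent table partitioned by year of sale.
CREATE TABLE sales (
    sale_id  bigint NOT NULL,
    country  text NOT NULL,
    sold_at  date NOT NULL,
    amount   numeric
) PARTITION BY RANGE (sold_at);

CREATE TABLE sales_2017 PARTITION OF sales
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
CREATE TABLE sales_2018 PARTITION OF sales
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

-- A stale partition can later be detached or dropped wholesale:
-- ALTER TABLE sales DETACH PARTITION sales_2017;
```

Application code continues to read and write `sales`; the system routes rows to the appropriate partition.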
Partitioning now supports unique indexes, foreign keys, and automatic creation of new, equivalent indexes as new partitions are created. In general, partitions now have most of the capabilities of ordinary tables. Note that unique indexes must include the partition key; global indexes, where a single B-Tree structure covers every partition, may appear as an enhancement in some future version.
Support for hash-based partition keys has been added. This allows partitions to be defined in terms of a hash key, relieving users from managing partitions when there is no natural way to uniformly separate rows into partitions.
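For example, a table could be spread evenly across four hash partitions (names here are illustrative):

```sql
-- Rows are assigned to a partition by hashing the key column.
CREATE TABLE accounts (
    account_id bigint NOT NULL,
    balance    numeric
) PARTITION BY HASH (account_id);

CREATE TABLE accounts_p0 PARTITION OF accounts
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE accounts_p1 PARTITION OF accounts
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE accounts_p2 PARTITION OF accounts
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE accounts_p3 PARTITION OF accounts
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```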
UPDATE statements can move a row across partition boundaries in v11. This occurs when the update happens to affect a column that participates in defining the boundaries. Frequently doing so might defeat the purpose of partitioning, but making it at least possible could make all the difference in a pinch.
"Dynamic" partition elimination can take place via partitionpruning in v11. In version 10, the query planner inferred which partitionscould safely be excluded from consideration during query optimization. Inversion 11, dynamic pruning will also take into consideration information that is only available during query execution,thereby eliminating more unnecessary partitions in somecases.
Last but not least, partitioning can take advantage of cases where aggregation has a GROUP BY clause that contains the partition key (i.e., partition-wise aggregate), and cases where two tables that happen to have identical partitioning schemes are joined (i.e., partition-wise join). The general idea is that query execution can divide and conquer by joining and aggregating individual partitions rather than naively operating through the parent table. These optimizations are governed by the settings enable_partitionwise_aggregate and enable_partitionwise_join, both of which are set to "off" by default.
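A sketch of how these settings might be exercised; the two tables are hypothetical and assumed to share an identical partitioning scheme on their join/grouping key:

```sql
SET enable_partitionwise_join = on;
SET enable_partitionwise_aggregate = on;

-- With both tables partitioned identically on region_id, the planner
-- may now join and aggregate partition by partition:
EXPLAIN
SELECT o.region_id, sum(o.amount)
FROM orders o JOIN targets t USING (region_id)
GROUP BY o.region_id;
```

When the optimizations apply, the resulting plan shows per-partition join and aggregate nodes rather than a single operation over the parent tables.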
Parallelism in PostgreSQL has come a long way since it was introduced in PostgreSQL 9.6 just two years ago. Version 11 promises to fill many of the remaining gaps in the system's parallel capabilities.
CREATE INDEX can now scan a table in parallel, which will take place for B-Tree index builds whenever the new strategy is thought favorable, subject to various resource limits. Forming new index tuples occurs in worker processes, as does almost all sorting work. (The duration of a CREATE INDEX for B-Tree indexes is largely a matter of how quickly a batch sort operation can be set up and executed.)
Though as the feature's primary author I can make no claim of objectivity, I suspect that parallel CREATE INDEX will be a boon to applications that have not previously seen a benefit from parallelism. CREATE INDEX is an intensive batch operation that is often CPU bound; it is a natural target for parallel processing. That isn't necessarily the case for any SELECT query in certain applications. There may be no SELECT statement that processes enough rows at a time to ever benefit from parallelism. For my part, the feature is the culmination of years of work on low-level sorting optimizations.
Parallel B-Tree index builds will now usually be 2-3x faster than the equivalent serial index builds. Hat-tip to my co-authors, Rushabh Lathia and Heikki Linnakangas.
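A minimal sketch of steering the new behavior, assuming a hypothetical table `measurements`:

```sql
-- Parallel index builds are capped by the new (v11) setting
-- max_parallel_maintenance_workers, which defaults to 2.
SET max_parallel_maintenance_workers = 4;

CREATE INDEX measurements_taken_at_idx ON measurements (taken_at);

-- A per-table override is also possible when one table warrants
-- more (or fewer) workers than the global setting allows:
ALTER TABLE measurements SET (parallel_workers = 4);
```

As with other parallel operations, the worker count actually used also depends on table size and available resources.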
A hash join is a join operation that is executed by building an ad-hoc hash table in memory for one side of the join (typically the smaller of the two inputs), and subsequently probing the hash table for a matching tuple when scanning the other side. It is possible that the hash table will be spilled to disk. Hash joins are often seen in query plans when there is an equi-join with an inherent need to process a large number of rows, especially when one side of the join has far fewer input rows than the other. When the join qualification is selective, such that most hash-table probes don't find a match to output a joined tuple, the hash join's inputs are usually already-filtered scans on both sides, so the inputs are generally not correlated.
PostgreSQL 11 introduces "parallel-aware hash joins". While it was possible for a parallel sequential scan to feed a hash join in previous releases, the hash table on the inner side of the join had to be independently built within each parallel worker process. While this wasn't necessarily a bad strategy (it's still used when the hash table/inner side input is a tiny lookup table), it was quite limiting much of the time. Hash joins can now store their hash table in shared memory and can build it in parallel. The feature's primary author, Thomas Munro, wrote a blog post on this work that explains how it was developed and puts it in context.
This enhancement greatly increases the scalability of hash joins. Query plans with a hash join that builds a large hash table in parallel are now an excellent general showcase for PostgreSQL's parallel query capabilities. Being able to build the hash table in parallel often enables executing underlying "partial" scans in parallel, so parallel-aware hash joins enable greater parallelism within the query plan as a whole. Both sides of a hash join can be fed by a parallel sequential scan.
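A plan along these lines might look like the following; the tables are hypothetical, and the plan is abridged:

```sql
EXPLAIN (COSTS OFF)
SELECT count(*)
FROM big_fact f JOIN big_dim d USING (dim_id);

-- Possible v11 plan shape:
--   Finalize Aggregate
--     ->  Gather
--           ->  Partial Aggregate
--                 ->  Parallel Hash Join
--                       ->  Parallel Seq Scan on big_fact f
--                       ->  Parallel Hash
--                             ->  Parallel Seq Scan on big_dim d
```

The "Parallel Hash" node is the tell: the hash table lives in shared memory and all workers cooperate to build and probe it.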
After over two years of work, Andres Freund committed a series of patches to add just-in-time (JIT) compilation to the PostgreSQL executor in March. JIT compilation makes use of the LLVM compiler framework. The general idea is to replace static machine code with specialized versions that are JIT compiled, especially for queries where comparatively small amounts of code bottleneck their execution.
This builds on an enhancement added to PostgreSQL 10 by Freund, with assistance from Tom Lane, which overhauled expression evaluation and target-list projection. That is, the method by which the executor evaluated expressions to filter rows (such as WHERE A = 15), and the method by which the executor projected expressions from rows to the next level up in the plan tree (such as RETURNING lower(name)), were each redesigned.
The v10 improvement essentially flattened the representation of expression state and changed how expression evaluation was performed. A compact, linear representation based on opcode dispatch was adopted, superseding an earlier tree-based representation. This helped performance in its own right by avoiding the need to do a tree walk that visits each node recursively. Individual nodes are trivially simple much of the time, so fixed overheads (e.g., call-stack space) can be important. Instruction cache misses are often the dominant cost incurred by recursive evaluation.
The follow-up v11 work builds on the flattened, composable representation by adding support for JIT-compiling highly specialized variants of expression evaluation and tuple deforming code, all without compromising the mathematically rigorous abstraction of set processing over relations. "Stock" LLVM intermediate representation (IR) bitcode files are generated for most standard operators (the underlying functions are compiled to IR bitcode) when PostgreSQL is initially built (--with-llvm builds only).
When a query is JIT compiled by PostgreSQL during query evaluation, the LLVM optimizer generates optimized IR based on the stock unoptimized IR. LLVM is well suited to JIT compiling optimized IR from IR generated by static languages such as C. Non-recursive expression evaluation allows less performance-critical code to be shared between interpreted and compiled plans, which was a crucial maintainability consideration for Freund.
The LLVM optimization process has access to the full context of which operators are applied to which columns, the shape of tuples (e.g., whether a fixed column offset can be used), whether or not constant folding may be possible, and other such minutiae. A lot of branches can be entirely removed in the final generated machine code. The performance improvements that Freund has been able to show [PDF] using queries from a fair-use implementation of the industry-standard TPC-H benchmark have been impressive.
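In a --with-llvm build, JIT can be observed directly; the query and table here are hypothetical:

```sql
-- JIT is gated by cost thresholds (jit_above_cost and friends), so
-- it only kicks in for queries the planner estimates as expensive.
SET jit = on;

EXPLAIN (ANALYZE, COSTS OFF)
SELECT sum(amount * 1.05)
FROM big_sales
WHERE amount > 100;
```

When JIT fires, the EXPLAIN ANALYZE output ends with a "JIT" section reporting how many functions were compiled and the time spent on optimization and code emission, which makes it easy to judge whether compilation paid for itself.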
A traditional complaint from those moving from other systems to PostgreSQL has been the lack of stored procedures. A v11 feature from Peter Eisentraut goes some way toward addressing these concerns.
Procedures are similar to functions, in that they are routines that consist of reusable code that can be called from SQL. Unlike functions, procedures cannot return a value when called and cannot appear as part of a SELECT statement — a new CALL command has been added to execute procedures. The CALL command will begin and end transactions on the caller's behalf when the called procedural code requires it. Traditional functions, in contrast, are always subordinate to whatever specific SQL statement happens to execute them. However, CALL cannot be used inside an existing transaction block in most cases.
Transaction control features (COMMIT and ROLLBACK interfaces) were added to all of the standard procedural languages as part of the v11 feature: PL/pgSQL, PL/Python, PL/Perl, and PL/Tcl. It's clear to me that a certain proportion of application developers are allergic to putting business logic in the database system, despite the fact that batch processing logic is often a great deal faster when implemented that way. Perhaps the fact that clients are now able to cede full control of transaction scope to procedural code running on the database server (code that can be written in a familiar scripting language) will help application developers to get over the understandable aversion. Hope springs eternal.
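A sketch of the classic use case, batched deletion, in PL/pgSQL; the table and procedure names are hypothetical:

```sql
CREATE PROCEDURE purge_old_events(batch_size bigint)
LANGUAGE plpgsql
AS $$
BEGIN
    LOOP
        -- Delete one batch of expired rows.
        DELETE FROM events
        WHERE ctid IN (SELECT ctid FROM events
                       WHERE created < now() - interval '90 days'
                       LIMIT batch_size);
        EXIT WHEN NOT FOUND;
        -- End the current transaction and start a new one, so locks
        -- and bloat from each batch are released as we go. This is
        -- only possible in a procedure, not a function.
        COMMIT;
    END LOOP;
END;
$$;

CALL purge_old_events(10000);
```

A function attempting the same COMMIT would fail; ceding transaction control to the server-side routine is exactly what the new procedure support adds.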
A personal favorite and insider's pick, INCLUDE indexes are those that have some number of extra included columns. These extra columns do not participate in comparisons as rows are located by index scans — their values are just an additional payload that does not affect unique enforcement for unique indexes or constraints. (It's possible to use this with a secondary/non-unique index as well, but that will be far less common.)
Judicious use of INCLUDE indexes can increase the number of index-only scans, and thus decrease the aggregate amount of I/O used to service queries. This enhancement allows users to add extra columns to a primary key or unique constraint without affecting when or how duplicate violation errors occur. An additional, quasi-redundant index need never be defined. A single all-encompassing index may now be able to service practically all queries with an index-only scan, at least if we make certain reasonable assumptions about the access patterns and rate of churn.
Regrettably, INCLUDE indexes failed to make their way into several previous releases. The community finally found a way forward on certain concerns about the on-disk representation in the final CommitFest of the v11 cycle. Thanks in large part to the hard work of the feature's primary author, Anastasia Lubennikova, administrators will have a powerful new tool for maximizing the use of index-only scans in specific, though important, situations.
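A minimal sketch, with a hypothetical `orders` table:

```sql
-- Uniqueness is enforced on (order_id) alone; customer_id and total
-- merely ride along in the index as payload columns.
CREATE UNIQUE INDEX orders_order_id_covering
    ON orders (order_id) INCLUDE (customer_id, total);

-- This query can now be answered by an index-only scan, without a
-- second, quasi-redundant index on all three columns:
SELECT customer_id, total FROM orders WHERE order_id = 42;
```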
Schema migrations are a challenge in many production environments. Depending on the exact details, a utility statement that adds a new column to a table may either be more or less instantaneous, or may take at least as long as it takes to rewrite the entire table. This is a world of difference. (Lock acquisition is another complicating factor, though it often doesn't need to be given much consideration.)
A rewrite that wasn't strictly necessary could happen anyway in previous versions. In particular, there was typically no good reason for a non-NULL default value within ALTER TABLE ... ADD COLUMN to force a rewrite, but the extra metadata accounting required to avoid one wasn't in place. PostgreSQL 11 is able to store metadata about the default value, which is retrieved during subsequent SELECT statements for columns in on-disk tuples without an explicit representation (those rows from before the column was added to the table). In other words, the "NULL default requires no rewrite" case was extended to include defaults consisting of any expression whose value can be precalculated up front. This is not limited to simple scalar values — only the rare cases where a rewrite is truly necessary (e.g., DEFAULT random()) still force one.
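The distinction can be illustrated with two hypothetical migrations:

```sql
-- Effectively instantaneous in v11: 'USD' is precalculated once and
-- stored as metadata; pre-existing rows get the value synthesized
-- at read time, so no table rewrite occurs.
ALTER TABLE items ADD COLUMN currency text NOT NULL DEFAULT 'USD';

-- Still forces a full-table rewrite: random() is volatile, so each
-- existing row needs its own computed value stored on disk.
ALTER TABLE items ADD COLUMN sample_weight float8 DEFAULT random();
```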
General-purpose database systems must support a wide variety of use cases, especially since users seldom know what will be important to their application well ahead of time. I expect that progress on general performance and scalability will continue or pick up in future releases, though the true rate of progress depends in no small part on the participation of the wider user community. Many enhancements require that some excruciating tradeoff be made. As the progenitor of PostgreSQL, Dr. Michael Stonebraker, put it in his 2015 ACM Turing Award lecture: "If you are going to solve real-world problems, you got to talk to real people to know what they are. The rubber meets the road as opposed to the rubber meets the sky."
Prognostication is a tricky business, but there are some early signs of how version 12 will shape up. Even more operations will be parallelized. A patch for parallel VACUUM is in the works for version 12, written by Masahiko Sawada. We expect to see progress on the zheap project, a longer-term effort led by Robert Haas and Amit Kapila to address problems with VACUUM by completely re-architecting how garbage collection of old row versions takes place.
An index skip scan patch has also been proposed, which enables B-Tree index scans that omit the leading column from a particular available index when the column happens to have few distinct values. A similar patch by Alexander Korotkov to recognize that the ordering provided by a B-Tree index can be made to work by performing many small sorts on an omitted trailing column may finally leave patch purgatory in the v12 cycle. There is a duality here: the first enhancement helps the case where we want to do an index scan using the B-Tree sort order "(B, C)", but only have an index on "(A, B, C)", whereas the second enhancement helps the case where we want to do an index scan with the sort order "(A, B, C)", but only have an index on "(A, B)". The former enhancement "skips", while the latter "extends".
I strongly suspect that both patches will help real-world performance for online transaction processing (OLTP) and web applications quite noticeably, especially by making the indexes that are available work well enough without expert database-administrator intervention. Early feedback from users may validate that intuition, as well as other intuitions held by other contributors.
Another area that's ripe for further improvements, which would likewise benefit from feedback from more sophisticated users, is the query planner. Remaining gaps in the optimizer's capabilities include an inability to push down aggregates below joins and an inability to produce plans that are optimal for certain kinds of star-schema queries.
While building a feature-complete relational database system that deals gracefully with real-world production performance and scalability challenges is an enormous undertaking, PostgreSQL now provides all the major pieces.
I would like to thank Joe Conway for his helpful review of this article.
[Peter Geoghegan works for Crunchy Data.]