Re: [HACKERS] avoiding tuple copying in btree index builds
On Tue, Jun 17, 2014 at 10:08 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Mon, Jun 16, 2014 at 8:10 PM, Tom Lane t...@sss.pgh.pa.us wrote: On a micro-optimization level, it might be worth passing the TID as ItemPointer not ItemPointerData (ie, pass a pointer until we get to the point of actually inserting the TID into the index tuple). I'm not sure that copying odd-size structs should be assumed to be efficient. Yeah, true. Checking existing precedent, it looks like we usually pass ItemPointer rather than ItemPointerData, so it's probably a good idea to do this that way too for reasons of style if nothing else. I kind of wonder whether it's really more efficient to pass an 8-byte pointer to a 6-byte structure than to just pass the structure itself, but it might be. The pointer will certainly be passed in a register, or whatever passes for registers on the particular machine architecture. Weird-size structs, though, tend to have arcane and not-so-efficient rules for being passed by value. It's not unlikely that what the compiler will do under the hood is pass a pointer anyhow, and then do a memcpy to make a local copy in the called function. OK, committed with the suggested changes. (Thanks to Abhijit for pinging me about this, and more generally for his active and effective management of this CommitFest!) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] avoiding tuple copying in btree index builds
On Mon, Jun 16, 2014 at 8:10 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On further review, this is definitely the way to go: it's a straight-up win. The isnull array is never more than one element in length, so testing the single element is quite trivial. The attached, revised patch provides a modest but useful speedup for both hash and btree index builds. Anyone see any reason NOT to do this? So far it looks like a slam-dunk from here. If index_form_tuple leaks any memory in the sort context, this would be not so much a slam-dunk win as a complete disaster. I'm not sure that no-leak is a safe assumption, especially in cases where we do toast compression of the index datums. (In fact, I'm pretty sure that the reason it was coded like this originally was exactly to avoid that assumption.) Yes, that would be bad. The reason I think it's probably OK is that index_form_tuple seems to have had, from ancient times (cf. f67e79045), code of non-trivial to free any memory that's allocated as a result of de-TOASTing, which we presumably would not bother to do if we were counting on memory context resets to clean things up. It calls toast_compress_datum(), which is similarly careful. The most complicated code path appears to be the VARATT_IS_EXTERNAL() case, where we call heap_tuple_fetch_attr(), which calls toast_fetch_datum(). I'm kind of wondering whether that can really happen, though, because I didn't think indexes could use TOAST storage. Even if it does happen I don't see any obvious reason why it should leak, but it'd be harder to be certain that it doesn't. Assuming that you inspect the code and determine it's safe, the patch had better decorate index_form_tuple with dire warnings about not leaking memory; because even if it's safe today, somebody might break it tomorrow. On a micro-optimization level, it might be worth passing the TID as ItemPointer not ItemPointerData (ie, pass a pointer until we get to the point of actually inserting the TID into the index tuple). I'm not sure that copying odd-size structs should be assumed to be efficient. Yeah, true. Checking existing precedent, it looks like we usually pass ItemPointer rather than ItemPointerData, so it's probably a good idea to do this that way too for reasons of style if nothing else. I kind of wonder whether it's really more efficient to pass an 8-byte pointer to a 6-byte structure than to just pass the structure itself, but it might be. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] avoiding tuple copying in btree index builds
Robert Haas robertmh...@gmail.com writes: On Mon, Jun 16, 2014 at 8:10 PM, Tom Lane t...@sss.pgh.pa.us wrote: On a micro-optimization level, it might be worth passing the TID as ItemPointer not ItemPointerData (ie, pass a pointer until we get to the point of actually inserting the TID into the index tuple). I'm not sure that copying odd-size structs should be assumed to be efficient. Yeah, true. Checking existing precedent, it looks like we usually pass ItemPointer rather than ItemPointerData, so it's probably a good idea to do this that way too for reasons of style if nothing else. I kind of wonder whether it's really more efficient to pass an 8-byte pointer to a 6-byte structure than to just pass the structure itself, but it might be. The pointer will certainly be passed in a register, or whatever passes for registers on the particular machine architecture. Weird-size structs, though, tend to have arcane and not-so-efficient rules for being passed by value. It's not unlikely that what the compiler will do under the hood is pass a pointer anyhow, and then do a memcpy to make a local copy in the called function. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] avoiding tuple copying in btree index builds
On Tue, Jun 3, 2014 at 4:38 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jun 1, 2014 at 3:26 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, May 6, 2014 at 12:04 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, May 5, 2014 at 2:13 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-05-05 13:52:39 -0400, Robert Haas wrote: Today, I discovered that when building a btree index, the btree code uses index_form_tuple() to create an index tuple from the heap tuple, calls tuplesort_putindextuple() to copy that tuple into the sort's memory context, and then frees the original one it built. This seemed inefficient, so I wrote a patch to eliminate the tuple copying. It works by adding a function tuplesort_putindextuplevalues(), which builds the tuple in the sort's memory context and thus avoids the need for a separate copy. I'm not sure if that's the best approach, but the optimization seems wortwhile. Hm. It looks like we could quite easily just get rid of tuplesort_putindextuple(). The hash usage doesn't look hard to convert. I glanced at that, but it wasn't obvious to me how to convert the hash usage. If you have an idea, I'm all ears. I also think it's possible to have similar optimization for hash index incase it has to spool the tuple for sorting. In function hashbuildCallback(), when buildstate-spool is true, we can avoid to form index tuple. To check for nulls before calling _h_spool(), we can traverse the isnull array. Hmm, that might work. Arguably it's less efficient, but on the other hand if it avoids forming the tuple sometimes it might be MORE efficient. And anyway the difference might not be enough to matter. On further review, this is definitely the way to go: it's a straight-up win. The isnull array is never more than one element in length, so testing the single element is quite trivial. The attached, revised patch provides a modest but useful speedup for both hash and btree index builds. Anyone see any reason NOT to do this? So far it looks like a slam-dunk from here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 7abb7a4..c30c6f9 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -142,26 +142,23 @@ hashbuildCallback(Relation index, HashBuildState *buildstate = (HashBuildState *) state; IndexTuple itup; - /* form an index tuple and point it at the heap tuple */ - itup = _hash_form_tuple(index, values, isnull); - itup-t_tid = htup-t_self; - /* Hash indexes don't index nulls, see notes in hashinsert */ - if (IndexTupleHasNulls(itup)) - { - pfree(itup); + if (isnull[0]) return; - } /* Either spool the tuple for sorting, or just put it into the index */ if (buildstate-spool) - _h_spool(itup, buildstate-spool); + _h_spool(buildstate-spool, htup-t_self, values, isnull); else + { + /* form an index tuple and point it at the heap tuple */ + itup = _hash_form_tuple(index, values, isnull); + itup-t_tid = htup-t_self; _hash_doinsert(index, itup); + pfree(itup); + } buildstate-indtuples += 1; - - pfree(itup); } /* @@ -184,10 +181,6 @@ hashinsert(PG_FUNCTION_ARGS) #endif IndexTuple itup; - /* generate an index tuple */ - itup = _hash_form_tuple(rel, values, isnull); - itup-t_tid = *ht_ctid; - /* * If the single index key is null, we don't insert it into the index. * Hash tables support scans on '='. Relational algebra says that A = B @@ -197,11 +190,12 @@ hashinsert(PG_FUNCTION_ARGS) * NOTNULL scans, but that's an artifact of the strategy map architecture * chosen in 1986, not of the way nulls are handled here. */ - if (IndexTupleHasNulls(itup)) - { - pfree(itup); + if (isnull[0]) PG_RETURN_BOOL(false); - } + + /* generate an index tuple */ + itup = _hash_form_tuple(rel, values, isnull); + itup-t_tid = *ht_ctid; _hash_doinsert(rel, itup); diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c index c0d6fec..3a70bc1 100644 --- a/src/backend/access/hash/hashsort.c +++ b/src/backend/access/hash/hashsort.c @@ -90,9 +90,10 @@ _h_spooldestroy(HSpool *hspool) * spool an index entry into the sort file. */ void -_h_spool(IndexTuple itup, HSpool *hspool) +_h_spool(HSpool *hspool, ItemPointerData self, Datum *values, bool *isnull) { - tuplesort_putindextuple(hspool-sortstate, itup); + tuplesort_putindextuplevalues(hspool-sortstate, hspool-index, + self, values, isnull); } /* diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 36dc6c2..1ea4a2c 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -171,28 +171,21 @@ btbuildCallback(Relation index, void *state) { BTBuildState *buildstate = (BTBuildState *) state; - IndexTuple itup; - - /* form an index tuple and point it at the heap tuple
Re: [HACKERS] avoiding tuple copying in btree index builds
Robert Haas robertmh...@gmail.com writes: On further review, this is definitely the way to go: it's a straight-up win. The isnull array is never more than one element in length, so testing the single element is quite trivial. The attached, revised patch provides a modest but useful speedup for both hash and btree index builds. Anyone see any reason NOT to do this? So far it looks like a slam-dunk from here. If index_form_tuple leaks any memory in the sort context, this would be not so much a slam-dunk win as a complete disaster. I'm not sure that no-leak is a safe assumption, especially in cases where we do toast compression of the index datums. (In fact, I'm pretty sure that the reason it was coded like this originally was exactly to avoid that assumption.) Assuming that you inspect the code and determine it's safe, the patch had better decorate index_form_tuple with dire warnings about not leaking memory; because even if it's safe today, somebody might break it tomorrow. On a micro-optimization level, it might be worth passing the TID as ItemPointer not ItemPointerData (ie, pass a pointer until we get to the point of actually inserting the TID into the index tuple). I'm not sure that copying odd-size structs should be assumed to be efficient. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] avoiding tuple copying in btree index builds
On Wed, Jun 4, 2014 at 2:08 AM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jun 1, 2014 at 3:26 AM, Amit Kapila amit.kapil...@gmail.com wrote: I also think it's possible to have similar optimization for hash index incase it has to spool the tuple for sorting. In function hashbuildCallback(), when buildstate-spool is true, we can avoid to form index tuple. To check for nulls before calling _h_spool(), we can traverse the isnull array. Hmm, that might work. Arguably it's less efficient, but on the other hand if it avoids forming the tuple sometimes it might be MORE efficient. And anyway the difference might not be enough to matter. Considering the fact that for hash indexes, this optimization would only get triggered for large indexes (atleast greater than shared buffer's), I agree that it will not be as useful as it will be for btree indexes. The another minor advantage could have been that we can remove tuplesort_putindextuple() from code, if this optimization is done for hash indexes. However we can leave it as it is for now as there doesn't seem to be any gain by doing so. You seem to have removed puttuple_common() call from function tuplesort_putindextuple(), is there any reason for doing so? Apart from that patch looks good to me. Below performance data on various size of indexes shows that this patch can improve performance upto ~7%. The performance difference becomes lesser when the index size is too big and I think that is probably due to the reason that we have to write all the data at end of operation, so if the data is big the improvement is not shown due to large I/O. The performance improvement is shown considering median value of 3 runs and time is taken by enabling \timing option of psql. Performance Data --- Configuration: IBM POWER-7 16 cores, 64 hardware threads RAM = 64GB shared_buffers=8GB pgbench_accounts No. of Records -10million Index - on integer Operation - Reindex Master 16043.500 ms 16058.723 ms 15941.057 ms Patch 15525.054 ms 15551.935 ms 15492.879 ms Perf Improvement: 3.43% pgbench_accounts No. of Records -30million Index - on integer Operation - Reindex Master 51258.338 ms 50520.328 ms 50562.022 ms Patch 49610.710 ms 49302.571 ms 49301.390 ms Perf Improvement: 2.41% table (c1 int, c2 char(10)) No. of records = 300,000 Index - on char(10) Operation - Reindex Master 443.584 ms 444.798 ms 452.888 ms Patch 421.554 ms 430.528 ms 447.558 ms Performance Improvement: 3.2% table (c1 int, c2 char(10)) No. of records = 500,000 Index - on char(10) Operation - Reindex Master 663.621 ms 661.299 ms 657.754 ms Patch 652.325 ms 644.782 ms 643.218 ms Performance Improvement: 2.5% table (c1 int, c2 char(10)) No. of records = 1000,000 Index - on char(10) Operation - Reindex Master 16554.076 ms 16686.528 ms 16571.129 ms Patch 16556.852 ms 16513.543 ms 16610.615 ms Performance Improvement: less than 1% table (c1 int, c2 char(20)) No. of records = 300,000 Index - on char(20) Operation - Reindex Master 429.670 ms 441.445 ms 411.539 ms Patch 401.801 ms 412.716 ms 395.002 ms Performance Improvement: 6.48% table (c1 int, c2 char(20)) No. of records = 500,000 Index - on char(20) Operation - Reindex Master 724.541 ms 731.582 ms 704.934 ms Patch 686.004 ms 677.361 ms 686.915 ms Performance Improvement: 5.31% table (c1 int, c2 char(20)) No. of records = 1000,000 Index - on char(20) Operation - Reindex Master 20728.758 ms 20665.289 ms 20656.375 ms Patch 20594.022 ms 20617.383 ms 20628.181 ms Performance Improvement: 0.2% Apart from this, I have taken data for unlogged tables and much larger indexes and all the data shows similar results as above. So I think above data is quite good representation of value for this patch, however if you or anybody else still feels that more data is required for any other configuration, please do let me know about the same. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] avoiding tuple copying in btree index builds
On Sun, Jun 1, 2014 at 3:26 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, May 6, 2014 at 12:04 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, May 5, 2014 at 2:13 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-05-05 13:52:39 -0400, Robert Haas wrote: Today, I discovered that when building a btree index, the btree code uses index_form_tuple() to create an index tuple from the heap tuple, calls tuplesort_putindextuple() to copy that tuple into the sort's memory context, and then frees the original one it built. This seemed inefficient, so I wrote a patch to eliminate the tuple copying. It works by adding a function tuplesort_putindextuplevalues(), which builds the tuple in the sort's memory context and thus avoids the need for a separate copy. I'm not sure if that's the best approach, but the optimization seems wortwhile. Hm. It looks like we could quite easily just get rid of tuplesort_putindextuple(). The hash usage doesn't look hard to convert. I glanced at that, but it wasn't obvious to me how to convert the hash usage. If you have an idea, I'm all ears. I also think it's possible to have similar optimization for hash index incase it has to spool the tuple for sorting. In function hashbuildCallback(), when buildstate-spool is true, we can avoid to form index tuple. To check for nulls before calling _h_spool(), we can traverse the isnull array. Hmm, that might work. Arguably it's less efficient, but on the other hand if it avoids forming the tuple sometimes it might be MORE efficient. And anyway the difference might not be enough to matter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] avoiding tuple copying in btree index builds
On Tue, May 6, 2014 at 12:04 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, May 5, 2014 at 2:13 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-05-05 13:52:39 -0400, Robert Haas wrote: Today, I discovered that when building a btree index, the btree code uses index_form_tuple() to create an index tuple from the heap tuple, calls tuplesort_putindextuple() to copy that tuple into the sort's memory context, and then frees the original one it built. This seemed inefficient, so I wrote a patch to eliminate the tuple copying. It works by adding a function tuplesort_putindextuplevalues(), which builds the tuple in the sort's memory context and thus avoids the need for a separate copy. I'm not sure if that's the best approach, but the optimization seems wortwhile. Hm. It looks like we could quite easily just get rid of tuplesort_putindextuple(). The hash usage doesn't look hard to convert. I glanced at that, but it wasn't obvious to me how to convert the hash usage. If you have an idea, I'm all ears. I also think it's possible to have similar optimization for hash index incase it has to spool the tuple for sorting. In function hashbuildCallback(), when buildstate-spool is true, we can avoid to form index tuple. To check for nulls before calling _h_spool(), we can traverse the isnull array. It seems converting hash index usage is not as straightforward as btree index, but doesn't look too complex either. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] avoiding tuple copying in btree index builds
Hi, On 2014-05-05 13:52:39 -0400, Robert Haas wrote: Today, I discovered that when building a btree index, the btree code uses index_form_tuple() to create an index tuple from the heap tuple, calls tuplesort_putindextuple() to copy that tuple into the sort's memory context, and then frees the original one it built. This seemed inefficient, so I wrote a patch to eliminate the tuple copying. It works by adding a function tuplesort_putindextuplevalues(), which builds the tuple in the sort's memory context and thus avoids the need for a separate copy. I'm not sure if that's the best approach, but the optimization seems wortwhile. Hm. It looks like we could quite easily just get rid of tuplesort_putindextuple(). The hash usage doesn't look hard to convert. I tested it by repeatedly executing REINDEX INDEX pgbench_accounts_pkey on a PPC64 machine. pgbench_accounts contains 10 million records. With unpatched master as of b2f7bd72c4d3e80065725c72e85778d5f4bdfd4a, I got times of 6.159s, 6.177s, and 6.201s. With the attached patch, I got times of 5.787s, 5.972s, and 5.913s, a savings of almost 5%. Not bad considering the amount of work involved. Yes, that's certainly worthwile. Nice. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] avoiding tuple copying in btree index builds
On Mon, May 5, 2014 at 2:13 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-05-05 13:52:39 -0400, Robert Haas wrote: Today, I discovered that when building a btree index, the btree code uses index_form_tuple() to create an index tuple from the heap tuple, calls tuplesort_putindextuple() to copy that tuple into the sort's memory context, and then frees the original one it built. This seemed inefficient, so I wrote a patch to eliminate the tuple copying. It works by adding a function tuplesort_putindextuplevalues(), which builds the tuple in the sort's memory context and thus avoids the need for a separate copy. I'm not sure if that's the best approach, but the optimization seems wortwhile. Hm. It looks like we could quite easily just get rid of tuplesort_putindextuple(). The hash usage doesn't look hard to convert. I glanced at that, but it wasn't obvious to me how to convert the hash usage. If you have an idea, I'm all ears. I tested it by repeatedly executing REINDEX INDEX pgbench_accounts_pkey on a PPC64 machine. pgbench_accounts contains 10 million records. With unpatched master as of b2f7bd72c4d3e80065725c72e85778d5f4bdfd4a, I got times of 6.159s, 6.177s, and 6.201s. With the attached patch, I got times of 5.787s, 5.972s, and 5.913s, a savings of almost 5%. Not bad considering the amount of work involved. Yes, that's certainly worthwile. Nice. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers