Re: GIN pending list pages not recycled promptly (was Re: [HACKERS] GIN improvements part 1: additional information)
On Wed, Jan 22, 2014 at 9:12 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 01/22/2014 03:39 AM, Tomas Vondra wrote:
What annoys me a bit is the huge size difference between the index updated incrementally (by a sequence of INSERT commands), and the index rebuilt from scratch using VACUUM FULL. It's a bit better with the patch (2288 vs. 2035 MB), but is there a chance to improve this?

Hmm. What seems to be happening is that the pending list pages that the fast update mechanism uses are not getting recycled. When enough list pages are filled up, they are flushed into the main index and the list pages are marked as deleted. But they are not recorded in the FSM, so they won't be recycled until the index is vacuumed. Almost all of the difference can be attributed to deleted pages left behind like that.

So this isn't actually related to the packed postinglists patch at all. It just makes the bloat more obvious, because it makes the actual size of the index, excluding deleted pages, smaller. But it can be observed on git master as well.

I created a simple test table and index like this:

create table foo (intarr int[]);
create index i_foo on foo using gin(intarr) with (fastupdate=on);

I filled the table like this:

insert into foo select array[-1] from generate_series(1, 1000) g;

postgres=# \d+
                     List of relations
 Schema | Name |  Type | Owner  |  Size  | Description
--------+------+-------+--------+--------+-------------
 public | foo  | table | heikki | 575 MB |
(1 row)

postgres=# \di+
                         List of relations
 Schema | Name  |  Type | Owner  | Table |  Size  | Description
--------+-------+-------+--------+-------+--------+-------------
 public | i_foo | index | heikki | foo   | 251 MB |
(1 row)

I wrote a little utility that scans all pages in a GIN index and prints out the flags indicating what kind of page each one is. The distribution looks like this:

   19 DATA
 7420 DATA LEAF
24701 DELETED
    1 LEAF
    1 META

I think we need to add the deleted pages to the FSM more aggressively.
I tried simply adding calls to RecordFreeIndexPage after the list pages have been marked as deleted, but unfortunately that didn't help. The problem is that the FSM is organized into a three-level tree, and RecordFreeIndexPage only updates the bottom level. The upper levels are not updated until the FSM is vacuumed, so the pages are still not visible to GetFreeIndexPage calls until the next vacuum. The simplest fix would be to add a call to IndexFreeSpaceMapVacuum after flushing the pending list, per the attached patch.

I'm slightly worried about the performance impact of the IndexFreeSpaceMapVacuum() call. It scans the whole FSM of the index, which isn't exactly free. So perhaps we should teach RecordFreeIndexPage to update the upper levels of the FSM in retail fashion instead.

I wonder if you pursued this further? You recently added a number of TODO items related to GIN indexes; is it worth adding this to the list?

--
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Thu, Jan 23, 2014 at 9:27 AM, Alexander Korotkov aekorot...@gmail.com wrote:

On Wed, Jan 22, 2014 at 9:28 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 01/22/2014 02:17 PM, Alexander Korotkov wrote:
We already spent a lot of time on compression. Now we need to figure out the result we want to see. I spent quite a long time debugging the varbyte encoding without segments. Finally, it passed a great many tests without any problems. Now it is just a piece of junk. I'm afraid that we will have to reimplement everything from scratch another two or three times because the code doesn't look perfect. For sure, that's incompatible with getting something into 9.4.

That's a bit harsh :-). But yes, I understand what you're saying. It's quite common for large patches like this to be rewritten several times before being committed; you just don't know what works best until you've tried many designs.

Remember we also have the subsequent fast-scan patch, which is badly needed for hstore and other applications. I propose making the final decisions now and concentrating on getting a committable patch with those decisions -- and not changing them even if somebody has an idea of how to improve code readability 100 times over and potential extensibility 1000 times over. I propose the following decisions: 1) Leave the uncompressed area, allowing duplicates in it. 2) Don't introduce Items on the page.

Well, I got the insertions to work now without the uncompressed area, so in the spirit of Just Getting This Damn Patch Committed, I'm going to go ahead with that. We can add the uncompressed area later if performance testing shows it to be necessary. And I agree about the Items-on-page idea - we can still come back to that in the 9.4 timeframe if necessary, but probably not.

So, committed. It's the same design as in the most recent patch I posted, but with a bunch of bugs fixed, and cleaned up all over.
I'm going to move to the fast scan patch now, but please do test and review the committed version to see if I missed something.

Great! Thanks a lot!

The assertion in dataPlaceToPageLeafRecompress is too strong. A page can contain GinDataLeafMaxContentSize bytes. Patch is attached. My test-suite doesn't run correctly. I'm debugging now.

ISTM I found the bug. ginVacuumPostingTreeLeaf re-encodes only some segments. Others are not even re-palloced. They are moved left in dataPlaceToPageLeafRecompress by memcpy. But it's incorrect to do that with memcpy; the proper solution is memmove. Using memcpy there can, platform-dependently, lead to page corruption. Another solution is to re-palloc the segments in ginVacuumPostingTreeLeaf.

--
With best regards,
Alexander Korotkov.

gin-memmove-fix.patch
Description: Binary data
Re: [HACKERS] GIN improvements part 1: additional information
On 01/24/2014 10:03 AM, Alexander Korotkov wrote:
ISTM I found the bug. ginVacuumPostingTreeLeaf re-encodes only some segments. Others are not even re-palloced. They are moved left in dataPlaceToPageLeafRecompress by memcpy. But it's incorrect to do that with memcpy; the proper solution is memmove. Using memcpy there can, platform-dependently, lead to page corruption. Another solution is to re-palloc the segments in ginVacuumPostingTreeLeaf.

Good catch. Thanks, committed, changing the memcpy to memmove. Will have to switch to pallocing everything in the future, if we make leafRepackItems smarter, so that it doesn't rewrite all the segments after the first modified one.

- Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Jan 24, 2014 at 12:50 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 01/24/2014 10:03 AM, Alexander Korotkov wrote:
ISTM I found the bug. ginVacuumPostingTreeLeaf re-encodes only some segments. Others are not even re-palloced. They are moved left in dataPlaceToPageLeafRecompress by memcpy. But it's incorrect to do that with memcpy; the proper solution is memmove. Using memcpy there can, platform-dependently, lead to page corruption. Another solution is to re-palloc the segments in ginVacuumPostingTreeLeaf.

Good catch. Thanks, committed, changing the memcpy to memmove. Will have to switch to pallocing everything in the future, if we make leafRepackItems smarter, so that it doesn't rewrite all the segments after the first modified one.

OK. What about the previous assertion fix?

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 01/24/2014 10:53 AM, Alexander Korotkov wrote:
OK. What about the previous assertion fix?

Ah right, fixed that too now.

- Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Jan 24, 2014 at 1:19 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 01/24/2014 10:53 AM, Alexander Korotkov wrote:
OK. What about the previous assertion fix?

Ah right, fixed that too now.

Good, now my test-suite passes. The results are as follows.

Time of operations:
         event         |     period
-----------------------+-----------------
 index_build           | 00:01:45.709546
 index_build_recovery  | 00:00:09
 index_update          | 00:05:45.732214
 index_update_recovery | 00:01:17
 search_new            | 00:24:02.191049
 search_updated        | 00:26:54.122852
(6 rows)

Index sizes:
     label     |   size
---------------+-----------
 new           | 387547136
 after_updates | 610877440
(2 rows)

Before compression, the results were the following.

Time of operations:
         event         |     period
-----------------------+-----------------
 index_build           | 00:01:51.116722
 index_build_recovery  | 00:00:08
 index_update          | 00:05:12.124758
 index_update_recovery | 00:01:43
 search_new            | 00:23:44.281424
 search_updated        | 00:25:41.96251
(6 rows)

Index sizes:
     label     |    size
---------------+------------
 new           |  884514816
 after_updates | 1595252736
(2 rows)

So, we can see a small regression. However, I'm not sure it's very significant.

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 01/22/2014 09:25 AM, Alexander Korotkov wrote:

On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:
Oh, I see what's going on. I had assumed that there cannot be duplicate insertions into the posting tree, but that's dead wrong. The fast insertion mechanism depends on a duplicate insertion to do nothing.

Ok, this turned out to be a much bigger change than I thought... In principle, it's not difficult to eliminate duplicates. However, to detect a duplicate, you have to check if the item is present in the uncompressed items array, or in the compressed lists. That requires decoding the segment where it should be.

But if we decode the segment, what's the purpose of the uncompressed items array? Its purpose was to speed up insertions, by buffering them so that we don't need to decode and re-encode the page/segment on every inserted item. But if we're paying the price of decoding it anyway, we might as well include the new item and re-encode the segment. The uncompressed area saves some effort in WAL-logging, as the record of inserting an entry into the uncompressed area is much smaller than that of re-encoding part of the page, but if that really is a concern, we could track more carefully which parts of the page are modified, and only WAL-log the required parts. And hopefully, the fast-update lists buffer inserts so that you insert many items at a time into the posting tree, and the price of re-encoding is only paid once.

So, now that I think about this once more, I don't think the uncompressed area in every leaf page is a good idea.

I didn't get it: why not insert duplicate TIDs into the uncompressed area and eliminate them during re-encoding? It would be somewhat more work during updates, more work during scans, and more WAL records. But is it really significant for real-life workloads?
Hmm, so you would merrily insert duplicate TIDs into the uncompressed area, and weed them out when reading or recompressing the page? I had not thought of that. Yeah, it might be a good trade-off; duplicates are not expected to happen very often.

I refactored the way the recompression routine works again. It is now more clearly a multi-step process. First, the existing page is disassembled into an in-memory struct, which is a linked list of all the segments. In memory, each segment can be represented as an array of item pointers, or in the compressed format. In the next phase, all the new items are added to the segments where they belong. This naturally can lead to overly large segments; in the third phase, the items are redistributed among the segments, and compressed back to the compressed format. Finally, the finished segments are written back to the page, or pages if it had to be split. The same subroutines to disassemble and recompress are also used in vacuum.

Attached is what I've got now. This is again quite heavily changed from the previous version - sorry for repeatedly rewriting this. I'll continue testing and re-reviewing this tomorrow.

Let's clarify where we are. We had a quite debugged and tested version with hard-wired varbyte encoding. Now we're reimplementing and debugging a segmented version of varbyte encoding, in the hope that one day we can easily replace it with something better that meets our restrictions (though we haven't found that yet). Is that right?

The segmentation was a sensible change on code-readability grounds alone. Yes, it made it easier to experiment with different encodings, and will make it easier to replace the encoding in the future, but that wasn't the only reason for doing it. It's nice to have the encoding/decoding stuff in ginpostinglists.c, so that the rest of the code just passes around opaque GinPostingList structs (previously known as PostingListSegment).
One thing I have been pondering, though: Instead of storing the posting lists one after another on the leaf page, it might be better to store them as Items on the page, with normal ItemIds pointing to each. The page layout would then be more standard, and you could use PageAddItem/PageIndexMultiDelete to add/remove posting lists to/from the page. The immediate advantage would be that it would make it possible to do a binary search over the segments, to quickly locate the right segment where a tuple belongs. That might not be significant in practice - linearly scanning 32 items is not slow either. And it would add some overhead: one line pointer per segment, or 4 * 32 = 128 bytes per page with the current segment size of 256 bytes. But then again, it might be a good idea just because it would make the pages look more like any other page, which is generally a good thing.

- Heikki
GIN pending list pages not recycled promptly (was Re: [HACKERS] GIN improvements part 1: additional information)
On 01/22/2014 03:39 AM, Tomas Vondra wrote:
What annoys me a bit is the huge size difference between the index updated incrementally (by a sequence of INSERT commands), and the index rebuilt from scratch using VACUUM FULL. It's a bit better with the patch (2288 vs. 2035 MB), but is there a chance to improve this?

Hmm. What seems to be happening is that the pending list pages that the fast update mechanism uses are not getting recycled. When enough list pages are filled up, they are flushed into the main index and the list pages are marked as deleted. But they are not recorded in the FSM, so they won't be recycled until the index is vacuumed. Almost all of the difference can be attributed to deleted pages left behind like that.

So this isn't actually related to the packed postinglists patch at all. It just makes the bloat more obvious, because it makes the actual size of the index, excluding deleted pages, smaller. But it can be observed on git master as well.

I created a simple test table and index like this:

create table foo (intarr int[]);
create index i_foo on foo using gin(intarr) with (fastupdate=on);

I filled the table like this:

insert into foo select array[-1] from generate_series(1, 1000) g;

postgres=# \d+
                     List of relations
 Schema | Name |  Type | Owner  |  Size  | Description
--------+------+-------+--------+--------+-------------
 public | foo  | table | heikki | 575 MB |
(1 row)

postgres=# \di+
                         List of relations
 Schema | Name  |  Type | Owner  | Table |  Size  | Description
--------+-------+-------+--------+-------+--------+-------------
 public | i_foo | index | heikki | foo   | 251 MB |
(1 row)

I wrote a little utility that scans all pages in a GIN index and prints out the flags indicating what kind of page each one is. The distribution looks like this:

   19 DATA
 7420 DATA LEAF
24701 DELETED
    1 LEAF
    1 META

I think we need to add the deleted pages to the FSM more aggressively. I tried simply adding calls to RecordFreeIndexPage after the list pages have been marked as deleted, but unfortunately that didn't help.
The problem is that the FSM is organized into a three-level tree, and RecordFreeIndexPage only updates the bottom level. The upper levels are not updated until the FSM is vacuumed, so the pages are still not visible to GetFreeIndexPage calls until the next vacuum. The simplest fix would be to add a call to IndexFreeSpaceMapVacuum after flushing the pending list, per the attached patch.

I'm slightly worried about the performance impact of the IndexFreeSpaceMapVacuum() call. It scans the whole FSM of the index, which isn't exactly free. So perhaps we should teach RecordFreeIndexPage to update the upper levels of the FSM in retail fashion instead.

- Heikki

*** a/src/backend/access/gin/ginfast.c
--- b/src/backend/access/gin/ginfast.c
***************
*** 21,26 ****
--- 21,27 ----
  #include "access/gin_private.h"
  #include "commands/vacuum.h"
  #include "miscadmin.h"
+ #include "storage/indexfsm.h"
  #include "utils/memutils.h"
  #include "utils/rel.h"
***************
*** 434,440 **** ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
--- 435,453 ----
  	END_CRIT_SECTION();

  	if (needCleanup)
+ 	{
  		ginInsertCleanup(ginstate, false, NULL);
+
+ 		/*
+ 		 * Vacuum the FSM to make the deleted list pages available for re-use.
+ 		 *
+ 		 * ginInsertCleanup marked them as free in the FSM by calling
+ 		 * RecordFreeIndexPage, but that only updates the bottom FSM level.
+ 		 * Since GetFreeIndexPage scans from the top, they won't be visible
+ 		 * for reuse until the upper levels have been updated.
+ 		 */
+ 		IndexFreeSpaceMapVacuum(index);
+ 	}
  }
***************
*** 599,608 **** shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
  		}
  	}

  	for (i = 0; i < data.ndeleted; i++)
  		UnlockReleaseBuffer(buffers[i]);
!
! 	END_CRIT_SECTION();
  } while (blknoToDelete != newHead);

  	return false;
--- 612,625 ----
  		}
  	}

+ 	END_CRIT_SECTION();
+
  	for (i = 0; i < data.ndeleted; i++)
+ 	{
+ 		BlockNumber	blkno = BufferGetBlockNumber(buffers[i]);
  		UnlockReleaseBuffer(buffers[i]);
! 		RecordFreeIndexPage(index, blkno);
+ 	}
  } while (blknoToDelete != newHead);

  	return false;
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 22, 2014 at 12:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 01/22/2014 09:25 AM, Alexander Korotkov wrote:

On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:
Oh, I see what's going on. I had assumed that there cannot be duplicate insertions into the posting tree, but that's dead wrong. The fast insertion mechanism depends on a duplicate insertion to do nothing.

Ok, this turned out to be a much bigger change than I thought... In principle, it's not difficult to eliminate duplicates. However, to detect a duplicate, you have to check if the item is present in the uncompressed items array, or in the compressed lists. That requires decoding the segment where it should be.

But if we decode the segment, what's the purpose of the uncompressed items array? Its purpose was to speed up insertions, by buffering them so that we don't need to decode and re-encode the page/segment on every inserted item. But if we're paying the price of decoding it anyway, we might as well include the new item and re-encode the segment. The uncompressed area saves some effort in WAL-logging, as the record of inserting an entry into the uncompressed area is much smaller than that of re-encoding part of the page, but if that really is a concern, we could track more carefully which parts of the page are modified, and only WAL-log the required parts. And hopefully, the fast-update lists buffer inserts so that you insert many items at a time into the posting tree, and the price of re-encoding is only paid once.

So, now that I think about this once more, I don't think the uncompressed area in every leaf page is a good idea.

I didn't get it: why not insert duplicate TIDs into the uncompressed area and eliminate them during re-encoding? It would be somewhat more work during updates, more work during scans, and more WAL records. But is it really significant for real-life workloads?
Hmm, so you would merrily insert duplicate TIDs into the uncompressed area, and weed them out when reading or recompressing the page? I had not thought of that. Yeah, it might be a good trade-off; duplicates are not expected to happen very often.

I refactored the way the recompression routine works again. It is now more clearly a multi-step process. First, the existing page is disassembled into an in-memory struct, which is a linked list of all the segments. In memory, each segment can be represented as an array of item pointers, or in the compressed format. In the next phase, all the new items are added to the segments where they belong. This naturally can lead to overly large segments; in the third phase, the items are redistributed among the segments, and compressed back to the compressed format. Finally, the finished segments are written back to the page, or pages if it had to be split. The same subroutines to disassemble and recompress are also used in vacuum.

Attached is what I've got now. This is again quite heavily changed from the previous version - sorry for repeatedly rewriting this. I'll continue testing and re-reviewing this tomorrow.

Let's clarify where we are. We had a quite debugged and tested version with hard-wired varbyte encoding. Now we're reimplementing and debugging a segmented version of varbyte encoding, in the hope that one day we can easily replace it with something better that meets our restrictions (though we haven't found that yet). Is that right?

The segmentation was a sensible change on code-readability grounds alone. Yes, it made it easier to experiment with different encodings, and will make it easier to replace the encoding in the future, but that wasn't the only reason for doing it. It's nice to have the encoding/decoding stuff in ginpostinglists.c, so that the rest of the code just passes around opaque GinPostingList structs (previously known as PostingListSegment).
One thing I have been pondering, though: Instead of storing the posting lists one after another on the leaf page, it might be better to store them as Items on the page, with normal ItemIds pointing to each. The page layout would then be more standard, and you could use PageAddItem/PageIndexMultiDelete to add/remove posting lists to/from the page. The immediate advantage would be that it would make it possible to do a binary search over the segments, to quickly locate the right segment where a tuple belongs. That might not be significant in practice - linearly scanning 32 items is not slow either. And it would add some overhead: one line pointer per segment, or 4 * 32 = 128 bytes per page with the current segment size of 256 bytes. But then again, it might be a good idea just because it would make the pages look more like any other page, which is generally a good thing.

We already spent a lot of time on compression. Now we need to figure out the result we want to see. I spent quite
Re: GIN pending list pages not recycled promptly (was Re: [HACKERS] GIN improvements part 1: additional information)
Heikki Linnakangas wrote:

I wrote a little utility that scans all pages in a GIN index and prints out the flags indicating what kind of page each one is. The distribution looks like this:

   19 DATA
 7420 DATA LEAF
24701 DELETED
    1 LEAF
    1 META

Hah.

I think we need to add the deleted pages to the FSM more aggressively. I tried simply adding calls to RecordFreeIndexPage after the list pages have been marked as deleted, but unfortunately that didn't help. The problem is that the FSM is organized into a three-level tree, and RecordFreeIndexPage only updates the bottom level.

Interesting. I think the idea of having an option for RecordFreeIndexPage to update the upper levels makes sense (no need to force it for other users).

Some time ago I proposed an index-only cleanup mode for vacuum. That would help GIN get this kind of treatment (vacuuming its FSM and processing the pending list) separately from vacuuming the index. It's probably too late for 9.4 though.

One other thing worth considering in this area is that making the pending list size depend on work_mem appears to have been a really bad idea. I know one case where the server is really large and runs mostly OLAP-type stuff with occasional updates, so they globally set work_mem = 2GB; they have GIN indexes for text search, and the result is horrible performance 90% of the time, then a vacuum cleans the pending list and it is blazing fast until the pending list starts getting big again. Now you can argue that setting work_mem to that value is a bad idea, but as it turns out, in this case it seems to work fine for everything other than the GIN pending list. Not related to the patch at hand, but I thought I would put it out for consideration, 'cause I'm not gonna start a new thread about it.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] GIN improvements part 1: additional information
On 01/22/2014 02:17 PM, Alexander Korotkov wrote:
We already spent a lot of time on compression. Now we need to figure out the result we want to see. I spent quite a long time debugging the varbyte encoding without segments. Finally, it passed a great many tests without any problems. Now it is just a piece of junk. I'm afraid that we will have to reimplement everything from scratch another two or three times because the code doesn't look perfect. For sure, that's incompatible with getting something into 9.4.

That's a bit harsh :-). But yes, I understand what you're saying. It's quite common for large patches like this to be rewritten several times before being committed; you just don't know what works best until you've tried many designs.

Remember we also have the subsequent fast-scan patch, which is badly needed for hstore and other applications. I propose making the final decisions now and concentrating on getting a committable patch with those decisions -- and not changing them even if somebody has an idea of how to improve code readability 100 times over and potential extensibility 1000 times over. I propose the following decisions: 1) Leave the uncompressed area, allowing duplicates in it. 2) Don't introduce Items on the page.

Well, I got the insertions to work now without the uncompressed area, so in the spirit of Just Getting This Damn Patch Committed, I'm going to go ahead with that. We can add the uncompressed area later if performance testing shows it to be necessary. And I agree about the Items-on-page idea - we can still come back to that in the 9.4 timeframe if necessary, but probably not.

So, committed. It's the same design as in the most recent patch I posted, but with a bunch of bugs fixed, and cleaned up all over.

I'm going to move to the fast scan patch now, but please do test and review the committed version to see if I missed something.
- Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 22, 2014 at 9:28 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 01/22/2014 02:17 PM, Alexander Korotkov wrote:
We already spent a lot of time on compression. Now we need to figure out the result we want to see. I spent quite a long time debugging the varbyte encoding without segments. Finally, it passed a great many tests without any problems. Now it is just a piece of junk. I'm afraid that we will have to reimplement everything from scratch another two or three times because the code doesn't look perfect. For sure, that's incompatible with getting something into 9.4.

That's a bit harsh :-). But yes, I understand what you're saying. It's quite common for large patches like this to be rewritten several times before being committed; you just don't know what works best until you've tried many designs.

Remember we also have the subsequent fast-scan patch, which is badly needed for hstore and other applications. I propose making the final decisions now and concentrating on getting a committable patch with those decisions -- and not changing them even if somebody has an idea of how to improve code readability 100 times over and potential extensibility 1000 times over. I propose the following decisions: 1) Leave the uncompressed area, allowing duplicates in it. 2) Don't introduce Items on the page.

Well, I got the insertions to work now without the uncompressed area, so in the spirit of Just Getting This Damn Patch Committed, I'm going to go ahead with that. We can add the uncompressed area later if performance testing shows it to be necessary. And I agree about the Items-on-page idea - we can still come back to that in the 9.4 timeframe if necessary, but probably not.

So, committed. It's the same design as in the most recent patch I posted, but with a bunch of bugs fixed, and cleaned up all over.

I'm going to move to the fast scan patch now, but please do test and review the committed version to see if I missed something.

Great! Thanks a lot!
The assertion in dataPlaceToPageLeafRecompress is too strong. A page can contain GinDataLeafMaxContentSize bytes. Patch is attached.

My test-suite doesn't run correctly. I'm debugging now.

--
With best regards,
Alexander Korotkov.

gin-assert-fix.patch
Description: Binary data
Re: [HACKERS] GIN improvements part 1: additional information
On 01/21/2014 04:02 AM, Tomas Vondra wrote:

On 20.1.2014 19:30, Heikki Linnakangas wrote:
Attached is yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy testing of this patch now. If you could re-run the tests you've been using, that would be great. I've tested the WAL replay by replicating GIN operations over streaming replication. That doesn't guarantee it's correct, but it's a good smoke test.

I gave it a try - the OOM error seems to be gone, but now I get this PANIC:

PANIC: cannot insert duplicate items to GIN index page

This only happens when building the index incrementally (i.e. using a sequence of INSERT statements into a table with a GIN index). When I create a new index on a table (already containing the same dataset) it works just fine. Also, I tried to reproduce the issue by running a simple plpgsql loop (instead of a complex python script):

DO LANGUAGE plpgsql $$
DECLARE
    r tsvector;
BEGIN
    FOR r IN SELECT body_tsvector FROM data_table LOOP
        INSERT INTO idx_table (body_tsvector) VALUES (r);
    END LOOP;
END$$;

where data_table is the table with the imported data (the same data I mentioned in the post about OOM errors), and idx_table is an empty table with a GIN index. And indeed it fails, but only if I run the block in multiple sessions in parallel.

Oh, I see what's going on. I had assumed that there cannot be duplicate insertions into the posting tree, but that's dead wrong. The fast insertion mechanism depends on a duplicate insertion to do nothing. Will fix, thanks for the testing!

- Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On Mon, Jan 20, 2014 at 10:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 01/17/2014 08:49 PM, Alexander Korotkov wrote:
On Fri, Jan 17, 2014 at 10:38 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
On 01/17/2014 01:05 PM, Alexander Korotkov wrote:
Seems to be fixed in the attached version of the patch. Another improvement in this version: only the left-most modified segment and those after it are re-encoded. The left part of the page is left untouched.

I'm looking into this now. A few quick notes:

* Even when you don't re-encode the whole page, you still WAL-log the whole page. While correct, it'd be a pretty obvious optimization to only WAL-log the modified part.

Oh, I overlooked that. I wrote code in ginRedoInsertData which finds the appropriate place for the data, but didn't notice that I still wrote the whole page to the xlog record.

Yeah. I think ginRedoInsertData was too cute to be bug-free. I added an explicit unmodifiedsize field to the WAL record, so that ginRedoInsertData doesn't need to calculate it.

* When new items are appended to the end of the page so that only the last existing compressed segment is re-encoded, and the page has to be split, the items aren't divided 50/50 between the pages. The loop that moves segments destined for the left page to the right won't move any existing, untouched, segments.

I think this loop will move one segment. However, that's too little.

Implemented this. I noticed that the gin vacuum redo routine is dead code, except for the data-leaf page handling, because we never remove entries or internal nodes (page deletion is a separate WAL record type). And the data-leaf case is functionally equivalent to heap newpage records. I removed the dead code and made it clearer that it resembles heap newpage.

Attached is yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy testing of this patch now. If you could re-run the tests you've been using, that would be great.
I've tested the WAL replay by replicating GIN operations over streaming replication. That doesn't guarantee it's correct, but it's a good smoke test. I tried my test suite, but it hangs in an infinite loop during an index scan. I re-tried it on my laptop with -O0. I found it crashing on update and vacuum in seemingly random places like: Assert(GinPageIsData(page)) in xlogVacuumPage, and Assert(ndecoded == totalpacked) in ginCompressPostingList. Trying to debug it. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On Tue, Jan 21, 2014 at 4:28 PM, Alexander Korotkov aekorot...@gmail.com wrote: I noticed that the gin vacuum redo routine is dead code, except for the data-leaf page handling, because we never remove entries or internal nodes (page deletion is a separate wal record type). And the data-leaf case is functionally equivalent to heap newpage records. I removed the dead code and made it more clear that it resembles heap newpage. Attached is yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy testing of this patch now. If you could re-run the tests you've been using, that would be great. I've tested the WAL replay by replicating GIN operations over streaming replication. That doesn't guarantee it's correct, but it's a good smoke test. I tried my test suite, but it hangs in an infinite loop during an index scan. I re-tried it on my laptop with -O0. I found it crashing on update and vacuum in seemingly random places like: Assert(GinPageIsData(page)) in xlogVacuumPage, and Assert(ndecoded == totalpacked) in ginCompressPostingList. Trying to debug it. Another question is about dataPlaceToPageLeaf:

while ((Pointer) seg < segend)
{
    if (ginCompareItemPointers(minNewItem, seg->first) < 0)
        break;

Shouldn't we adjust seg to the previous segment? If minNewItem is less than seg->first, we should insert it into the previous segment. -- With best regards, Alexander Korotkov.
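The adjustment Alexander is asking about - stepping back one segment when the new item sorts before the current segment's first item - can be sketched like this. This is a simplified model, not the patch's actual code: the Segment struct and the flattening of ItemPointerData to a uint64 are illustrative assumptions.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical simplified segment: identified by its lowest item pointer,
 * here flattened to a uint64 for illustration. */
typedef struct
{
    uint64_t    first;          /* lowest item pointer in this segment */
} Segment;

/*
 * Find the segment into which 'newitem' should be inserted.  The scan
 * stops at the first segment whose 'first' exceeds the new item and then
 * steps back to the previous segment -- the adjustment under discussion.
 * Returns the index of the target segment.
 */
static int
find_insertion_segment(const Segment *segs, int nsegs, uint64_t newitem)
{
    int         prev = 0;
    int         i;

    for (i = 0; i < nsegs; i++)
    {
        if (newitem < segs[i].first)
            break;              /* belongs in the previous segment */
        prev = i;
    }
    return prev;
}
```

If the new item precedes every segment's first item, the sketch falls back to segment 0, which is where such an item would have to be placed.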
Re: [HACKERS] GIN improvements part 1: additional information
On 01/21/2014 11:35 AM, Heikki Linnakangas wrote: On 01/21/2014 04:02 AM, Tomas Vondra wrote: On 20.1.2014 19:30, Heikki Linnakangas wrote: Attached is yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy testing of this patch now. If you could re-run the tests you've been using, that would be great. I've tested the WAL replay by replicating GIN operations over streaming replication. That doesn't guarantee it's correct, but it's a good smoke test. I gave it a try - the OOM error seems to be gone, but now I get this PANIC: cannot insert duplicate items to GIN index page This only happens when building the index incrementally (i.e. using a sequence of INSERT statements into a table with a GIN index). When I create a new index on a table (already containing the same dataset) it works just fine. Also, I tried to reproduce the issue by running a simple plpgsql loop (instead of a complex python script): DO LANGUAGE plpgsql $$ DECLARE r tsvector; BEGIN FOR r IN SELECT body_tsvector FROM data_table LOOP INSERT INTO idx_table (body_tsvector) VALUES (r); END LOOP; END$$; where data_table is the table with imported data (the same data I mentioned in the post about OOM errors), and idx_table is an empty table with a GIN index. And indeed it fails, but only if I run the block in multiple sessions in parallel. Oh, I see what's going on. I had assumed that there cannot be duplicate insertions into the posting tree, but that's dead wrong. The fast insertion mechanism depends on duplicate insertions doing nothing. Ok, this turned out to be a much bigger change than I thought... In principle, it's not difficult to eliminate duplicates. However, to detect a duplicate, you have to check if the item is present in the uncompressed items array, or in the compressed lists. That requires decoding the segment where it should be. But if we decode the segment, what's the purpose of the uncompressed items array? 
Its purpose was to speed up insertions, by buffering them so that we don't need to decode and re-encode the page/segment on every inserted item. But if we're paying the price of decoding it anyway, we might as well include the new item and re-encode the segment. The uncompressed area saves some effort in WAL-logging, as the record of inserting an entry into the uncompressed area is much smaller than that of re-encoding part of the page, but if that really is a concern, we could track more carefully which parts of the page are modified, and only WAL-log the required parts. And hopefully, the fast-update lists buffer inserts so that you insert many items at a time into the posting tree, and the price of re-encoding is only paid once. So, now that I think about this once more, I don't think the uncompressed area in every leaf page is a good idea. I refactored the way the recompression routine works again. It is now more clearly a multi-step process. First, the existing page is disassembled into an in-memory struct, which is a linked list of all the segments. In memory, each segment can be represented as an array of item pointers, or in the compressed format. In the next phase, all the new items are added to the segments where they belong. This naturally can lead to overly large segments; in the third phase, the items are redistributed among the segments, and compressed back to the compressed format. Finally, the finished segments are written back to the page, or pages if it had to be split. The same subroutines to disassemble and recompress are also used in vacuum. Attached is what I've got now. This is again quite heavily changed from the previous version - sorry for repeatedly rewriting this. I'll continue testing and re-reviewing this tomorrow. 
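The multi-step recompression Heikki describes can be sketched in miniature: (1) the page's segments are assumed already disassembled into a flat item array, (2) the new items are added and the whole set sorted, (3) the items are redistributed into fixed-capacity segments, which (4) would then be re-encoded and written back. Everything here is an illustrative assumption - the fixed SEG_MAX_ITEMS cap, the uint64 stand-in for ItemPointerData, and the function names are not from the patch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define SEG_MAX_ITEMS 4         /* assumed per-segment cap, for illustration */

static int
cmp_u64(const void *pa, const void *pb)
{
    uint64_t    a = *(const uint64_t *) pa;
    uint64_t    b = *(const uint64_t *) pb;

    return (a < b) ? -1 : (a > b) ? 1 : 0;
}

/*
 * Recompression sketch: merge the old (disassembled) items with the new
 * ones, sort, then redistribute into fixed-capacity segments.  Returns
 * the number of segments produced; each would then be re-encoded.
 */
static int
recompress(const uint64_t *old, int nold,
           const uint64_t *newitems, int nnew,
           uint64_t segs[][SEG_MAX_ITEMS], int *seglens)
{
    uint64_t    all[64];
    int         n,
                nsegs = 0,
                i;

    memcpy(all, old, nold * sizeof(uint64_t));
    memcpy(all + nold, newitems, nnew * sizeof(uint64_t));
    n = nold + nnew;
    qsort(all, n, sizeof(uint64_t), cmp_u64);

    for (i = 0; i < n; i += SEG_MAX_ITEMS)
    {
        int         len = (n - i < SEG_MAX_ITEMS) ? n - i : SEG_MAX_ITEMS;

        memcpy(segs[nsegs], &all[i], len * sizeof(uint64_t));
        seglens[nsegs++] = len;
    }
    return nsegs;
}
```

The redistribution step is what keeps segments from growing without bound when many new items land in the same segment, and it also makes a page split straightforward: the finished segments can be divided between the two pages.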
- Heikki gin-packed-postinglists-20140121.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On 21.1.2014 22:21, Heikki Linnakangas wrote: On 01/21/2014 11:35 AM, Heikki Linnakangas wrote: On 01/21/2014 04:02 AM, Tomas Vondra wrote: On 20.1.2014 19:30, Heikki Linnakangas wrote: Attached is yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy testing of this patch now. If you could re-run the tests you've been using, that would be great. I've tested the WAL replay by replicating GIN operations over streaming replication. That doesn't guarantee it's correct, but it's a good smoke test. I gave it a try - the OOM error seems to be gone, but now I get this PANIC: cannot insert duplicate items to GIN index page This only happens when building the index incrementally (i.e. using a sequence of INSERT statements into a table with a GIN index). When I create a new index on a table (already containing the same dataset) it works just fine. Also, I tried to reproduce the issue by running a simple plpgsql loop (instead of a complex python script): DO LANGUAGE plpgsql $$ DECLARE r tsvector; BEGIN FOR r IN SELECT body_tsvector FROM data_table LOOP INSERT INTO idx_table (body_tsvector) VALUES (r); END LOOP; END$$; where data_table is the table with imported data (the same data I mentioned in the post about OOM errors), and idx_table is an empty table with a GIN index. And indeed it fails, but only if I run the block in multiple sessions in parallel. Oh, I see what's going on. I had assumed that there cannot be duplicate insertions into the posting tree, but that's dead wrong. The fast insertion mechanism depends on duplicate insertions doing nothing. Ok, this turned out to be a much bigger change than I thought... In principle, it's not difficult to eliminate duplicates. However, to detect a duplicate, you have to check if the item is present in the uncompressed items array, or in the compressed lists. That requires decoding the segment where it should be. 
But if we decode the segment, what's the purpose of the uncompressed items array? Its purpose was to speed up insertions, by buffering them so that we don't need to decode and re-encode the page/segment on every inserted item. But if we're paying the price of decoding it anyway, we might as well include the new item and re-encode the segment. The uncompressed area saves some effort in WAL-logging, as the record of inserting an entry into the uncompressed area is much smaller than that of re-encoding part of the page, but if that really is a concern, we could track more carefully which parts of the page are modified, and only WAL-log the required parts. And hopefully, the fast-update lists buffer inserts so that you insert many items at a time into the posting tree, and the price of re-encoding is only paid once. So, now that I think about this once more, I don't think the uncompressed area in every leaf page is a good idea. I refactored the way the recompression routine works again. It is now more clearly a multi-step process. First, the existing page is disassembled into an in-memory struct, which is a linked list of all the segments. In memory, each segment can be represented as an array of item pointers, or in the compressed format. In the next phase, all the new items are added to the segments where they belong. This naturally can lead to overly large segments; in the third phase, the items are redistributed among the segments, and compressed back to the compressed format. Finally, the finished segments are written back to the page, or pages if it had to be split. The same subroutines to disassemble and recompress are also used in vacuum. Attached is what I've got now. This is again quite heavily changed from the previous version - sorry for repeatedly rewriting this. I'll continue testing and re-reviewing this tomorrow. I've repeated the tests, and it actually finishes OK with this patch. 
The incremental load works OK, does not fail because of OOM or PANIC errors, or any other issues. Subsequent querying seems to work fine too. The sizes (in MB) before / after applying the patch look like this:

  object | raw size (old) | raw size (new) |  diff
---------+----------------+----------------+--------
 table   |           6093 |           6093 |   0.0%
 index A |           2288 |           2035 | -11.0%
 index B |             96 |             96 |   0.0%

and after running VACUUM FULL

  object | vacuumed (old) | vacuumed (new) |  diff
---------+----------------+----------------+--------
 table   |           6416 |           6416 |   0.0%
 index A |            678 |            415 | -38.8%
 index B |             42 |             20 | -52.4%

The results are slightly different than on my previous runs - for example in my post from 2013/10/12 I posted this (after vacuuming) | HEAD | patched | subject
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/21/2014 11:35 AM, Heikki Linnakangas wrote: On 01/21/2014 04:02 AM, Tomas Vondra wrote: On 20.1.2014 19:30, Heikki Linnakangas wrote: Attached is yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy testing of this patch now. If you could re-run the tests you've been using, that would be great. I've tested the WAL replay by replicating GIN operations over streaming replication. That doesn't guarantee it's correct, but it's a good smoke test. I gave it a try - the OOM error seems to be gone, but now I get this PANIC: cannot insert duplicate items to GIN index page This only happens when building the index incrementally (i.e. using a sequence of INSERT statements into a table with a GIN index). When I create a new index on a table (already containing the same dataset) it works just fine. Also, I tried to reproduce the issue by running a simple plpgsql loop (instead of a complex python script): DO LANGUAGE plpgsql $$ DECLARE r tsvector; BEGIN FOR r IN SELECT body_tsvector FROM data_table LOOP INSERT INTO idx_table (body_tsvector) VALUES (r); END LOOP; END$$; where data_table is the table with imported data (the same data I mentioned in the post about OOM errors), and idx_table is an empty table with a GIN index. And indeed it fails, but only if I run the block in multiple sessions in parallel. Oh, I see what's going on. I had assumed that there cannot be duplicate insertions into the posting tree, but that's dead wrong. The fast insertion mechanism depends on duplicate insertions doing nothing. Ok, this turned out to be a much bigger change than I thought... In principle, it's not difficult to eliminate duplicates. However, to detect a duplicate, you have to check if the item is present in the uncompressed items array, or in the compressed lists. That requires decoding the segment where it should be. 
But if we decode the segment, what's the purpose of the uncompressed items array? Its purpose was to speed up insertions, by buffering them so that we don't need to decode and re-encode the page/segment on every inserted item. But if we're paying the price of decoding it anyway, we might as well include the new item and re-encode the segment. The uncompressed area saves some effort in WAL-logging, as the record of inserting an entry into the uncompressed area is much smaller than that of re-encoding part of the page, but if that really is a concern, we could track more carefully which parts of the page are modified, and only WAL-log the required parts. And hopefully, the fast-update lists buffer inserts so that you insert many items at a time into the posting tree, and the price of re-encoding is only paid once. So, now that I think about this once more, I don't think the uncompressed area in every leaf page is a good idea. I didn't get why we can't allow insertion of duplicate TIDs into the uncompressed area and eliminate them during re-encoding. It would be somewhat more work during updates, more work during scan, more WAL records. But is it really significant for real-life workloads? I refactored the way the recompression routine works again. It is now more clearly a multi-step process. First, the existing page is disassembled into an in-memory struct, which is a linked list of all the segments. In memory, each segment can be represented as an array of item pointers, or in the compressed format. In the next phase, all the new items are added to the segments where they belong. This naturally can lead to overly large segments; in the third phase, the items are redistributed among the segments, and compressed back to the compressed format. Finally, the finished segments are written back to the page, or pages if it had to be split. The same subroutines to disassemble and recompress are also used in vacuum. Attached is what I've got now. 
This is again quite heavily changed from the previous version - sorry for repeatedly rewriting this. I'll continue testing and re-reviewing this tomorrow. Let's clarify where we are. We had a well-debugged and tested version with hard-wired varbyte encoding. Now we're reimplementing and debugging a segmented version of varbyte encoding, in the hope that one day we can easily replace it with something better that meets our restrictions (but we haven't found such an encoding yet). Is that right? -- With best regards, Alexander Korotkov.
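For reference, the hard-wired varbyte encoding under discussion works roughly like this: item pointers are stored as deltas from the previous value, and each delta is split into 7-bit groups with the high bit of each byte acting as a continuation flag. A self-contained sketch, flattening ItemPointerData to a uint64 for illustration (this is not the patch's actual code):

```c
#include <stdint.h>
#include <assert.h>

/*
 * Encode a sorted list of item pointers (flattened to uint64) as
 * delta-compressed varbytes.  Returns the number of bytes written.
 */
static int
encode_varbyte_list(const uint64_t *items, int nitems, unsigned char *buf)
{
    uint64_t    prev = 0;
    int         pos = 0,
                i;

    for (i = 0; i < nitems; i++)
    {
        uint64_t    delta = items[i] - prev;

        prev = items[i];
        do
        {
            unsigned char b = delta & 0x7F;

            delta >>= 7;
            if (delta)
                b |= 0x80;      /* continuation bit: more groups follow */
            buf[pos++] = b;
        } while (delta);
    }
    return pos;
}

/* Decode back into item pointers; returns the number of items decoded. */
static int
decode_varbyte_list(const unsigned char *buf, int len, uint64_t *items)
{
    uint64_t    prev = 0,
                delta = 0;
    int         shift = 0,
                n = 0,
                pos;

    for (pos = 0; pos < len; pos++)
    {
        delta |= (uint64_t) (buf[pos] & 0x7F) << shift;
        if (buf[pos] & 0x80)
            shift += 7;
        else
        {
            prev += delta;      /* undo the delta compression */
            items[n++] = prev;
            delta = 0;
            shift = 0;
        }
    }
    return n;
}
```

Because deltas within a posting list are usually small, most items compress to one or two bytes; the segmented variant applies the same scheme per segment so that an insert only has to re-encode the segments from the modified point onward.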
Re: [HACKERS] GIN improvements part 1: additional information
On 01/17/2014 08:49 PM, Alexander Korotkov wrote: On Fri, Jan 17, 2014 at 10:38 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/17/2014 01:05 PM, Alexander Korotkov wrote: Seems to be fixed in the attached version of the patch. Another improvement in this version: only the left-most modified segment and those after it are re-encoded. The left part of the page is left untouched. I'm looking into this now. A few quick notes: * Even when you don't re-encode the whole page, you still WAL-log the whole page. While correct, it'd be a pretty obvious optimization to only WAL-log the modified part. Oh, I overlooked it. I wrote code in ginRedoInsertData which finds the appropriate place for the data, but didn't notice that I still wrote the whole page to the xlog record. Yeah. I think ginRedoInsertData was too cute to be bug-free. I added an explicit unmodifiedsize field to the WAL record, so that ginRedoInsertData doesn't need to calculate it. * When new items are appended to the end of the page so that only the last existing compressed segment is re-encoded, and the page has to be split, the items aren't divided 50/50 on the pages. The loop that moves segments destined for the left page to the right won't move any existing, untouched, segments. I think this loop will move one segment. However, it's too small. Implemented this. I noticed that the gin vacuum redo routine is dead code, except for the data-leaf page handling, because we never remove entries or internal nodes (page deletion is a separate wal record type). And the data-leaf case is functionally equivalent to heap newpage records. I removed the dead code and made it more clear that it resembles heap newpage. Attached is yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy testing of this patch now. If you could re-run the tests you've been using, that would be great. I've tested the WAL replay by replicating GIN operations over streaming replication. 
That doesn't guarantee it's correct, but it's a good smoke test. - Heikki gin-packed-postinglists-20140120.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] GIN improvements part 1: additional information
On 20.1.2014 19:30, Heikki Linnakangas wrote: Attached is yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy testing of this patch now. If you could re-run the tests you've been using, that would be great. I've tested the WAL replay by replicating GIN operations over streaming replication. That doesn't guarantee it's correct, but it's a good smoke test. I gave it a try - the OOM error seems to be gone, but now I get this PANIC: cannot insert duplicate items to GIN index page This only happens when building the index incrementally (i.e. using a sequence of INSERT statements into a table with a GIN index). When I create a new index on a table (already containing the same dataset) it works just fine. Also, I tried to reproduce the issue by running a simple plpgsql loop (instead of a complex python script): DO LANGUAGE plpgsql $$ DECLARE r tsvector; BEGIN FOR r IN SELECT body_tsvector FROM data_table LOOP INSERT INTO idx_table (body_tsvector) VALUES (r); END LOOP; END$$; where data_table is the table with imported data (the same data I mentioned in the post about OOM errors), and idx_table is an empty table with a GIN index. And indeed it fails, but only if I run the block in multiple sessions in parallel. regards Tomas
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 15, 2014 at 10:47 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Wed, Jan 15, 2014 at 5:17 AM, Tomas Vondra t...@fuzzy.cz wrote: On 14.1.2014 00:38, Tomas Vondra wrote: On 13.1.2014 18:07, Alexander Korotkov wrote: On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra t...@fuzzy.cz wrote: On 8.1.2014 22:58, Alexander Korotkov wrote: Thanks for reporting. Fixed version is attached. I've tried to rerun the 'archie' benchmark with the current patch, and once again I got PANIC: could not split GIN page, didn't fit I reran it with '--enable-cassert' and with that I got TRAP: FailedAssertion(!(ginCompareItemPointers(items[i - 1], items[i]) < 0), File: gindatapage.c, Line: 149) LOG: server process (PID 5364) was terminated by signal 6: Aborted DETAIL: Failed process was running: INSERT INTO messages ... so the assert in GinDataLeafPageGetUncompressed fails for some reason. I can easily reproduce it, but my knowledge in this area is rather limited so I'm not entirely sure what to look for. I've fixed this bug and many other bugs. Now the patch passes the test suite that I've used earlier. The results are as follows: OK, it seems the bug is gone. However now there's a memory leak somewhere. I'm loading pgsql mailing list archives (~600k messages) using this script https://bitbucket.org/tvondra/archie/src/1bbeb920/bin/load.py And after loading about 1/5 of the data, all the memory gets filled by the pgsql backends (loading the data in parallel) and the DB gets killed by the OOM killer. I've spent a fair amount of time trying to locate the memory leak, but so far no luck. I'm not sufficiently familiar with the GIN code. I can however demonstrate that it's there, and I have a rather simple test case to reproduce it - basically just a CREATE INDEX on a table with ~1M email message bodies (in a tsvector column). 
The data is available here (360MB compressed, 1GB raw): http://www.fuzzy.cz/tmp/message-b.data.gz Simply create a single-column table, load the data and create the index: CREATE TABLE test ( body_tsvector TSVECTOR ); COPY test FROM '/tmp/message-b.data'; CREATE INDEX test_idx ON test USING gin (body_tsvector); I'm running this on a machine with 8GB of RAM, with these settings: shared_buffers=1GB maintenance_work_mem=1GB According to top, CREATE INDEX from the current HEAD never consumes more than ~25% of RAM: PID USER PR NI VIRT RES SHR %CPU %MEM COMMAND 32091 tomas 20 0 2026032 1,817g 1,040g 56,2 23,8 postgres which is about right, as (shared_buffers + maintenance_work_mem) is about 1/4 of RAM. With the v5 patch version applied, the CREATE INDEX process eventually goes crazy and allocates almost all the available memory (but sometimes finishes, mostly by pure luck). This is what I was able to get from top PID USER PR NI VIRT RES SHR S %CPU %MEM COMMAND 14090 tomas 20 0 7913820 6,962g 955036 D 4,3 91,1 postgres while the system was still reasonably responsive. Thanks a lot for your help! I believe the problem is that each decompressed item pointers array is palloc'd but not freed. I hope to fix it today. Seems to be fixed in the attached version of the patch. Another improvement in this version: only the left-most modified segment and those after it are re-encoded. The left part of the page is left untouched. -- With best regards, Alexander Korotkov. gin-packed-postinglists-varbyte6.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] GIN improvements part 1: additional information
On 01/17/2014 01:05 PM, Alexander Korotkov wrote: Seems to be fixed in the attached version of the patch. Another improvement in this version: only the left-most modified segment and those after it are re-encoded. The left part of the page is left untouched. I'm looking into this now. A few quick notes: * Even when you don't re-encode the whole page, you still WAL-log the whole page. While correct, it'd be a pretty obvious optimization to only WAL-log the modified part. * When new items are appended to the end of the page so that only the last existing compressed segment is re-encoded, and the page has to be split, the items aren't divided 50/50 on the pages. The loop that moves segments destined for the left page to the right won't move any existing, untouched, segments.

! /*
!  * Didn't fit uncompressed. We'll have to encode them. Check if both
!  * new items and uncompressed items can be placed starting from last
!  * segment of page. Then re-encode only last segment of page.
!  */
! minNewItem = newItems[0];
! if (nolduncompressed == 0 &&
!     ginCompareItemPointers(olduncompressed[0], minNewItem) < 0)
!     minNewItem = olduncompressed[0];

That looks wrong. If I'm understanding it right, it's trying to do minNewItem = Min(newItems[0], olduncompressed[0]). The test should be nolduncompressed > 0 ... No need to send a new patch, I'll just fix those while reviewing... - Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Jan 17, 2014 at 10:38 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/17/2014 01:05 PM, Alexander Korotkov wrote: Seems to be fixed in the attached version of the patch. Another improvement in this version: only the left-most modified segment and those after it are re-encoded. The left part of the page is left untouched. I'm looking into this now. A few quick notes: * Even when you don't re-encode the whole page, you still WAL-log the whole page. While correct, it'd be a pretty obvious optimization to only WAL-log the modified part. Oh, I overlooked it. I wrote code in ginRedoInsertData which finds the appropriate place for the data, but didn't notice that I still wrote the whole page to the xlog record. * When new items are appended to the end of the page so that only the last existing compressed segment is re-encoded, and the page has to be split, the items aren't divided 50/50 on the pages. The loop that moves segments destined for the left page to the right won't move any existing, untouched, segments. I think this loop will move one segment. However, it's too small.

! /*
!  * Didn't fit uncompressed. We'll have to encode them. Check if both
!  * new items and uncompressed items can be placed starting from last
!  * segment of page. Then re-encode only last segment of page.
!  */
! minNewItem = newItems[0];
! if (nolduncompressed == 0 &&
!     ginCompareItemPointers(olduncompressed[0], minNewItem) < 0)
!     minNewItem = olduncompressed[0];

That looks wrong. If I'm understanding it right, it's trying to do minNewItem = Min(newItems[0], olduncompressed[0]). The test should be nolduncompressed > 0 ... Yes, definitely a bug. No need to send a new patch, I'll just fix those while reviewing... Ok, thanks! -- With best regards, Alexander Korotkov.
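The corrected logic amounts to Min(newItems[0], olduncompressed[0]), guarded by a non-empty check on the uncompressed array. A standalone sketch of the fixed test - the names mirror the snippet above, but the scalar comparator and uint64 item pointers are illustrative stand-ins, not the patch's actual types:

```c
#include <stdint.h>
#include <assert.h>

/* Flattened stand-in for an ItemPointerData comparison. */
static int
cmp_item(uint64_t a, uint64_t b)
{
    return (a < b) ? -1 : (a > b) ? 1 : 0;
}

/*
 * Corrected computation of minNewItem: only consult olduncompressed[0]
 * when the array is non-empty (nolduncompressed > 0, not == 0 as in the
 * buggy version, which would read past an empty array).
 */
static uint64_t
min_new_item(const uint64_t *newitems,
             const uint64_t *olduncompressed, int nolduncompressed)
{
    uint64_t    min = newitems[0];

    if (nolduncompressed > 0 && cmp_item(olduncompressed[0], min) < 0)
        min = olduncompressed[0];
    return min;
}
```

With the original `== 0` test, a non-empty uncompressed array would never be considered, so minNewItem could end up too large and the "re-encode only the last segment" fast path could start from the wrong segment.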
Re: [HACKERS] GIN improvements part 1: additional information
On 01/13/2014 07:07 PM, Alexander Korotkov wrote: I've fixed this bug and many other bugs. Now the patch passes the test suite that I've used earlier. The results are as follows:

Operations time:

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:47.53915
 index_build_recovery  | 00:00:04
 index_update          | 00:05:24.388163
 index_update_recovery | 00:00:53
 search_new            | 00:24:02.289384
 search_updated        | 00:27:09.193343
(6 rows)

Index sizes:

     label     |   size
---------------+-----------
 new           | 384761856
 after_updates | 667942912
(2 rows)

Also, I made the following changes in the algorithms: - Now there is a limit on the number of uncompressed TIDs in the page. After reaching this limit, they are encoded regardless of whether they can fit on the page. That seems to me more desirable behaviour, and somehow it accelerates search speed. Before this change the times were the following:

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:51.467888
 index_build_recovery  | 00:00:04
 index_update          | 00:05:03.315155
 index_update_recovery | 00:00:51
 search_new            | 00:24:43.194882
 search_updated        | 00:28:36.316784
(6 rows)

Hmm, that's strange. Any idea why that's happening? One theory is that when you re-encode the pages more aggressively, there are fewer pages with a mix of packed and unpacked items. Mixed pages are somewhat slower to scan than fully packed or fully unpacked pages, because GinDataLeafPageGetItems() has to merge the packed and unpacked items into a single list. But I wouldn't expect that effect to be large enough to explain the results you got. - Pages are not fully re-encoded if it's enough to re-encode just the last segment. Great! We should also take advantage of that in the WAL record that's written; no point in WAL-logging all the segments, if we know that only the last one was modified. - Heikki
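The merge that GinDataLeafPageGetItems() has to do on mixed pages is essentially a two-way merge of two sorted item-pointer lists. A sketch of that step - with duplicate elimination, which matters here because the fast-update path can deliver the same TID twice; the uint64 item pointers and function name are illustrative assumptions, not the actual implementation:

```c
#include <stdint.h>
#include <assert.h>

/*
 * Merge two sorted lists of item pointers (flattened to uint64) into
 * 'out', dropping duplicates.  Returns the merged length.  This is the
 * extra per-scan work a mixed (packed + unpacked) page pays compared to
 * a fully packed or fully unpacked one.
 */
static int
merge_item_lists(const uint64_t *a, int na,
                 const uint64_t *b, int nb, uint64_t *out)
{
    int         i = 0,
                j = 0,
                n = 0;

    while (i < na || j < nb)
    {
        uint64_t    next;

        if (j >= nb || (i < na && a[i] <= b[j]))
            next = a[i++];      /* take from the packed list */
        else
            next = b[j++];      /* take from the unpacked list */
        if (n == 0 || out[n - 1] != next)   /* drop duplicates */
            out[n++] = next;
    }
    return n;
}
```

The merge itself is linear, so by Heikki's reasoning the per-page overhead should be modest; the measured speedup is more plausibly explained by there simply being fewer mixed pages (and less work in ginMergeItemPointers, as Alexander suggests below).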
Re: [HACKERS] GIN improvements part 1: additional information
On Tue, Jan 14, 2014 at 12:34 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/13/2014 07:07 PM, Alexander Korotkov wrote: I've fixed this bug and many other bugs. Now the patch passes the test suite that I've used earlier. The results are as follows:

Operations time:

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:47.53915
 index_build_recovery  | 00:00:04
 index_update          | 00:05:24.388163
 index_update_recovery | 00:00:53
 search_new            | 00:24:02.289384
 search_updated        | 00:27:09.193343
(6 rows)

Index sizes:

     label     |   size
---------------+-----------
 new           | 384761856
 after_updates | 667942912
(2 rows)

Also, I made the following changes in the algorithms: - Now there is a limit on the number of uncompressed TIDs in the page. After reaching this limit, they are encoded regardless of whether they can fit on the page. That seems to me more desirable behaviour, and somehow it accelerates search speed. Before this change the times were the following:

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:51.467888
 index_build_recovery  | 00:00:04
 index_update          | 00:05:03.315155
 index_update_recovery | 00:00:51
 search_new            | 00:24:43.194882
 search_updated        | 00:28:36.316784
(6 rows)

Hmm, that's strange. Any idea why that's happening? One theory is that when you re-encode the pages more aggressively, there are fewer pages with a mix of packed and unpacked items. Mixed pages are somewhat slower to scan than fully packed or fully unpacked pages, because GinDataLeafPageGetItems() has to merge the packed and unpacked items into a single list. But I wouldn't expect that effect to be large enough to explain the results you got. Probably, it's because of less work in ginMergeItemPointers. - Pages are not fully re-encoded if it's enough to re-encode just the last segment. Great! We should also take advantage of that in the WAL record that's written; no point in WAL-logging all the segments, if we know that only the last one was modified. Already done. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 14.1.2014 00:38, Tomas Vondra wrote: On 13.1.2014 18:07, Alexander Korotkov wrote: On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra t...@fuzzy.cz wrote: On 8.1.2014 22:58, Alexander Korotkov wrote: Thanks for reporting. Fixed version is attached. I've tried to rerun the 'archie' benchmark with the current patch, and once again I got PANIC: could not split GIN page, didn't fit I reran it with '--enable-cassert' and with that I got TRAP: FailedAssertion(!(ginCompareItemPointers(items[i - 1], items[i]) < 0), File: gindatapage.c, Line: 149) LOG: server process (PID 5364) was terminated by signal 6: Aborted DETAIL: Failed process was running: INSERT INTO messages ... so the assert in GinDataLeafPageGetUncompressed fails for some reason. I can easily reproduce it, but my knowledge in this area is rather limited so I'm not entirely sure what to look for. I've fixed this bug and many other bugs. Now the patch passes the test suite that I've used earlier. The results are as follows: OK, it seems the bug is gone. However now there's a memory leak somewhere. I'm loading pgsql mailing list archives (~600k messages) using this script https://bitbucket.org/tvondra/archie/src/1bbeb920/bin/load.py And after loading about 1/5 of the data, all the memory gets filled by the pgsql backends (loading the data in parallel) and the DB gets killed by the OOM killer. I've spent a fair amount of time trying to locate the memory leak, but so far no luck. I'm not sufficiently familiar with the GIN code. I can however demonstrate that it's there, and I have a rather simple test case to reproduce it - basically just a CREATE INDEX on a table with ~1M email message bodies (in a tsvector column). 
The data is available here (360MB compressed, 1GB raw): http://www.fuzzy.cz/tmp/message-b.data.gz Simply create a single-column table, load the data and create the index: CREATE TABLE test ( body_tsvector TSVECTOR ); COPY test FROM '/tmp/message-b.data'; CREATE INDEX test_idx ON test USING gin (body_tsvector); I'm running this on a machine with 8GB of RAM, with these settings: shared_buffers=1GB maintenance_work_mem=1GB According to top, CREATE INDEX from the current HEAD never consumes more than ~25% of RAM: PID USER PR NI VIRT RES SHR %CPU %MEM COMMAND 32091 tomas 20 0 2026032 1,817g 1,040g 56,2 23,8 postgres which is about right, as (shared_buffers + maintenance_work_mem) is about 1/4 of RAM. With the v5 patch version applied, the CREATE INDEX process eventually goes crazy and allocates almost all the available memory (but sometimes finishes, mostly by pure luck). This is what I was able to get from top PID USER PR NI VIRT RES SHR S %CPU %MEM COMMAND 14090 tomas 20 0 7913820 6,962g 955036 D 4,3 91,1 postgres while the system was still reasonably responsive. regards Tomas
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 15, 2014 at 5:17 AM, Tomas Vondra t...@fuzzy.cz wrote:
On 14.1.2014 00:38, Tomas Vondra wrote:
On 13.1.2014 18:07, Alexander Korotkov wrote:
On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra t...@fuzzy.cz wrote:
On 8.1.2014 22:58, Alexander Korotkov wrote: Thanks for reporting. Fixed version is attached.

I've tried to rerun the 'archie' benchmark with the current patch, and once again I got PANIC: could not split GIN page, didn't fit. I reran it with '--enable-cassert' and with that I got

TRAP: FailedAssertion(!(ginCompareItemPointers(items[i - 1], items[i]) < 0), File: gindatapage.c, Line: 149)
LOG: server process (PID 5364) was terminated by signal 6: Aborted
DETAIL: Failed process was running: INSERT INTO messages ...

so the assert in GinDataLeafPageGetUncompressed fails for some reason. I can easily reproduce it, but my knowledge in this area is rather limited so I'm not entirely sure what to look for.

I've fixed this bug and many other bugs. Now the patch passes the test suite that I've used earlier. The results are as follows:

OK, it seems the bug is gone. However, now there's a memory leak somewhere. I'm loading the pgsql mailing list archives (~600k messages) using this script: https://bitbucket.org/tvondra/archie/src/1bbeb920/bin/load.py

And after loading about 1/5 of the data, all the memory gets filled by the pgsql backends (loading the data in parallel) and the DB gets killed by the OOM killer. I've spent a fair amount of time trying to locate the memory leak, but so far no luck. I'm not sufficiently familiar with the GIN code. I can however demonstrate that it's there, and I have a rather simple test case to reproduce it - basically just a CREATE INDEX on a table with ~1M email message bodies (in a tsvector column).
The data is available here (360MB compressed, 1GB raw): http://www.fuzzy.cz/tmp/message-b.data.gz

Simply create a single-column table, load the data and create the index:

CREATE TABLE test ( body_tsvector TSVECTOR );
COPY test FROM '/tmp/message-b.data';
CREATE INDEX test_idx ON test USING gin ( body_tsvector );

I'm running this on a machine with 8GB of RAM, with these settings:

shared_buffers=1GB
maintenance_work_mem=1GB

According to top, CREATE INDEX from the current HEAD never consumes more than ~25% of RAM:

  PID USER   PR NI    VIRT    RES    SHR %CPU %MEM COMMAND
32091 tomas  20  0 2026032 1,817g 1,040g 56,2 23,8 postgres

which is about right, as (shared_buffers + maintenance_work_mem) is about 1/4 of RAM.

With the v5 patch version applied, the CREATE INDEX process eventually goes crazy and allocates almost all the available memory (but sometimes finishes, mostly by pure luck). This is what I was able to get from top

  PID USER   PR NI    VIRT    RES    SHR S %CPU %MEM COMMAND
14090 tomas  20  0 7913820 6,962g 955036 D  4,3 91,1 postgres

while the system was still reasonably responsive.

Thanks a lot for your help! I believe the problem is that each decompressed item pointer array is palloc'd but not freed. I hope to fix it today.

-- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra t...@fuzzy.cz wrote:
On 8.1.2014 22:58, Alexander Korotkov wrote: Thanks for reporting. Fixed version is attached.

I've tried to rerun the 'archie' benchmark with the current patch, and once again I got PANIC: could not split GIN page, didn't fit. I reran it with '--enable-cassert' and with that I got

TRAP: FailedAssertion(!(ginCompareItemPointers(items[i - 1], items[i]) < 0), File: gindatapage.c, Line: 149)
LOG: server process (PID 5364) was terminated by signal 6: Aborted
DETAIL: Failed process was running: INSERT INTO messages ...

so the assert in GinDataLeafPageGetUncompressed fails for some reason. I can easily reproduce it, but my knowledge in this area is rather limited so I'm not entirely sure what to look for.

I've fixed this bug and many other bugs. Now the patch passes the test suite that I've used earlier. The results are as follows.

Operations time:

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:47.53915
 index_build_recovery  | 00:00:04
 index_update          | 00:05:24.388163
 index_update_recovery | 00:00:53
 search_new            | 00:24:02.289384
 search_updated        | 00:27:09.193343
(6 rows)

Index sizes:

     label     |   size
---------------+-----------
 new           | 384761856
 after_updates | 667942912
(2 rows)

Also, I made the following changes in the algorithms:

- Now there is a limit on the number of uncompressed TIDs on a page. After reaching this limit, they are encoded regardless of whether they fit on the page. That seems to me more desirable behaviour, and it somewhat accelerates search. Before this change the times were the following:

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:51.467888
 index_build_recovery  | 00:00:04
 index_update          | 00:05:03.315155
 index_update_recovery | 00:00:51
 search_new            | 00:24:43.194882
 search_updated        | 00:28:36.316784
(6 rows)

- Pages are not fully re-encoded if it's enough to re-encode just the last segment.

README is updated.

-- With best regards, Alexander Korotkov.
gin-packed-postinglists-varbyte5.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] GIN improvements part 1: additional information
On 13.1.2014 18:07, Alexander Korotkov wrote:
On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra t...@fuzzy.cz wrote:
On 8.1.2014 22:58, Alexander Korotkov wrote: Thanks for reporting. Fixed version is attached.

I've tried to rerun the 'archie' benchmark with the current patch, and once again I got PANIC: could not split GIN page, didn't fit. I reran it with '--enable-cassert' and with that I got

TRAP: FailedAssertion(!(ginCompareItemPointers(items[i - 1], items[i]) < 0), File: gindatapage.c, Line: 149)
LOG: server process (PID 5364) was terminated by signal 6: Aborted
DETAIL: Failed process was running: INSERT INTO messages ...

so the assert in GinDataLeafPageGetUncompressed fails for some reason. I can easily reproduce it, but my knowledge in this area is rather limited so I'm not entirely sure what to look for.

I've fixed this bug and many other bugs. Now the patch passes the test suite that I've used earlier. The results are as follows:

OK, it seems the bug is gone. However, now there's a memory leak somewhere. I'm loading the pgsql mailing list archives (~600k messages) using this script: https://bitbucket.org/tvondra/archie/src/1bbeb920/bin/load.py

And after loading about 1/5 of the data, all the memory gets filled by the pgsql backends (loading the data in parallel) and the DB gets killed by the OOM killer.

Tomas
Re: [HACKERS] GIN improvements part 1: additional information
Peter Eisentraut wrote: On 12/10/13, 2:44 PM, Alexander Korotkov wrote: However, patch didn't apply to head. Corrected version is attached. Update the pgstattuple regression test results.

The latest version of the patch still doesn't pass the test. I'll look at the patch in further detail. Thanks,

Best regards, Etsuro Fujita
Re: [HACKERS] GIN improvements part 1: additional information
On 8.1.2014 22:58, Alexander Korotkov wrote: Thanks for reporting. Fixed version is attached.

I've tried to rerun the 'archie' benchmark with the current patch, and once again I got PANIC: could not split GIN page, didn't fit. I reran it with '--enable-cassert' and with that I got

TRAP: FailedAssertion(!(ginCompareItemPointers(items[i - 1], items[i]) < 0), File: gindatapage.c, Line: 149)
LOG: server process (PID 5364) was terminated by signal 6: Aborted
DETAIL: Failed process was running: INSERT INTO messages ...

so the assert in GinDataLeafPageGetUncompressed fails for some reason. I can easily reproduce it, but my knowledge in this area is rather limited so I'm not entirely sure what to look for.

regards
Tomas
Re: [HACKERS] GIN improvements part 1: additional information
On Mon, Jan 6, 2014 at 12:35 PM, Amit Langote amitlangot...@gmail.comwrote: On Sat, Dec 21, 2013 at 4:36 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Yet another version. The encoding/decoding code is now quite isolated in ginpostinglist.c, so it's easy to experiment with different encodings. This patch uses varbyte encoding again. I got a bit carried away, experimented with a bunch of different encodings. I tried rice encoding, rice encoding with block and offset number delta stored separately, the simple9 variant, and varbyte encoding. The compressed size obviously depends a lot on the distribution of the items, but in the test set I used, the differences between different encodings were quite small. One fatal problem with many encodings is VACUUM. If a page is completely full and you remove one item, the result must still fit. In other words, removing an item must never enlarge the space needed. Otherwise we have to be able to split on vacuum, which adds a lot of code, and also makes it possible for VACUUM to fail if there is no disk space left. That's unpleasant if you're trying to run VACUUM to release disk space. (gin fast updates already has that problem BTW, but let's not make it worse) I believe that eliminates all encodings in the Simple family, as well as PForDelta, and surprisingly also Rice encoding. For example, if you have three items in consecutive offsets, the differences between them are encoded as 11 in rice encoding. If you remove the middle item, the encoding for the next item becomes 010, which takes more space than the original. AFAICS varbyte encoding is safe from that. (a formal proof would be nice though). So, I'm happy to go with varbyte encoding now, indeed I don't think we have much choice unless someone can come up with an alternative that's VACUUM-safe. I have to put this patch aside for a while now, I spent a lot more time on these encoding experiments than I intended. 
If you could take a look at this latest version, spend some time reviewing it and cleaning up any obsolete comments, and re-run the performance tests you did earlier, that would be great. One thing I'm slightly worried about is the overhead of merging the compressed and uncompressed posting lists in a scan. This patch will be in good shape for the final commitfest, or even before that.

I just tried out the patch gin-packed-postinglists-varbyte2.patch (which looks like the latest one in this thread) as follows:

1) Applied the patch to HEAD (on commit 94b899b829657332bda856ac3f06153d09077bd1)

2) Created a test table and index:

create table test (a text);
copy test from '/usr/share/dict/words';
create index test_trgm_idx on test using gin (a gin_trgm_ops);

3) Got the following error on a wildcard query:

postgres=# explain (buffers, analyze) select count(*) from test where a like '%tio%';
ERROR: lock 9447 is not held
STATEMENT: explain (buffers, analyze) select count(*) from test where a like '%tio%';
ERROR: lock 9447 is not held

Thanks for reporting. Fixed version is attached.

-- With best regards, Alexander Korotkov.

gin-packed-postinglists-varbyte3.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] GIN improvements part 1: additional information
On Sat, Dec 21, 2013 at 4:36 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Yet another version. The encoding/decoding code is now quite isolated in ginpostinglist.c, so it's easy to experiment with different encodings. This patch uses varbyte encoding again. I got a bit carried away, experimented with a bunch of different encodings. I tried rice encoding, rice encoding with block and offset number delta stored separately, the simple9 variant, and varbyte encoding. The compressed size obviously depends a lot on the distribution of the items, but in the test set I used, the differences between different encodings were quite small. One fatal problem with many encodings is VACUUM. If a page is completely full and you remove one item, the result must still fit. In other words, removing an item must never enlarge the space needed. Otherwise we have to be able to split on vacuum, which adds a lot of code, and also makes it possible for VACUUM to fail if there is no disk space left. That's unpleasant if you're trying to run VACUUM to release disk space. (gin fast updates already has that problem BTW, but let's not make it worse) I believe that eliminates all encodings in the Simple family, as well as PForDelta, and surprisingly also Rice encoding. For example, if you have three items in consecutive offsets, the differences between them are encoded as 11 in rice encoding. If you remove the middle item, the encoding for the next item becomes 010, which takes more space than the original. AFAICS varbyte encoding is safe from that. (a formal proof would be nice though). So, I'm happy to go with varbyte encoding now, indeed I don't think we have much choice unless someone can come up with an alternative that's VACUUM-safe. I have to put this patch aside for a while now, I spent a lot more time on these encoding experiments than I intended. 
If you could take a look at this latest version, spend some time reviewing it and cleaning up any obsolete comments, and re-run the performance tests you did earlier, that would be great. One thing I'm slightly worried about is the overhead of merging the compressed and uncompressed posting lists in a scan. This patch will be in good shape for the final commitfest, or even before that.

I just tried out the patch gin-packed-postinglists-varbyte2.patch (which looks like the latest one in this thread) as follows:

1) Applied the patch to HEAD (on commit 94b899b829657332bda856ac3f06153d09077bd1)

2) Created a test table and index:

create table test (a text);
copy test from '/usr/share/dict/words';
create index test_trgm_idx on test using gin (a gin_trgm_ops);

3) Got the following error on a wildcard query:

postgres=# explain (buffers, analyze) select count(*) from test where a like '%tio%';
ERROR: lock 9447 is not held
STATEMENT: explain (buffers, analyze) select count(*) from test where a like '%tio%';
ERROR: lock 9447 is not held

Following is the bt:

#0 LWLockRelease (lockid=9447) at lwlock.c:747
#1 0x0074a356 in LockBuffer (buffer=4638, mode=0) at bufmgr.c:2760
#2 0x00749d6e in UnlockReleaseBuffer (buffer=4638) at bufmgr.c:2551
#3 0x00478bcc in entryGetNextItem (ginstate=0x2380100, entry=0x23832a8) at ginget.c:555
#4 0x00479346 in entryGetItem (ginstate=0x2380100, entry=0x23832a8) at ginget.c:695
#5 0x0047987e in scanGetItem (scan=0x237fee8, advancePast=0x7fffa1a46b80, item=0x7fffa1a46b80, recheck=0x7fffa1a46b7f \001) at ginget.c:925
#6 0x0047ae3f in gingetbitmap (fcinfo=0x7fffa1a46be0) at ginget.c:1540
#7 0x008a9486 in FunctionCall2Coll (flinfo=0x236f558, collation=0, arg1=37224168, arg2=37236160) at fmgr.c:1323
#8 0x004bd091 in index_getbitmap (scan=0x237fee8, bitmap=0x2382dc0) at indexam.c:649
#9 0x0064827c in MultiExecBitmapIndexScan (node=0x237fce0) at nodeBitmapIndexscan.c:89
#10 0x006313b6 in MultiExecProcNode (node=0x237fce0) at execProcnode.c:550
#11 0x0064713a in BitmapHeapNext (node=0x237e998) at nodeBitmapHeapscan.c:104
#12 0x0063c348 in ExecScanFetch (node=0x237e998, accessMtd=0x6470ac BitmapHeapNext, recheckMtd=0x647cbc BitmapHeapRecheck) at execScan.c:82
#13 0x0063c3bc in ExecScan (node=0x237e998, accessMtd=0x6470ac BitmapHeapNext, recheckMtd=0x647cbc BitmapHeapRecheck) at execScan.c:132
#14 0x00647d3a in ExecBitmapHeapScan (node=0x237e998) at nodeBitmapHeapscan.c:436
#15 0x00631121 in ExecProcNode (node=0x237e998) at execProcnode.c:414
#16 0x00644992 in agg_retrieve_direct (aggstate=0x237e200) at nodeAgg.c:1073
#17 0x006448cc in ExecAgg (node=0x237e200) at nodeAgg.c:1026
#18 0x00631247 in ExecProcNode (node=0x237e200) at execProcnode.c:476
#19 0x0062ef2a in ExecutePlan (estate=0x237e0e8, planstate=0x237e200, operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0xd0c360) at execMain.c:1474
#20 0x0062cfeb
Re: [HACKERS] GIN improvements part 1: additional information
On 12/10/13, 2:44 PM, Alexander Korotkov wrote: However, patch didn't apply to head. Corrected version is attached. Update the pgstattuple regression test results.
Re: [HACKERS] GIN improvements part 1: additional information
On 12/19/2013 10:44 AM, Heikki Linnakangas wrote:
On 12/19/2013 08:37 AM, Oleg Bartunov wrote: Guys, before digging deep into the art of comp/decomp world I'd like to know if you familiar with results of http://wwwconference.org/www2008/papers/pdf/p387-zhangA.pdf paper and some newer research ?

Yeah, I saw that paper.

Do we agree in what we really want ? Basically, there are three main features: size, compression, decompression speed - we should take two :)

According to that Zhang et al paper you linked, the Vbyte method actually performs the worst on all of those measures. The other algorithms are quite similar in terms of size (PForDelta being the most efficient), while PForDelta is significantly faster to compress/decompress. Just by looking at those numbers, PForDelta looks like a clear winner. However, it operates on much bigger batches than the other algorithms; I haven't looked at it in detail, but Zhang et al used 128 integer batches, and they say that 32 integers is the minimum batch size. If we want to use it for the inline posting lists stored in entry tuples, that would be quite wasteful if there are only a few item pointers on the tuple.

Also, in the tests I've run, the compression/decompression speed is not a significant factor in total performance, with either varbyte encoding or Simple9-like encoding I hacked together.

One disadvantage of Simple9 and other encodings that operate in batches is that removing a value from the middle can increase the number of bytes required for the remaining values. That's a problem during vacuum; it's possible that after vacuuming away one item pointer, the remaining items no longer fit on the page.

- Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On 12/19/2013 03:33 PM, Heikki Linnakangas wrote: On 12/17/2013 12:49 AM, Heikki Linnakangas wrote: On 12/17/2013 12:22 AM, Alexander Korotkov wrote: On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/12/2013 06:44 PM, Alexander Korotkov wrote: When values are packed into small groups, we have to either insert inefficiently encoded value or re-encode whole right part of values. It would probably be simplest to store newly inserted items uncompressed, in a separate area in the page. For example, grow the list of uncompressed items downwards from pg_upper, and the compressed items upwards from pg_lower. When the page fills up, re-encode the whole page. I hacked together an implementation of a variant of Simple9, to see how it performs. Insertions are handled per the above scheme. Here's an updated version of that, using the page layout without item-indexes that I described in the other post. This is much less buggy than that earlier crude version I posted - and unfortunately it doesn't compress as well. The earlier version lost some items :-(. Nevertheless, I think this page layout and code formatting is better, even if we switch the encoding back to the varbyte encoding in the end. Yet another version. The encoding/decoding code is now quite isolated in ginpostinglist.c, so it's easy to experiment with different encodings. This patch uses varbyte encoding again. I got a bit carried away, experimented with a bunch of different encodings. I tried rice encoding, rice encoding with block and offset number delta stored separately, the simple9 variant, and varbyte encoding. The compressed size obviously depends a lot on the distribution of the items, but in the test set I used, the differences between different encodings were quite small. One fatal problem with many encodings is VACUUM. If a page is completely full and you remove one item, the result must still fit. In other words, removing an item must never enlarge the space needed. 
Otherwise we have to be able to split on vacuum, which adds a lot of code, and also makes it possible for VACUUM to fail if there is no disk space left. That's unpleasant if you're trying to run VACUUM to release disk space. (gin fast updates already has that problem BTW, but let's not make it worse)

I believe that eliminates all encodings in the Simple family, as well as PForDelta, and surprisingly also Rice encoding. For example, if you have three items in consecutive offsets, the differences between them are encoded as 11 in rice encoding. If you remove the middle item, the encoding for the next item becomes 010, which takes more space than the original. AFAICS varbyte encoding is safe from that. (a formal proof would be nice though)

So, I'm happy to go with varbyte encoding now, indeed I don't think we have much choice unless someone can come up with an alternative that's VACUUM-safe. I have to put this patch aside for a while now, I spent a lot more time on these encoding experiments than I intended.

If you could take a look at this latest version, spend some time reviewing it and cleaning up any obsolete comments, and re-run the performance tests you did earlier, that would be great. One thing I'm slightly worried about is the overhead of merging the compressed and uncompressed posting lists in a scan. This patch will be in good shape for the final commitfest, or even before that.

- Heikki

gin-packed-postinglists-varbyte2.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] GIN improvements part 1: additional information
Heikki Linnakangas escribió: I believe that eliminates all encodings in the Simple family, as well as PForDelta, and surprisingly also Rice encoding. For example, if you have three items in consecutive offsets, the differences between them are encoded as 11 in rice encoding. If you remove the middle item, the encoding for the next item becomes 010, which takes more space than the original.

I don't understand this. If you have three consecutive entries, and the differences between them are 11, you need to store two 11s. But if you have two items, you only need to store 010 once. So the difference is larger, but since you need to store only one of them then overall it's still shorter than the original. No?

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Dec 20, 2013 at 11:43 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Heikki Linnakangas escribió: I believe that eliminates all encodings in the Simple family, as well as PForDelta, and surprisingly also Rice encoding. For example, if you have three items in consecutive offsets, the differences between them are encoded as 11 in rice encoding. If you remove the middle item, the encoding for the next item becomes 010, which takes more space than the original.

I don't understand this. If you have three consecutive entries, and the differences between them are 11, you need to store two 11s. But if you have two items, you only need to store 010 once. So the difference is larger, but since you need to store only one of them then overall it's still shorter than the original. No?

I believe Heikki meant that both differences together are encoded as 11, i.e. each one is 1.

-- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
Alexander Korotkov escribió: On Fri, Dec 20, 2013 at 11:43 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Heikki Linnakangas escribió: I believe that eliminates all encodings in the Simple family, as well as PForDelta, and surprisingly also Rice encoding. For example, if you have three items in consecutive offsets, the differences between them are encoded as 11 in rice encoding. If you remove the middle item, the encoding for the next item becomes 010, which takes more space than the original.

I don't understand this. If you have three consecutive entries, and the differences between them are 11, you need to store two 11s. But if you have two items, you only need to store 010 once. So the difference is larger, but since you need to store only one of them then overall it's still shorter than the original. No?

I believe Heikki meant that both differences together are encoded as 11, i.e. each one is 1.

Oh, that sucks (or it's great, depending on perspective).

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Dec 20, 2013 at 11:36 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/19/2013 03:33 PM, Heikki Linnakangas wrote: On 12/17/2013 12:49 AM, Heikki Linnakangas wrote: On 12/17/2013 12:22 AM, Alexander Korotkov wrote: On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/12/2013 06:44 PM, Alexander Korotkov wrote: When values are packed into small groups, we have to either insert inefficiently encoded value or re-encode whole right part of values. It would probably be simplest to store newly inserted items uncompressed, in a separate area in the page. For example, grow the list of uncompressed items downwards from pg_upper, and the compressed items upwards from pg_lower. When the page fills up, re-encode the whole page. I hacked together an implementation of a variant of Simple9, to see how it performs. Insertions are handled per the above scheme. Here's an updated version of that, using the page layout without item-indexes that I described in the other post. This is much less buggy than that earlier crude version I posted - and unfortunately it doesn't compress as well. The earlier version lost some items :-(. Nevertheless, I think this page layout and code formatting is better, even if we switch the encoding back to the varbyte encoding in the end. Yet another version. The encoding/decoding code is now quite isolated in ginpostinglist.c, so it's easy to experiment with different encodings. This patch uses varbyte encoding again. I got a bit carried away, experimented with a bunch of different encodings. I tried rice encoding, rice encoding with block and offset number delta stored separately, the simple9 variant, and varbyte encoding. The compressed size obviously depends a lot on the distribution of the items, but in the test set I used, the differences between different encodings were quite small. One fatal problem with many encodings is VACUUM. 
If a page is completely full and you remove one item, the result must still fit. In other words, removing an item must never enlarge the space needed. Otherwise we have to be able to split on vacuum, which adds a lot of code, and also makes it possible for VACUUM to fail if there is no disk space left. That's unpleasant if you're trying to run VACUUM to release disk space. (gin fast updates already has that problem BTW, but let's not make it worse)

I believe that eliminates all encodings in the Simple family, as well as PForDelta, and surprisingly also Rice encoding. For example, if you have three items in consecutive offsets, the differences between them are encoded as 11 in rice encoding. If you remove the middle item, the encoding for the next item becomes 010, which takes more space than the original. AFAICS varbyte encoding is safe from that. (a formal proof would be nice though)

The formal proof goes like this. Removing a number is actually the replacement of two numbers with their sum. We have to prove that the varbyte encoding of the sum can't be longer than the combined varbyte encoding of the summands. The highest bit number of the sum is at most one more than the highest bit number of the larger summand, so the varbyte encoding of the sum is at most one byte longer than the varbyte encoding of the larger summand, and the lesser summand takes at least one byte.

So, I'm happy to go with varbyte encoding now, indeed I don't think we have much choice unless someone can come up with an alternative that's VACUUM-safe. I have to put this patch aside for a while now, I spent a lot more time on these encoding experiments than I intended. If you could take a look at this latest version, spend some time reviewing it and cleaning up any obsolete comments, and re-run the performance tests you did earlier, that would be great. One thing I'm slightly worried about is the overhead of merging the compressed and uncompressed posting lists in a scan. This patch will be in good shape for the final commitfest, or even before that.

OK.
I'll run tests for scans on a mix of compressed and uncompressed posting lists. However, I don't expect it to be slower than either purely compressed or purely uncompressed posting lists. The overhead of compressing uncompressed posting lists is evident, but it would also be nice to measure.

-- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 12/19/2013 08:37 AM, Oleg Bartunov wrote: Guys, before digging deep into the art of comp/decomp world I'd like to know if you familiar with results of http://wwwconference.org/www2008/papers/pdf/p387-zhangA.pdf paper and some newer research ? Yeah, I saw that paper. Do we agree in what we really want ? Basically, there are three main features: size, compression, decompression speed - we should take two :) According to that Zhang et al paper you linked, the Vbyte method actually performs the worst on all of those measures. The other algorithms are quite similar in terms of size (PForDelta being the most efficient), while PForDelta is significantly faster to compress/decompress. Just by looking at those numbers, PForDelta looks like a clear winner. However, it operates on much bigger batches than the other algorithms; I haven't looked at it in detail, but Zhang et al used 128 integer batches, and they say that 32 integers is the minimum batch size. If we want to use it for the inline posting lists stored in entry tuples, that would be quite wasteful if there are only a few item pointers on the tuple. Also, in the tests I've run, the compression/decompression speed is not a significant factor in total performance, with either varbyte encoding or Simple9-like encoding I hacked together. Actually, now that I think about this a bit more, maybe we should go with Rice encoding after all? It's the most efficient in terms of size, and I believe it would be fast enough. Should we design sort of plugin, which could support independent storage on disk, so users can apply different techniques, depending on data. What I want to say is that we certainly can play with this very challenged task, but we have limited time before 9.4 and we should think in positive direction. Once we have the code in place to deal with one encoding, it's easy to switch the implementation. Making it user-configurable or pluggable would be overkill IMHO. 
What I'm saying is that we should make sure we get the page format right (in particular, I strongly feel we should use the self-contained PostingListSegment struct instead of the item-indexes that I mentioned in the other post), with the implementation details hidden in the functions in ginpostinglist.c. Then we can easily experiment with different algorithms. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On 12/17/2013 12:49 AM, Heikki Linnakangas wrote: On 12/17/2013 12:22 AM, Alexander Korotkov wrote: On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/12/2013 06:44 PM, Alexander Korotkov wrote: When values are packed into small groups, we have to either insert inefficiently encoded value or re-encode whole right part of values. It would probably be simplest to store newly inserted items uncompressed, in a separate area in the page. For example, grow the list of uncompressed items downwards from pg_upper, and the compressed items upwards from pg_lower. When the page fills up, re-encode the whole page. I hacked together an implementation of a variant of Simple9, to see how it performs. Insertions are handled per the above scheme. Here's an updated version of that, using the page layout without item-indexes that I described in the other post. This is much less buggy than that earlier crude version I posted - and unfortunately it doesn't compress as well. The earlier version lost some items :-(. Nevertheless, I think this page layout and code formatting is better, even if we switch the encoding back to the varbyte encoding in the end. I haven't tested WAL replay or VACUUM with this version yet, so those are likely broken. - Heikki gin-packed-postinglists-simple8-segments-1.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Tue, Dec 17, 2013 at 2:49 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/17/2013 12:22 AM, Alexander Korotkov wrote: On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/12/2013 06:44 PM, Alexander Korotkov wrote: When values are packed into small groups, we have to either insert an inefficiently encoded value or re-encode the whole right part of the values. It would probably be simplest to store newly inserted items uncompressed, in a separate area on the page. For example, grow the list of uncompressed items downwards from pd_upper, and the compressed items upwards from pd_lower. When the page fills up, re-encode the whole page. I hacked together an implementation of a variant of Simple9, to see how it performs. Insertions are handled per the above scheme. In a limited pg_trgm test case I've been using a lot for this, this reduces the index size by about 20%, compared to varbyte encoding. It might be possible to squeeze it a bit more; I handcrafted the selectors in the encoding algorithm to suit our needs, but I don't actually have a good idea of how to choose them optimally. Also, the encoding can encode 0 values, but we never need to do that, so you could take advantage of that to pack items tighter. Compression and decompression speed seems to be about the same. Patch attached if you want to play with it. WAL replay is still broken, and there are probably bugs. Good idea. But: 1) We'll still need item indexes at the end of the page for fast scan. Sure. 2) The storage should be easily extendable to hold additional information as well. Better compression shouldn't block more serious improvements. I'm not sure I agree with that. For all the cases where you don't care about additional information - which covers all existing users, for example - reducing disk size is pretty important. How are you planning to store the additional information, and how does using another encoding get in the way of that?
I planned to store the additional-information datums between the varbyte-encoded TIDs. I expected it would be hard to do with PFOR. However, I don't see significant problems with your implementation of Simple9 encoding. I'm going to dig deeper into your version of the patch. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 12/18/2013 01:45 PM, Alexander Korotkov wrote: On Tue, Dec 17, 2013 at 2:49 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/17/2013 12:22 AM, Alexander Korotkov wrote: 2) The storage should be easily extendable to hold additional information as well. Better compression shouldn't block more serious improvements. I'm not sure I agree with that. For all the cases where you don't care about additional information - which covers all existing users, for example - reducing disk size is pretty important. How are you planning to store the additional information, and how does using another encoding get in the way of that? I planned to store the additional-information datums between the varbyte-encoded TIDs. I expected it would be hard to do with PFOR. However, I don't see significant problems with your implementation of Simple9 encoding. I'm going to dig deeper into your version of the patch. Ok, thanks. I had another idea about the page format this morning. Instead of having the item-indexes at the end of the page, it would be more flexible to store a bunch of self-contained posting list segments on the page. So I propose that we get rid of the item-indexes, and instead store a bunch of independent posting lists on the page:

typedef struct
{
    ItemPointerData first;   /* first item in this segment (unpacked) */
    uint16          nwords;  /* number of words that follow */
    uint64          words[1]; /* var length */
} PostingListSegment;

Each segment can be encoded and decoded independently. When searching for a particular item (like on insertion), you skip over segments where 'first' is less than the item you're searching for. This format offers a lot more flexibility compared to the separate item indexes. First, we don't need another fixed-size area on the page, which simplifies the page format. Second, we can more easily re-encode only one segment on the page, on insertion or vacuum. The latter is particularly important with the Simple-9 encoding, which operates one word at a time rather than one item at a time; removing or inserting an item in the middle can require a complete re-encoding of everything that follows. Third, when a page is being inserted into and contains only uncompressed items, you don't waste any space on unused item indexes. While we're at it, I think we should use the above struct in the inline posting lists stored directly in entry tuples. That wastes a few bytes compared to the current approach in the patch (more alignment, and 'nwords' is redundant with the number of items stored in the tuple header), but it simplifies the functions handling these lists. - Heikki
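Heikki's segment-skipping search can be sketched in a few lines of C. This is an illustrative stand-alone version: `ItemPtr` and `Segment` are simplified stand-ins for PostgreSQL's `ItemPointerData` and `PostingListSegment`, and `find_segment` is a hypothetical helper name, not a function from the patch.

```c
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's ItemPointerData and
 * PostingListSegment; the real structs live in the patch. */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} ItemPtr;

typedef struct
{
    ItemPtr  first;     /* first item in this segment (unpacked) */
    uint16_t nwords;    /* number of words that follow */
    /* uint64_t words[];  compressed items would follow here */
} Segment;

/* Compare two item pointers: block number first, then offset. */
static int
itemptr_cmp(ItemPtr a, ItemPtr b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/* Return the index of the segment that could contain 'key': the last
 * segment whose 'first' item is <= key, or -1 if key precedes them all.
 * Segments whose 'first' is less than the key are skipped over; since
 * each segment is self-contained, only the segment found here ever
 * needs to be decoded or re-encoded. */
static int
find_segment(const Segment *segs, int nsegs, ItemPtr key)
{
    int result = -1;

    for (int i = 0; i < nsegs; i++)
    {
        if (itemptr_cmp(segs[i].first, key) > 0)
            break;
        result = i;
    }
    return result;
}
```

An insertion would locate its target segment this way and re-encode only that segment, which is the flexibility argument made above.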
Re: [HACKERS] GIN improvements part 1: additional information
Guys, before digging deep into the art of the comp/decomp world I'd like to know if you are familiar with the results of the http://wwwconference.org/www2008/papers/pdf/p387-zhangA.pdf paper and some newer research? Do we agree on what we really want? Basically, there are three main features: size, compression speed, decompression speed - we should take two :) Should we design a sort of plugin, which could support independent storage on disk, so users can apply different techniques depending on the data? What I want to say is that we certainly can play with this very challenging task, but we have limited time before 9.4 and we should think in a positive direction. Oleg

On Wed, Dec 18, 2013 at 6:50 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/18/2013 01:45 PM, Alexander Korotkov wrote: On Tue, Dec 17, 2013 at 2:49 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/17/2013 12:22 AM, Alexander Korotkov wrote: 2) The storage should be easily extendable to hold additional information as well. Better compression shouldn't block more serious improvements. I'm not sure I agree with that. For all the cases where you don't care about additional information - which covers all existing users, for example - reducing disk size is pretty important. How are you planning to store the additional information, and how does using another encoding get in the way of that? I planned to store the additional-information datums between the varbyte-encoded TIDs. I expected it would be hard to do with PFOR. However, I don't see significant problems with your implementation of Simple9 encoding. I'm going to dig deeper into your version of the patch. Ok, thanks. I had another idea about the page format this morning. Instead of having the item-indexes at the end of the page, it would be more flexible to store a bunch of self-contained posting list segments on the page.
So I propose that we get rid of the item-indexes, and instead store a bunch of independent posting lists on the page:

typedef struct
{
    ItemPointerData first;   /* first item in this segment (unpacked) */
    uint16          nwords;  /* number of words that follow */
    uint64          words[1]; /* var length */
} PostingListSegment;

Each segment can be encoded and decoded independently. When searching for a particular item (like on insertion), you skip over segments where 'first' is less than the item you're searching for. This format offers a lot more flexibility compared to the separate item indexes. First, we don't need another fixed-size area on the page, which simplifies the page format. Second, we can more easily re-encode only one segment on the page, on insertion or vacuum. The latter is particularly important with the Simple-9 encoding, which operates one word at a time rather than one item at a time; removing or inserting an item in the middle can require a complete re-encoding of everything that follows. Third, when a page is being inserted into and contains only uncompressed items, you don't waste any space on unused item indexes. While we're at it, I think we should use the above struct in the inline posting lists stored directly in entry tuples. That wastes a few bytes compared to the current approach in the patch (more alignment, and 'nwords' is redundant with the number of items stored in the tuple header), but it simplifies the functions handling these lists. - Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On 12/12/2013 06:44 PM, Alexander Korotkov wrote: I've thought about the different algorithms a little more. The general problem I see is online update. We need it, while it is typically not covered by the research at all. We already have to invent a small index at the end of the page. Different encoding methods add more challenges. In general, the methods can be classified into two groups: 1) values that aren't byte-aligned (gamma codes, PFOR, etc.); 2) multiple values packed together in small groups (Simple-9, Simple-18). Ok. For the first group of methods, when inserting in the middle of the page we would have to do a non-byte-aligned shift of the right part of the values. I don't know how expensive that shift is, but I expect it would be much slower than memmove. Agreed. When values are packed into small groups, we have to either insert an inefficiently encoded value or re-encode the whole right part of the values. It would probably be simplest to store newly inserted items uncompressed, in a separate area on the page. For example, grow the list of uncompressed items downwards from pd_upper, and the compressed items upwards from pd_lower. When the page fills up, re-encode the whole page. - Heikki
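The separate-uncompressed-area idea in the message above can be illustrated with a toy page. This is a sketch only: the `LeafPage` struct, the sizes, and the function names are invented for illustration; the real patch works with PostgreSQL's actual page header, where the boundary fields are named pd_lower and pd_upper.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

typedef struct
{
    uint32_t block;
    uint16_t offset;
} ItemPtr;

/* Toy leaf page: compressed items grow up from pd_lower, newly
 * inserted uncompressed items grow down from pd_upper. */
typedef struct
{
    uint16_t pd_lower;          /* end of compressed area */
    uint16_t pd_upper;          /* start of uncompressed area */
    char     data[PAGE_SIZE];
} LeafPage;

static void
page_init(LeafPage *page)
{
    page->pd_lower = 0;
    page->pd_upper = PAGE_SIZE;
}

/* Append one uncompressed item at the top of the page, growing the
 * uncompressed area downwards. Returns 0 when the free space between
 * the two areas is exhausted; the caller then re-encodes the page. */
static int
add_uncompressed(LeafPage *page, ItemPtr item)
{
    if ((size_t) (page->pd_upper - page->pd_lower) < sizeof(ItemPtr))
        return 0;
    page->pd_upper -= sizeof(ItemPtr);
    memcpy(page->data + page->pd_upper, &item, sizeof(ItemPtr));
    return 1;
}

static int
n_uncompressed(const LeafPage *page)
{
    return (PAGE_SIZE - page->pd_upper) / sizeof(ItemPtr);
}
```

The point of the layout is that an insertion is a single memcpy into the free gap; the expensive re-encoding of the compressed area happens only when the gap closes.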
Re: [HACKERS] GIN improvements part 1: additional information
On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/12/2013 06:44 PM, Alexander Korotkov wrote: I've thought about the different algorithms a little more. The general problem I see is online update. We need it, while it is typically not covered by the research at all. We already have to invent a small index at the end of the page. Different encoding methods add more challenges. In general, the methods can be classified into two groups: 1) values that aren't byte-aligned (gamma codes, PFOR, etc.); 2) multiple values packed together in small groups (Simple-9, Simple-18). Ok. For the first group of methods, when inserting in the middle of the page we would have to do a non-byte-aligned shift of the right part of the values. I don't know how expensive that shift is, but I expect it would be much slower than memmove. Agreed. When values are packed into small groups, we have to either insert an inefficiently encoded value or re-encode the whole right part of the values. It would probably be simplest to store newly inserted items uncompressed, in a separate area on the page. For example, grow the list of uncompressed items downwards from pd_upper, and the compressed items upwards from pd_lower. When the page fills up, re-encode the whole page. Good idea. But: 1) We'll still need item indexes at the end of the page for fast scan. 2) The storage should be easily extendable to hold additional information as well. Better compression shouldn't block more serious improvements. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 12/17/2013 12:22 AM, Alexander Korotkov wrote: On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/12/2013 06:44 PM, Alexander Korotkov wrote: When values are packed into small groups, we have to either insert an inefficiently encoded value or re-encode the whole right part of the values. It would probably be simplest to store newly inserted items uncompressed, in a separate area on the page. For example, grow the list of uncompressed items downwards from pd_upper, and the compressed items upwards from pd_lower. When the page fills up, re-encode the whole page. I hacked together an implementation of a variant of Simple9, to see how it performs. Insertions are handled per the above scheme. In a limited pg_trgm test case I've been using a lot for this, this reduces the index size by about 20%, compared to varbyte encoding. It might be possible to squeeze it a bit more; I handcrafted the selectors in the encoding algorithm to suit our needs, but I don't actually have a good idea of how to choose them optimally. Also, the encoding can encode 0 values, but we never need to do that, so you could take advantage of that to pack items tighter. Compression and decompression speed seems to be about the same. Patch attached if you want to play with it. WAL replay is still broken, and there are probably bugs. Good idea. But: 1) We'll still need item indexes at the end of the page for fast scan. Sure. 2) The storage should be easily extendable to hold additional information as well. Better compression shouldn't block more serious improvements. I'm not sure I agree with that. For all the cases where you don't care about additional information - which covers all existing users, for example - reducing disk size is pretty important. How are you planning to store the additional information, and how does using another encoding get in the way of that?
- Heikki

Attachment: gin-packed-postinglists-simple9-1.patch.gz
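For reference, a minimal Simple-9 word codec looks roughly like the following. It uses the classic selector table (28 one-bit values down to 1 twenty-eight-bit value per 32-bit word); the patch hand-tunes its selectors, so this is a sketch of the general technique, not the patch's actual encoding.

```c
#include <stdint.h>

/* Classic Simple-9 selectors: a 4-bit selector in the top of each
 * 32-bit word says how many values follow and at what bit width. */
static const int s9_count[9] = {28, 14, 9, 7, 5, 4, 3, 2, 1};
static const int s9_bits[9]  = {1, 2, 3, 4, 5, 7, 9, 14, 28};

/* Pack values from 'in' into *word, choosing the densest selector that
 * fits. Returns the number of values consumed, or 0 if the first value
 * needs more than 28 bits. */
static int
s9_pack(const uint32_t *in, int n, uint32_t *word)
{
    if (n <= 0)
        return 0;
    for (int s = 0; s < 9; s++)
    {
        int cnt = s9_count[s] < n ? s9_count[s] : n;
        int i;
        uint32_t w;

        if (cnt < s9_count[s] && s < 8)
            continue;           /* not enough values to fill this mode */
        w = (uint32_t) s << 28;
        for (i = 0; i < cnt; i++)
        {
            if (in[i] >= (1u << s9_bits[s]))
                break;          /* value too wide; try a wider selector */
            w |= in[i] << (i * s9_bits[s]);
        }
        if (i == cnt)
        {
            *word = w;
            return cnt;
        }
    }
    return 0;
}

/* Unpack one word into 'out'; returns the number of values produced. */
static int
s9_unpack(uint32_t word, uint32_t *out)
{
    int s = word >> 28;

    for (int i = 0; i < s9_count[s]; i++)
        out[i] = (word >> (i * s9_bits[s])) & ((1u << s9_bits[s]) - 1);
    return s9_count[s];
}
```

This also shows why mid-page insertion is painful with this family of codecs: changing one value can change which selector a word uses, forcing everything after it to be repacked, which is exactly the motivation for the self-contained segments discussed elsewhere in the thread.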
Re: [HACKERS] GIN improvements part 1: additional information
On Tue, Dec 10, 2013 at 12:26 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/09/2013 11:34 AM, Alexander Korotkov wrote: On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Even if we use varbyte encoding, I wonder if it would be better to treat block + offset number as a single 48-bit integer, rather than encode them separately. That would allow the delta of two items on the same page to be stored as a single byte, rather than two bytes. Naturally it would be a loss on other values, but it would be nice to see some kind of an analysis on that. I suspect it might make the code simpler, too. Yeah, I had that idea, but I thought it wasn't a better option. Will try to do some analysis. The more I think about that, the more convinced I am that it's a good idea. I don't think it will ever compress worse than the current approach of treating block and offset numbers separately, and, although I haven't actually tested it, I doubt it's any slower. About the same amount of arithmetic is required in both versions. Attached is a version that does that. Plus some other minor cleanup. (we should still investigate using a completely different algorithm, though) I've thought about the different algorithms a little more. The general problem I see is online update. We need it, while it is typically not covered by the research at all. We already have to invent a small index at the end of the page. Different encoding methods add more challenges. In general, the methods can be classified into two groups: 1) values that aren't byte-aligned (gamma codes, PFOR, etc.); 2) multiple values packed together in small groups (Simple-9, Simple-18). For the first group of methods, when inserting in the middle of the page we would have to do a non-byte-aligned shift of the right part of the values. I don't know how expensive that shift is, but I expect it would be much slower than memmove.
When values are packed into small groups, we have to either insert an inefficiently encoded value or re-encode the whole right part of the values. The option I see is to encode the bins between item indexes separately. This still might be slower and would require much more complicated maintenance. It also greatly complicates further work on storing additional information in GIN. Any other thoughts? -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On Tue, Dec 10, 2013 at 12:26 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/09/2013 11:34 AM, Alexander Korotkov wrote: On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Even if we use varbyte encoding, I wonder if it would be better to treat block + offset number as a single 48-bit integer, rather than encode them separately. That would allow the delta of two items on the same page to be stored as a single byte, rather than two bytes. Naturally it would be a loss on other values, but it would be nice to see some kind of an analysis on that. I suspect it might make the code simpler, too. Yeah, I had that idea, but I thought it wasn't a better option. Will try to do some analysis. The more I think about that, the more convinced I am that it's a good idea. I don't think it will ever compress worse than the current approach of treating block and offset numbers separately, and, although I haven't actually tested it, I doubt it's any slower. About the same amount of arithmetic is required in both versions. Attached is a version that does that. Plus some other minor cleanup. (we should still investigate using a completely different algorithm, though) Yes, when I thought about that, I didn't realize that we could reserve fewer than 16 bits for the offset number. I reran my benchmark and got the following results:

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:46.39056
 index_build_recovery  | 00:00:05
 index_update          | 00:06:01.557708
 index_update_recovery | 00:01:23
 search_new            | 00:24:05.600366
 search_updated        | 00:25:29.520642
(6 rows)

     label      | blocks_mark
----------------+-------------
 search_new     |   847509920
 search_updated |   883789826
(2 rows)

     label     |   size
---------------+-----------
 new           | 364560384
 after_updates | 642736128
(2 rows)

Speed is the same, while the index size is smaller. In the previous format it was:

     label     |   size
---------------+-----------
 new           | 419299328
 after_updates | 715915264
(2 rows)

Good optimization, thanks. I'll try other datasets, but I expect similar results. However, the patch didn't apply to head. A corrected version is attached. -- With best regards, Alexander Korotkov.

Attachment: gin-packed-postinglists-20.patch.gz
Re: [HACKERS] GIN improvements part 1: additional information
On 12/08/2013 09:56 PM, Alexander Korotkov wrote: On Fri, Nov 29, 2013 at 11:17 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I'll continue reviewing next week.. Got dragged into other things and didn't make any progress on this last week. I'm trying again now. Good. Thanks for debug and fixing bugs. Can I do anything for this patch now? I wonder if we're leaving some money on the table, by using varbyte encoding. Googling around, there are many other compression methods out there for compressing integer deltas that compress better, and/or decompress faster. Even if we use varbyte encoding, I wonder if it would be better to treat block + offset number as a single 48-bit integer, rather than encode them separately. That would allow the delta of two items on the same page to be stored as a single byte, rather than two bytes. Naturally it would be a loss on other values, but would be nice to see some kind of an analysis on that. I suspect it might make the code simpler, too. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
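The idea of treating block + offset as one 48-bit integer and varbyte-encoding the deltas can be sketched as follows. The 11-bit offset split and the helper names here are illustrative assumptions rather than the patch's actual definitions; the point is that two items on the same heap page then differ only in the low bits, so their delta fits in a single byte.

```c
#include <stdint.h>

/* Pack a (block, offset) item pointer into one 48-bit integer.
 * The 11 low bits for the offset number are an assumed split,
 * chosen because an 8 kB heap page holds fewer than 2048 tuples. */
static uint64_t
itemptr_to_uint48(uint32_t blkno, uint16_t offnum)
{
    return ((uint64_t) blkno << 11) | (offnum & 0x7FF);
}

/* Varbyte-encode an unsigned delta: 7 data bits per byte, high bit
 * set on every byte except the last. Returns the bytes written. */
static int
varbyte_encode(uint64_t delta, unsigned char *buf)
{
    int n = 0;

    while (delta >= 0x80)
    {
        buf[n++] = (unsigned char) (delta | 0x80);
        delta >>= 7;
    }
    buf[n++] = (unsigned char) delta;
    return n;
}

/* Decode one varbyte value; *len receives the bytes consumed. */
static uint64_t
varbyte_decode(const unsigned char *buf, int *len)
{
    uint64_t val = 0;
    int shift = 0, n = 0;
    unsigned char b;

    do
    {
        b = buf[n++];
        val |= (uint64_t) (b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    *len = n;
    return val;
}
```

With block and offset encoded separately, a same-page delta always costs at least two bytes (one per component); packed into one 48-bit key, it is a single byte, which is the saving discussed above.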
Re: [HACKERS] GIN improvements part 1: additional information
On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas hlinnakan...@vmware.comwrote: On 12/08/2013 09:56 PM, Alexander Korotkov wrote: On Fri, Nov 29, 2013 at 11:17 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I'll continue reviewing next week.. Got dragged into other things and didn't make any progress on this last week. I'm trying again now. Good. Thanks for debug and fixing bugs. Can I do anything for this patch now? I wonder if we're leaving some money on the table, by using varbyte encoding. Googling around, there are many other compression methods out there for compressing integer deltas that compress better, and/or decompress faster. Ok, I'll try to google around. Probably, there are some better options. Even if we use varbyte encoding, I wonder if it would be better to treat block + offset number as a single 48-bit integer, rather than encode them separately. That would allow the delta of two items on the same page to be stored as a single byte, rather than two bytes. Naturally it would be a loss on other values, but would be nice to see some kind of an analysis on that. I suspect it might make the code simpler, too. Yeah, I had that idea, but I thought it's not a better option. Will try to do some analysis. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 12/09/2013 11:34 AM, Alexander Korotkov wrote: On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas hlinnakan...@vmware.comwrote: Even if we use varbyte encoding, I wonder if it would be better to treat block + offset number as a single 48-bit integer, rather than encode them separately. That would allow the delta of two items on the same page to be stored as a single byte, rather than two bytes. Naturally it would be a loss on other values, but would be nice to see some kind of an analysis on that. I suspect it might make the code simpler, too. Yeah, I had that idea, but I thought it's not a better option. Will try to do some analysis. The more I think about that, the more convinced I am that it's a good idea. I don't think it will ever compress worse than the current approach of treating block and offset numbers separately, and, although I haven't actually tested it, I doubt it's any slower. About the same amount of arithmetic is required in both versions. Attached is a version that does that. Plus some other minor cleanup. (we should still investigate using a completely different algorithm, though) - Heikki gin-packed-postinglists-19.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Nov 29, 2013 at 11:17 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 11/29/2013 11:41 AM, Heikki Linnakangas wrote: On 11/28/2013 09:19 AM, Alexander Korotkov wrote: On Wed, Nov 27, 2013 at 1:14 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 11/26/13 15:34, Alexander Korotkov wrote: What's your plans about GIN now? I tried to rebase packed posting lists with head. But I found that you've changed interface of placeToPage function. That's conflicts with packed posting lists, because dataPlaceToPageLeaf needs not only offset number to describe place to insert item pointer. Do you like to commit rework of handling GIN incomplete splits before? Yeah, I'm planning to get back to this patch after committing the incomplete splits patch. I think the refactoring of the WAL-logging that I did in that patch will simplify this patch, too. I'll take a look at Michael's latest comments on the incomplete splits patch tomorrow, so I should get back to this on Thursday or Friday. Should I try to rebase this patch now or you plan to do it yourself? Some changes like void *insertdata argument make me think you have some particular plan to rebase this patch, but I didn't get it exactly. Here's rebased version. I'll continue reviewing it now.. Another update. Fixes a bunch of bugs. Mostly introduced by me, but a couple were present in your v16: * When allocating the entry-list buffer in a scan, it must be large enough for the max number of items that can fit on a compressed page, whether the current page is compressed or not. That's because the same buffer is reused on subsequent pages, which might be compressed. * When splitting a leaf page during index creation, missed the trick that's present in current master, to choose the split point so that left page is packed as full as possible. I put that back, it makes newly-built indexes somewhat smaller. (I wonder if we should leave some free space for future updates. 
But that's a separate patch, let's keep the current behavior in this patch) I'll continue reviewing next week.. Good. Thanks for debug and fixing bugs. Can I do anything for this patch now? -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 11/29/2013 11:41 AM, Heikki Linnakangas wrote: On 11/28/2013 09:19 AM, Alexander Korotkov wrote: On Wed, Nov 27, 2013 at 1:14 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 11/26/13 15:34, Alexander Korotkov wrote: What's your plans about GIN now? I tried to rebase packed posting lists with head. But I found that you've changed interface of placeToPage function. That's conflicts with packed posting lists, because dataPlaceToPageLeaf needs not only offset number to describe place to insert item pointer. Do you like to commit rework of handling GIN incomplete splits before? Yeah, I'm planning to get back to this patch after committing the incomplete splits patch. I think the refactoring of the WAL-logging that I did in that patch will simplify this patch, too. I'll take a look at Michael's latest comments on the incomplete splits patch tomorrow, so I should get back to this on Thursday or Friday. Should I try to rebase this patch now or you plan to do it yourself? Some changes like void *insertdata argument make me think you have some particular plan to rebase this patch, but I didn't get it exactly. Here's rebased version. I'll continue reviewing it now.. Another update. Fixes a bunch of bugs. Mostly introduced by me, but a couple were present in your v16: * When allocating the entry-list buffer in a scan, it must be large enough for the max number of items that can fit on a compressed page, whether the current page is compressed or not. That's because the same buffer is reused on subsequent pages, which might be compressed. * When splitting a leaf page during index creation, missed the trick that's present in current master, to choose the split point so that left page is packed as full as possible. I put that back, it makes newly-built indexes somewhat smaller. (I wonder if we should leave some free space for future updates. But that's a separate patch, let's keep the current behavior in this patch) I'll continue reviewing next week.. 
- Heikki

Attachment: gin-packed-postinglists-18.patch.gz
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Nov 27, 2013 at 1:14 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 11/26/13 15:34, Alexander Korotkov wrote: What's your plans about GIN now? I tried to rebase packed posting lists with head. But I found that you've changed interface of placeToPage function. That's conflicts with packed posting lists, because dataPlaceToPageLeaf needs not only offset number to describe place to insert item pointer. Do you like to commit rework of handling GIN incomplete splits before? Yeah, I'm planning to get back to this patch after committing the incomplete splits patch. I think the refactoring of the WAL-logging that I did in that patch will simplify this patch, too. I'll take a look at Michael's latest comments on the incomplete splits patch tomorrow, so I should get back to this on Thursday or Friday. Should I try to rebase this patch now or you plan to do it yourself? Some changes like void *insertdata argument make me think you have some particular plan to rebase this patch, but I didn't get it exactly. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
Hi! On Wed, Nov 20, 2013 at 9:02 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 06.11.2013 17:36, Alvaro Herrera wrote: Just for my own illumination, can someone explain this bit? + If a posting list is too large to store in-line in a key entry, a posting tree + is created. A posting tree is a B-tree structure, where the ItemPointer is + used as the key. At the leaf-level, item pointers are stored compressed, in + varbyte encoding. I think the first ItemPointer mentioned (the key) refers to a TID pointing to the index, and item pointers stored compressed refers to the TIDs pointing to the heap (the data). Is that correct? No, they both refer to TIDs pointing to the heap. I'm also interested in the FIXME explain varbyte encoding explanation currently missing, if somebody can write it down ... Alexander's latest version filled in that explanation (haven't read it myself yet) off-list What's your plans about GIN now? I tried to rebase packed posting lists with head. But I found that you've changed interface of placeToPage function. That's conflicts with packed posting lists, because dataPlaceToPageLeaf needs not only offset number to describe place to insert item pointer. Do you like to commit rework of handling GIN incomplete splits before? -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On Tue, Nov 26, 2013 at 5:34 PM, Alexander Korotkov aekorot...@gmail.comwrote: On Wed, Nov 20, 2013 at 9:02 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 06.11.2013 17:36, Alvaro Herrera wrote: Just for my own illumination, can someone explain this bit? + If a posting list is too large to store in-line in a key entry, a posting tree + is created. A posting tree is a B-tree structure, where the ItemPointer is + used as the key. At the leaf-level, item pointers are stored compressed, in + varbyte encoding. I think the first ItemPointer mentioned (the key) refers to a TID pointing to the index, and item pointers stored compressed refers to the TIDs pointing to the heap (the data). Is that correct? No, they both refer to TIDs pointing to the heap. I'm also interested in the FIXME explain varbyte encoding explanation currently missing, if somebody can write it down ... Alexander's latest version filled in that explanation (haven't read it myself yet) off-list It appears to be not actually off-list, sorry :) What's your plans about GIN now? I tried to rebase packed posting lists with head. But I found that you've changed interface of placeToPage function. That's conflicts with packed posting lists, because dataPlaceToPageLeaf needs not only offset number to describe place to insert item pointer. Do you like to commit rework of handling GIN incomplete splits before? -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 11/26/13 15:34, Alexander Korotkov wrote: What's your plans about GIN now? I tried to rebase packed posting lists with head. But I found that you've changed interface of placeToPage function. That's conflicts with packed posting lists, because dataPlaceToPageLeaf needs not only offset number to describe place to insert item pointer. Do you like to commit rework of handling GIN incomplete splits before? Yeah, I'm planning to get back to this patch after committing the incomplete splits patch. I think the refactoring of the WAL-logging that I did in that patch will simplify this patch, too. I'll take a look at Michael's latest comments on the incomplete splits patch tomorrow, so I should get back to this on Thursday or Friday. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
This patch needs to be rebased.
Re: [HACKERS] GIN improvements part 1: additional information
On 06.11.2013 17:36, Alvaro Herrera wrote: Just for my own illumination, can someone explain this bit? + If a posting list is too large to store in-line in a key entry, a posting tree + is created. A posting tree is a B-tree structure, where the ItemPointer is + used as the key. At the leaf-level, item pointers are stored compressed, in + varbyte encoding. I think the first ItemPointer mentioned (the key) refers to a TID pointing to the index, and item pointers stored compressed refers to the TIDs pointing to the heap (the data). Is that correct? No, they both refer to TIDs pointing to the heap. I'm also interested in the FIXME explain varbyte encoding explanation currently missing, if somebody can write it down ... Alexander's latest version filled in that explanation (haven't read it myself yet) - Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On 11/14/13, 6:00 AM, Alexander Korotkov wrote: Sorry, I posted buggy version of patch. Attached version is correct. This patch crashes the hstore and pg_trgm regression tests.
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Nov 15, 2013 at 8:58 PM, Peter Eisentraut pete...@gmx.net wrote: On 11/14/13, 6:00 AM, Alexander Korotkov wrote: Sorry, I posted buggy version of patch. Attached version is correct. This patch crashes the hstore and pg_trgm regression tests. Which exact version did you try, 14 or 16? If it was 16, could you please post a stack trace, because it doesn't crash for me. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 11/15/13, 12:24 PM, Alexander Korotkov wrote: On Fri, Nov 15, 2013 at 8:58 PM, Peter Eisentraut pete...@gmx.net wrote: On 11/14/13, 6:00 AM, Alexander Korotkov wrote: Sorry, I posted buggy version of patch. Attached version is correct. This patch crashes the hstore the pg_trgm regression tests. What exact version did you try 14 or 16? If it was 16, could you please post a stacktrace, because it doesn't crash for me. The one that's the latest in the commitfest: http://www.postgresql.org/message-id/capphfdveq-jhe_2z9pvw22bp6h_o8xoaxcbjagf87gs4p4j...@mail.gmail.com If you have a newer one, please add it there.
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Nov 15, 2013 at 9:56 PM, Peter Eisentraut pete...@gmx.net wrote: On 11/15/13, 12:24 PM, Alexander Korotkov wrote: On Fri, Nov 15, 2013 at 8:58 PM, Peter Eisentraut pete...@gmx.net wrote: On 11/14/13, 6:00 AM, Alexander Korotkov wrote: Sorry, I posted buggy version of patch. Attached version is correct. This patch crashes the hstore the pg_trgm regression tests. What exact version did you try 14 or 16? If it was 16, could you please post a stacktrace, because it doesn't crash for me. The one that's the latest in the commitfest: http://www.postgresql.org/message-id/capphfdveq-jhe_2z9pvw22bp6h_o8xoaxcbjagf87gs4p4j...@mail.gmail.com If you have a newer one, please add it there. OK, I've updated the information in the commitfest app. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On Tue, Nov 5, 2013 at 9:49 PM, Heikki Linnakangas hlinnakan...@vmware.comwrote: On 04.11.2013 23:44, Alexander Korotkov wrote: On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov aekorot...@gmail.comwrote: Attached version of patch is debugged. I would like to note that number of bugs was low and it wasn't very hard to debug. I've rerun tests on it. You can see that numbers are improved as the result of your refactoring. event | period ---+- index_build | 00:01:45.416822 index_build_recovery | 00:00:06 index_update | 00:05:17.263606 index_update_recovery | 00:01:22 search_new| 00:24:07.706482 search_updated| 00:26:25.364494 (6 rows) label | blocks_mark +- search_new | 847587636 search_updated | 881210486 (2 rows) label | size ---+--- new | 419299328 after_updates | 715915264 (2 rows) Beside simple bugs, there was a subtle bug in incremental item indexes update. I've added some more comments including ASCII picture about how item indexes are shifted. Now, I'm trying to implement support of old page format. Then we can decide which approach to use. Attached version of patch has support of old page format. Patch still needs more documentation and probably refactoring, but I believe idea is clear and it can be reviewed. In the patch I have to revert change of null category placement for compatibility with old posting list format. Thanks, just glanced at this quickly. If I'm reading the patch correctly, old-style non-compressed posting tree leaf pages are not vacuumed at all; that's clearly not right. Fixed. Now separate function handles uncompressed posting lists and compress them if as least one TID is deleted. I argued earlier that we don't need to handle the case that compressing a page makes it larger, so that the existing items don't fit on the page anymore. I'm having some second thoughts on that; I didn't take into account the fact that the mini-index in the new page format takes up some space. 
I think it's still highly unlikely that there isn't enough space on an 8k page, but it's not totally crazy with a non-standard small page size. So at least that situation needs to be checked for with an ereport(), rather than an Assert. Now this situation is reported with ereport() before any change to the page. To handle splitting a non-compressed page, it seems that you're relying on the fact that before splitting, we try to insert, which compresses the page. The problem with that is that it's not correctly WAL-logged. The split record that follows includes a full copy of both page halves, but if the split fails for some reason, e.g. you run out of disk space, there is no WAL record at all of the compression. I'd suggest doing the compression in the insert phase on a temporary copy of the page, leaving the original page untouched if there isn't enough space for the insertion to complete. (You could argue that this case can't happen because the compression must always create enough space to insert one more item. Maybe so, but at least there should be an explicit check for that.) Good point. Yes, if we don't specially handle inserting item indexes, I see no point in special handling for a single TID, which is much smaller. In the attached patch dataCompressLeafPage just reserves space for a single TID. Also, many changes in comments and README. Unfortunately, I didn't understand what the FIXME in ginVacuumEntryPage is about. So, it's not fixed :) -- With best regards, Alexander Korotkov. gin-packed-postinglists-13.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] GIN improvements part 1: additional information
On Thu, Nov 14, 2013 at 2:17 PM, Alexander Korotkov aekorot...@gmail.comwrote: On Tue, Nov 5, 2013 at 9:49 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 04.11.2013 23:44, Alexander Korotkov wrote: On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov aekorot...@gmail.comwrote: Attached version of patch is debugged. I would like to note that number of bugs was low and it wasn't very hard to debug. I've rerun tests on it. You can see that numbers are improved as the result of your refactoring. event | period ---+- index_build | 00:01:45.416822 index_build_recovery | 00:00:06 index_update | 00:05:17.263606 index_update_recovery | 00:01:22 search_new| 00:24:07.706482 search_updated| 00:26:25.364494 (6 rows) label | blocks_mark +- search_new | 847587636 search_updated | 881210486 (2 rows) label | size ---+--- new | 419299328 after_updates | 715915264 (2 rows) Beside simple bugs, there was a subtle bug in incremental item indexes update. I've added some more comments including ASCII picture about how item indexes are shifted. Now, I'm trying to implement support of old page format. Then we can decide which approach to use. Attached version of patch has support of old page format. Patch still needs more documentation and probably refactoring, but I believe idea is clear and it can be reviewed. In the patch I have to revert change of null category placement for compatibility with old posting list format. Thanks, just glanced at this quickly. If I'm reading the patch correctly, old-style non-compressed posting tree leaf pages are not vacuumed at all; that's clearly not right. Fixed. Now separate function handles uncompressed posting lists and compress them if as least one TID is deleted. I argued earlier that we don't need to handle the case that compressing a page makes it larger, so that the existing items don't fit on the page anymore. 
I'm having some second thoughts on that; I didn't take into account the fact that the mini-index in the new page format takes up some space. I think it's still highly unlikely that there isn't enough space on a 8k page, but it's not totally crazy with a non-standard small page size. So at least that situation needs to be checked for with an ereport(), rather than Assert. Now this situation is ereported before any change in page. To handle splitting a non-compressed page, it seems that you're relying on the fact that before splitting, we try to insert, which compresses the page. The problem with that is that it's not correctly WAL-logged. The split record that follows includes a full copy of both page halves, but if the split fails for some reason, e.g you run out of disk space, there is no WAL record at all of the the compression. I'd suggest doing the compression in the insert phase on a temporary copy of the page, leaving the original page untouched if there isn't enough space for the insertion to complete. (You could argue that this case can't happen because the compression must always create enough space to insert one more item. maybe so, but at least there should be an explicit check for that.) Good point. Yes, if we don't handle specially inserting item indexes, I see no point to do special handling for single TID which is much smaller. In the attached patch dataCompressLeafPage just reserves space for single TID. Also, many changes in comments and README. Unfortunally, I didn't understand what is FIXME about in ginVacuumEntryPage. So, it's not fixed :) Sorry, I posted buggy version of patch. Attached version is correct. -- With best regards, Alexander Korotkov. gin-packed-postinglists-14.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Thu, Nov 14, 2013 at 3:00 PM, Alexander Korotkov aekorot...@gmail.comwrote: On Thu, Nov 14, 2013 at 2:17 PM, Alexander Korotkov aekorot...@gmail.comwrote: On Tue, Nov 5, 2013 at 9:49 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 04.11.2013 23:44, Alexander Korotkov wrote: On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov aekorot...@gmail.comwrote: Attached version of patch is debugged. I would like to note that number of bugs was low and it wasn't very hard to debug. I've rerun tests on it. You can see that numbers are improved as the result of your refactoring. event | period ---+- index_build | 00:01:45.416822 index_build_recovery | 00:00:06 index_update | 00:05:17.263606 index_update_recovery | 00:01:22 search_new| 00:24:07.706482 search_updated| 00:26:25.364494 (6 rows) label | blocks_mark +- search_new | 847587636 search_updated | 881210486 (2 rows) label | size ---+--- new | 419299328 after_updates | 715915264 (2 rows) Beside simple bugs, there was a subtle bug in incremental item indexes update. I've added some more comments including ASCII picture about how item indexes are shifted. Now, I'm trying to implement support of old page format. Then we can decide which approach to use. Attached version of patch has support of old page format. Patch still needs more documentation and probably refactoring, but I believe idea is clear and it can be reviewed. In the patch I have to revert change of null category placement for compatibility with old posting list format. Thanks, just glanced at this quickly. If I'm reading the patch correctly, old-style non-compressed posting tree leaf pages are not vacuumed at all; that's clearly not right. Fixed. Now separate function handles uncompressed posting lists and compress them if as least one TID is deleted. I argued earlier that we don't need to handle the case that compressing a page makes it larger, so that the existing items don't fit on the page anymore. 
I'm having some second thoughts on that; I didn't take into account the fact that the mini-index in the new page format takes up some space. I think it's still highly unlikely that there isn't enough space on a 8k page, but it's not totally crazy with a non-standard small page size. So at least that situation needs to be checked for with an ereport(), rather than Assert. Now this situation is ereported before any change in page. To handle splitting a non-compressed page, it seems that you're relying on the fact that before splitting, we try to insert, which compresses the page. The problem with that is that it's not correctly WAL-logged. The split record that follows includes a full copy of both page halves, but if the split fails for some reason, e.g you run out of disk space, there is no WAL record at all of the the compression. I'd suggest doing the compression in the insert phase on a temporary copy of the page, leaving the original page untouched if there isn't enough space for the insertion to complete. (You could argue that this case can't happen because the compression must always create enough space to insert one more item. maybe so, but at least there should be an explicit check for that.) Good point. Yes, if we don't handle specially inserting item indexes, I see no point to do special handling for single TID which is much smaller. In the attached patch dataCompressLeafPage just reserves space for single TID. Also, many changes in comments and README. Unfortunally, I didn't understand what is FIXME about in ginVacuumEntryPage. So, it's not fixed :) Sorry, I posted buggy version of patch. Attached version is correct. Another bug fix by report from Rod Taylor. -- With best regards, Alexander Korotkov. gin-packed-postinglists-16.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On 04.11.2013 23:44, Alexander Korotkov wrote: Attached version of patch has support of old page format. Patch still needs more documentation and probably refactoring, but I believe idea is clear and it can be reviewed. In the patch I have to revert change of null category placement for compatibility with old posting list format. I committed some of the refactorings and cleanup included in this patch. Attached is a rebased version that applies over current head. I hope I didn't joggle your elbow... - Heikki gin-packed-postinglists-12-rebased.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] GIN improvements part 1: additional information
Heikki Linnakangas wrote: On 04.11.2013 23:44, Alexander Korotkov wrote: Attached version of patch has support of old page format. Patch still needs more documentation and probably refactoring, but I believe idea is clear and it can be reviewed. In the patch I have to revert change of null category placement for compatibility with old posting list format. I committed some of the refactorings and cleanup included in this patch. Attached is a rebased version that applies over current head. I hope I didn't joggle your elbow.. Just for my own illumination, can someone explain this bit? + If a posting list is too large to store in-line in a key entry, a posting tree + is created. A posting tree is a B-tree structure, where the ItemPointer is + used as the key. At the leaf-level, item pointers are stored compressed, in + varbyte encoding. I think the first ItemPointer mentioned (the key) refers to a TID pointing to the index, and item pointers stored compressed refers to the TIDs pointing to the heap (the data). Is that correct? I'm also interested in the FIXME explain varbyte encoding explanation currently missing, if somebody can write it down ... Thanks, -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
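For reference, the varbyte encoding asked about above stores each integer in as few bytes as possible: seven data bits per byte, with the high bit set on every byte except the last. The patch applies it to the deltas between consecutive heap item pointers, which are usually small. A minimal generic sketch of the scheme (the patch's actual on-disk layout for 6-byte item pointers may pack things differently):

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Encode val as varbyte: 7 data bits per byte, least significant
 * group first, high bit set on all bytes except the last.
 * Returns the number of bytes written to buf.
 */
static size_t
varbyte_encode(uint64_t val, unsigned char *buf)
{
    size_t      n = 0;

    while (val >= 0x80)
    {
        buf[n++] = (unsigned char) ((val & 0x7F) | 0x80);
        val >>= 7;
    }
    buf[n++] = (unsigned char) val;
    return n;
}

/* Decode one varbyte-encoded value; report bytes consumed. */
static uint64_t
varbyte_decode(const unsigned char *buf, size_t *consumed)
{
    uint64_t    val = 0;
    int         shift = 0;
    size_t      n = 0;

    while (buf[n] & 0x80)
    {
        val |= (uint64_t) (buf[n++] & 0x7F) << shift;
        shift += 7;
    }
    val |= (uint64_t) buf[n++] << shift;
    *consumed = n;
    return val;
}
```

Because delta-encoded TIDs are typically tiny, most entries take one or two bytes instead of six.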
Re: [HACKERS] GIN improvements part 1: additional information
On 04.11.2013 23:44, Alexander Korotkov wrote: On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov aekorot...@gmail.comwrote: Attached version of patch is debugged. I would like to note that number of bugs was low and it wasn't very hard to debug. I've rerun tests on it. You can see that numbers are improved as the result of your refactoring. event | period ---+- index_build | 00:01:45.416822 index_build_recovery | 00:00:06 index_update | 00:05:17.263606 index_update_recovery | 00:01:22 search_new| 00:24:07.706482 search_updated| 00:26:25.364494 (6 rows) label | blocks_mark +- search_new | 847587636 search_updated | 881210486 (2 rows) label | size ---+--- new | 419299328 after_updates | 715915264 (2 rows) Beside simple bugs, there was a subtle bug in incremental item indexes update. I've added some more comments including ASCII picture about how item indexes are shifted. Now, I'm trying to implement support of old page format. Then we can decide which approach to use. Attached version of patch has support of old page format. Patch still needs more documentation and probably refactoring, but I believe idea is clear and it can be reviewed. In the patch I have to revert change of null category placement for compatibility with old posting list format. Thanks, just glanced at this quickly. If I'm reading the patch correctly, old-style non-compressed posting tree leaf pages are not vacuumed at all; that's clearly not right. I argued earlier that we don't need to handle the case that compressing a page makes it larger, so that the existing items don't fit on the page anymore. I'm having some second thoughts on that; I didn't take into account the fact that the mini-index in the new page format takes up some space. I think it's still highly unlikely that there isn't enough space on a 8k page, but it's not totally crazy with a non-standard small page size. So at least that situation needs to be checked for with an ereport(), rather than Assert. 
To handle splitting a non-compressed page, it seems that you're relying on the fact that before splitting, we try to insert, which compresses the page. The problem with that is that it's not correctly WAL-logged. The split record that follows includes a full copy of both page halves, but if the split fails for some reason, e.g. you run out of disk space, there is no WAL record at all of the compression. I'd suggest doing the compression in the insert phase on a temporary copy of the page, leaving the original page untouched if there isn't enough space for the insertion to complete. (You could argue that this case can't happen because the compression must always create enough space to insert one more item. Maybe so, but at least there should be an explicit check for that.) - Heikki
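The pattern suggested here, compress into a scratch copy and install the result only if the insertion fits, can be sketched as follows. FakePage, compress_fn, and insert_with_compression are illustrative stand-ins, not PostgreSQL APIs; in the real patch this would operate on a temporary copy of the buffer page, with WAL-logging happening only after the copy is installed:

```c
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    char    data[256];
    size_t  used;
} FakePage;

/*
 * Compress into a scratch copy; if the new item still fits afterwards,
 * append it and install the copy atomically.  On failure the original
 * page is untouched, so the caller can fall back to a page split
 * without any un-logged modification having happened.
 */
static bool
insert_with_compression(FakePage *page,
                        size_t (*compress_fn) (const FakePage *, FakePage *),
                        const char *item, size_t itemlen)
{
    FakePage    tmp = {{0}, 0};
    size_t      newused = compress_fn(page, &tmp);

    if (newused + itemlen > sizeof(tmp.data))
        return false;           /* doesn't fit: caller must split */

    memcpy(tmp.data + newused, item, itemlen);
    tmp.used = newused + itemlen;
    *page = tmp;                /* original only changes on success */
    return true;
}
```

The key property is that the original page is modified exactly once, after the outcome is known, mirroring the requirement that every durable page change be covered by a WAL record.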
Re: [HACKERS] GIN improvements part 1: additional information
On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov aekorot...@gmail.com wrote: Attached version of patch is debugged. I would like to note that the number of bugs was low and it wasn't very hard to debug. I've rerun tests on it. You can see that the numbers are improved as the result of your refactoring.

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:45.416822
 index_build_recovery  | 00:00:06
 index_update          | 00:05:17.263606
 index_update_recovery | 00:01:22
 search_new            | 00:24:07.706482
 search_updated        | 00:26:25.364494
(6 rows)

     label      | blocks_mark
----------------+-------------
 search_new     |   847587636
 search_updated |   881210486
(2 rows)

     label     |   size
---------------+-----------
 new           | 419299328
 after_updates | 715915264
(2 rows)

Besides simple bugs, there was a subtle bug in incremental item index updates. I've added some more comments, including an ASCII picture of how item indexes are shifted. Now I'm trying to implement support for the old page format. Then we can decide which approach to use. Attached version of patch has support of old page format. Patch still needs more documentation and probably refactoring, but I believe the idea is clear and it can be reviewed. In the patch I had to revert the change of null category placement for compatibility with the old posting list format. -- With best regards, Alexander Korotkov. gin-packed-postinglists-12.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] GIN improvements part 1: additional information
On Thu, Oct 10, 2013 at 3:57 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 09.10.2013 02:04, Tomas Vondra wrote: On 8.10.2013 21:59, Heikki Linnakangas wrote: On 08.10.2013 17:47, Alexander Korotkov wrote: Hi, Tomas! On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondrat...@fuzzy.cz wrote: I've attempted to rerun the benchmarks tests I did a few weeks ago, but I got repeated crashes when loading the data (into a table with tsvector+gin index). Right before a crash, theres this message in the log: PANIC: not enough space in leaf page! Thanks for testing. Heikki's version of patch don't works for me too on even much more simplier examples. I can try to get it working if he answer my question about GinDataLeafPageGetPostingList* macros. The new macros in that patch version were quite botched. Here's a new attempt. Nope, still the same errors :-( PANIC: not enough space in leaf page! LOG: server process (PID 29722) was terminated by signal 6: Aborted DETAIL: Failed process was running: autovacuum: ANALYZE public.messages I've continued hacking away at the patch, here's yet another version. I've done a lot of cleanup and refactoring to make the code more readable (I hope). I'm not sure what caused the panic you saw, but it's probably fixed now. Let me know if not. Some notable changes since my last patch version: * I changed the ginbtree interface so that isEnoughSpace method is no more. Instead, placeToPage returns false without doing anything if the item doesn't fit. It was slightly simpler this way when working with the compressed posting lists, as placeToPage had to calculate tuple sizes anyway to decide how many items fit on the page. * I rewrote incrUpdateItemIndexes(). It now operates in two stages: 1. adjust the pageOffsets and 2. split the bin. I found it more readable this way. * I changed the WAL format of insert records so that there is a separate struct for data-leaf, data-internal, and entry pages. 
The information recorded for each of those was so different that it was confusing to cram them all into the same struct. I'm going to step back a bit from the details of the patch now. I think there's enough work for you, Alexander, until the next commitfest: * Try to make the code also work with the old page format, so that you don't need to REINDEX after pg_upgrade. * Documentation. The new compressed posting list format needs to be explained somewhere. At the top of ginpostinglist.c would be a good place. The README should be improved too. * Fix whatever I broke (sorry) Are we committed to go ahead with this patch in 9.4 timeframe, in one form or another? If we are, then I'd like to start committing refactoring patches that just move code around in preparation for the Patch, to make the review of the core of the patch easier. For example, merging the isEnoughSpace and placeToPage methods in the ginbtree interface. Without the patch, it's unnecessary code churn - it won't do any harm but it won't do any good either. I'm pretty confident that this patch will land in the 9.4 timeframe, so barring objections I'll start committing such refactorings. Attached version of patch is debugged. I would like to note that number of bugs was low and it wasn't very hard to debug. I've rerun tests on it. You can see that numbers are improved as the result of your refactoring. event | period ---+- index_build | 00:01:45.416822 index_build_recovery | 00:00:06 index_update | 00:05:17.263606 index_update_recovery | 00:01:22 search_new| 00:24:07.706482 search_updated| 00:26:25.364494 (6 rows) label | blocks_mark +- search_new | 847587636 search_updated | 881210486 (2 rows) label | size ---+--- new | 419299328 after_updates | 715915264 (2 rows) Beside simple bugs, there was a subtle bug in incremental item indexes update. I've added some more comments including ASCII picture about how item indexes are shifted. Now, I'm trying to implement support of old page format. 
Then we can decide which approach to use. -- With best regards, Alexander Korotkov. gin-packed-postinglists-11.patch.gz Description: GNU Zip compressed data
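The ginbtree interface change described in the review above, dropping the separate isEnoughSpace method in favor of placeToPage returning false without side effects when the item doesn't fit, can be illustrated with a toy version. ToyPage and place_to_page are stand-ins, not the actual ginbtree structures:

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_CAPACITY 4         /* tiny, to make the full case easy to hit */

typedef struct
{
    int     items[PAGE_CAPACITY];
    size_t  nitems;
} ToyPage;

/*
 * Combined "check and place": compute the required space here (where the
 * size is being calculated anyway) and decline by returning false,
 * leaving the page untouched.  The caller reacts by splitting the page.
 */
static bool
place_to_page(ToyPage *page, int item)
{
    if (page->nitems >= PAGE_CAPACITY)
        return false;
    page->items[page->nitems++] = item;
    return true;
}
```

Merging the check into the placement call avoids computing tuple sizes twice, which matters once the compressed posting list format makes those sizes nontrivial to predict.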
Re: [HACKERS] GIN improvements part 1: additional information
On Sat, Oct 19, 2013 at 05:11:55PM -0400, Bruce Momjian wrote: I am in Moscow with Alexander and we were discussing GIN pg_upgrade issues. One option we have already discussed was to take the old GIN index code and put it in a separate directory, and call the new GIN index something different, but that got hung up on how users were going to create indexes of the new type. One nice trick would be for the index creation code to check the global variable IsBinaryUpgrade that pg_upgrade sets. If CREATE INDEX ... GIN finds IsBinaryUpgrade set, it should create an (empty) index of type gin-v1/old. If it is not set, it should create a new gin index. This will allow pg_upgrade to work, and allow REINDEX to create a new-type GIN index from an old one. We would need to append -v1 to the old index directory, system catalog, and function names. We could then remove the old GIN index code in some later major release, and pg_upgrade will then mark the indexes as invalid and create a REINDEX script. This allows users to reindex their GIN indexes over time, but it doesn't lock us into keeping the gin-v1 code around forever. Correction --- the above is not going to work because the backend isn't going to know, during a binary upgrade, whether CREATE INDEX ... GIN is coming from a 9.3 or 9.4 database. We are going to need pg_dump code to do the mapping, and also pg_dump code to map gin_v1 back to gin once we remove the gin_v1 code. We will also need index creation code to map gin_v1 to gin to properly handle REINDEX and CLUSTER, but only if not in binary upgrade mode. If it would help, I would be happy to write a simple patch for the above items once I return from Europe in November. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] GIN improvements part 1: additional information
On Thu, Oct 3, 2013 at 05:24:49PM -0400, Bruce Momjian wrote: On Thu, Oct 3, 2013 at 02:48:20PM -0400, Robert Haas wrote: On Thu, Oct 3, 2013 at 2:43 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: It seems we've all but decided that we'll require reindexing GIN indexes in 9.4. I thought the consensus in Ottawa was strongly against that. I'm not aware that anyone has subsequently changed their position on the topic. Bruce is right to point out that we've done such things before and can therefore do it again, but just because we have the technical means to do it doesn't make it good policy. That having been said, if we do decide to break it... Agreed. I was stating only that this is easy for pg_upgrade. One cool thing is that the upgrades completes, and the indexes are there, but just marked as invalid until the REINDEX. One other point Alexander made is that the new GIN indexes will be smaller so most people would want the new format in the new cluster anyway. I am in Moscow with Alexander and we were discussing GIN pg_upgrade issues. One option we have already discussed was to take the old GIN index code and put it in a separate directory, and call the new GIN index something different, but that got hung up on how users were going to create indexes of the new type. One nice trick would be for the index creation code to check the global variable IsBinaryUpgrade that pg_upgrade sets. If CREATE INDEX ... GIN finds IsBinaryUpgrade set, it should create an (empty) index of type gin-v1/old. If it is not set, it should create a new gin index. This will allow pg_upgrade to work, and allow REINDEX to create a new-type GIN index from an old one. We would need to append -v1 to the old index directory, system catalog, and function names. We could then remove the old GIN index code in some later major release, and pg_upgrade will then mark the indexes as invalid and create a REINDEX script. 
This allows users to reindex their GIN indexes over time, but it doesn't lock us into keeping the gin-v1 code around forever. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
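The dispatch proposed here (which the correction earlier in the thread notes won't work as-is, since the backend can't tell whether the CREATE INDEX came from a 9.3 or 9.4 dump) would amount to something like the sketch below. IsBinaryUpgrade is a real backend global set by pg_upgrade; "gin_v1" and the resolver function are hypothetical names used only for illustration:

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for the backend global that pg_upgrade sets. */
static bool IsBinaryUpgrade = false;

/*
 * Hypothetical access-method name resolution at index creation time:
 * during a binary upgrade, "gin" maps to the legacy AM so restored
 * indexes keep the old on-disk format; outside an upgrade, a REINDEX
 * of a legacy index produces the new format.
 */
static const char *
resolve_gin_am(const char *requested)
{
    if (IsBinaryUpgrade && strcmp(requested, "gin") == 0)
        return "gin_v1";
    if (!IsBinaryUpgrade && strcmp(requested, "gin_v1") == 0)
        return "gin";
    return requested;
}
```

As the correction points out, the mapping really belongs in pg_dump output rather than in the backend, since only the dump knows which server version produced the index definition.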
Re: [HACKERS] GIN improvements part 1: additional information
On Sat, Oct 12, 2013 at 1:55 AM, Tomas Vondra t...@fuzzy.cz wrote: On 10.10.2013 13:57, Heikki Linnakangas wrote: On 09.10.2013 02:04, Tomas Vondra wrote: On 8.10.2013 21:59, Heikki Linnakangas wrote: On 08.10.2013 17:47, Alexander Korotkov wrote: Hi, Tomas! On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondrat...@fuzzy.cz wrote: I've attempted to rerun the benchmarks tests I did a few weeks ago, but I got repeated crashes when loading the data (into a table with tsvector+gin index). Right before a crash, theres this message in the log: PANIC: not enough space in leaf page! Thanks for testing. Heikki's version of patch don't works for me too on even much more simplier examples. I can try to get it working if he answer my question about GinDataLeafPageGetPostingList* macros. The new macros in that patch version were quite botched. Here's a new attempt. Nope, still the same errors :-( PANIC: not enough space in leaf page! LOG: server process (PID 29722) was terminated by signal 6: Aborted DETAIL: Failed process was running: autovacuum: ANALYZE public.messages I've continued hacking away at the patch, here's yet another version. I've done a lot of cleanup and refactoring to make the code more readable (I hope). I'm not sure what caused the panic you saw, but it's probably fixed now. Let me know if not. Yup, this version fixed the issues. I haven't been able to do any benchmarks yet, all I have is some basic stats | HEAD | patched == load duration | 1084 s | 1086 s subject index | 96 MB | 96 MB body index | 2349 MB | 2051 MB So there's virtually no difference in speed (which is expected, AFAIK) and the large index on full message bodies is significantly smaller. Yes, it should be no significant difference in speed. But difference in index sizes seems to be too small. Could you share database dump somewhere? -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 12.10.2013 12:11, Alexander Korotkov wrote: On Sat, Oct 12, 2013 at 1:55 AM, Tomas Vondra t...@fuzzy.cz wrote: Yup, this version fixed the issues. I haven't been able to do any benchmarks yet, all I have is some basic stats:

                | HEAD    | patched
  =================================
  load duration | 1084 s  | 1086 s
  subject index |   96 MB |   96 MB
  body index    | 2349 MB | 2051 MB

So there's virtually no difference in speed (which is expected, AFAIK) and the large index on full message bodies is significantly smaller.

Yes, there should be no significant difference in speed. But the difference in index sizes seems to be too small. Could you share the database dump somewhere?

Turns out that if I do VACUUM FULL after loading the data (a sequence of INSERT commands), the index sizes drop significantly:

                | HEAD   | patched
  ================================
  subject index |  42 MB |  15 MB
  body index    | 624 MB | 328 MB

So there's a significant improvement, as expected. I'm wondering if the bloat is expected too? Is that the consequence of incremental index updates vs. rebuilding the whole index at once during VACUUM FULL?

Tomas
Re: [HACKERS] GIN improvements part 1: additional information
On 10.10.2013 13:57, Heikki Linnakangas wrote: On 09.10.2013 02:04, Tomas Vondra wrote: On 8.10.2013 21:59, Heikki Linnakangas wrote: On 08.10.2013 17:47, Alexander Korotkov wrote: Hi, Tomas! On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondra t...@fuzzy.cz wrote: I've attempted to rerun the benchmark tests I did a few weeks ago, but I got repeated crashes when loading the data (into a table with tsvector+gin index). Right before a crash, there's this message in the log: PANIC: not enough space in leaf page!

Thanks for testing. Heikki's version of the patch doesn't work for me either, even on much simpler examples. I can try to get it working once he answers my question about the GinDataLeafPageGetPostingList* macros.

The new macros in that patch version were quite botched. Here's a new attempt.

Nope, still the same errors :-( PANIC: not enough space in leaf page! LOG: server process (PID 29722) was terminated by signal 6: Aborted DETAIL: Failed process was running: autovacuum: ANALYZE public.messages

I've continued hacking away at the patch, here's yet another version. I've done a lot of cleanup and refactoring to make the code more readable (I hope). I'm not sure what caused the panic you saw, but it's probably fixed now. Let me know if not.

Yup, this version fixed the issues. I haven't been able to do any benchmarks yet, all I have is some basic stats:

                | HEAD    | patched
  =================================
  load duration | 1084 s  | 1086 s
  subject index |   96 MB |   96 MB
  body index    | 2349 MB | 2051 MB

regards Tomas
Re: [HACKERS] GIN improvements part 1: additional information
On 09.10.2013 02:04, Tomas Vondra wrote: On 8.10.2013 21:59, Heikki Linnakangas wrote: On 08.10.2013 17:47, Alexander Korotkov wrote: Hi, Tomas! On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondrat...@fuzzy.cz wrote: I've attempted to rerun the benchmarks tests I did a few weeks ago, but I got repeated crashes when loading the data (into a table with tsvector+gin index). Right before a crash, theres this message in the log: PANIC: not enough space in leaf page! Thanks for testing. Heikki's version of patch don't works for me too on even much more simplier examples. I can try to get it working if he answer my question about GinDataLeafPageGetPostingList* macros. The new macros in that patch version were quite botched. Here's a new attempt. Nope, still the same errors :-( PANIC: not enough space in leaf page! LOG: server process (PID 29722) was terminated by signal 6: Aborted DETAIL: Failed process was running: autovacuum: ANALYZE public.messages I've continued hacking away at the patch, here's yet another version. I've done a lot of cleanup and refactoring to make the code more readable (I hope). I'm not sure what caused the panic you saw, but it's probably fixed now. Let me know if not. Some notable changes since my last patch version: * I changed the ginbtree interface so that isEnoughSpace method is no more. Instead, placeToPage returns false without doing anything if the item doesn't fit. It was slightly simpler this way when working with the compressed posting lists, as placeToPage had to calculate tuple sizes anyway to decide how many items fit on the page. * I rewrote incrUpdateItemIndexes(). It now operates in two stages: 1. adjust the pageOffsets and 2. split the bin. I found it more readable this way. * I changed the WAL format of insert records so that there is a separate struct for data-leaf, data-internal, and entry pages. The information recorded for each of those was so different that it was confusing to cram them all into the same struct. 
I'm going to step back a bit from the details of the patch now. I think there's enough work for you, Alexander, until the next commitfest: * Try to make the code also work with the old page format, so that you don't need to REINDEX after pg_upgrade. * Documentation. The new compressed posting list format needs to be explained somewhere. At the top of ginpostinglist.c would be a good place. The README should be improved too. * Fix whatever I broke (sorry) Are we committed to go ahead with this patch in 9.4 timeframe, in one form or another? If we are, then I'd like to start committing refactoring patches that just move code around in preparation for the Patch, to make the review of the core of the patch easier. For example, merging the isEnoughSpace and placeToPage methods in the ginbtree interface. Without the patch, it's unnecessary code churn - it won't do any harm but it won't do any good either. I'm pretty confident that this patch will land in the 9.4 timeframe, so barring objections I'll start committing such refactorings. - Heikki gin-packed-postinglists-10-heikki.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
You linked to this email from the commitfest entry, but there is no patch here. You probably meant a different email; please check.

On Tue, 2013-10-08 at 21:48 +0300, Heikki Linnakangas wrote: On 04.10.2013 14:13, Alexander Korotkov wrote: On Fri, Oct 4, 2013 at 12:28 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: In the attached patch, I in fact already did that for data leaf pages, but didn't change the format of non-leaf pages yet. If we want to support pg_upgrade, we might want to refrain from changing the non-leaf format.

In GinDataLeafPageGetPostingList* you use sizeof(ItemPointerData) without MAXALIGN. Is it an error, or do you deliberately use 2 extra bytes on the leaf page?

I didn't even think of it. Now that I do think of it, I don't see a reason to use MAXALIGN there. PostingItems only require 2-byte alignment. It's a bit fragile and underdocumented, though. It would probably be good to have a struct to represent that layout. Something like:

    struct
    {
        ItemPointerData rightBound;
        PostingItem postingItems[1];    /* variable length array */
    } PostingItemPageContent;

And use that struct in the macros. Then again, we do currently use MAXALIGN there, so if we want to avoid changing the on-disk format, we have to keep it...

- Heikki
Re: [HACKERS] GIN improvements part 1: additional information
Hi, Tomas! On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondra t...@fuzzy.cz wrote: I've attempted to rerun the benchmark tests I did a few weeks ago, but I got repeated crashes when loading the data (into a table with tsvector+gin index). Right before a crash, there's this message in the log: PANIC: not enough space in leaf page!

Thanks for testing. Heikki's version of the patch doesn't work for me either, even on much simpler examples. I can try to get it working once he answers my question about the GinDataLeafPageGetPostingList* macros.

-- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 04.10.2013 14:13, Alexander Korotkov wrote: On Fri, Oct 4, 2013 at 12:28 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: In the attached patch, I in fact already did that for data leaf pages, but didn't change the format of non-leaf pages yet. If we want to support pg_upgrade, we might want to refrain from changing the non-leaf format.

In GinDataLeafPageGetPostingList* you use sizeof(ItemPointerData) without MAXALIGN. Is it an error, or do you deliberately use 2 extra bytes on the leaf page?

I didn't even think of it. Now that I do think of it, I don't see a reason to use MAXALIGN there. PostingItems only require 2-byte alignment. It's a bit fragile and underdocumented, though. It would probably be good to have a struct to represent that layout. Something like:

    struct
    {
        ItemPointerData rightBound;
        PostingItem postingItems[1];    /* variable length array */
    } PostingItemPageContent;

And use that struct in the macros. Then again, we do currently use MAXALIGN there, so if we want to avoid changing the on-disk format, we have to keep it...

- Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On 08.10.2013 17:47, Alexander Korotkov wrote: Hi, Tomas! On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondrat...@fuzzy.cz wrote: I've attempted to rerun the benchmarks tests I did a few weeks ago, but I got repeated crashes when loading the data (into a table with tsvector+gin index). Right before a crash, theres this message in the log: PANIC: not enough space in leaf page! Thanks for testing. Heikki's version of patch don't works for me too on even much more simplier examples. I can try to get it working if he answer my question about GinDataLeafPageGetPostingList* macros. The new macros in that patch version were quite botched. Here's a new attempt. - Heikki gin-packed-postinglists-9-heikki.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On 8.10.2013 21:59, Heikki Linnakangas wrote: On 08.10.2013 17:47, Alexander Korotkov wrote: Hi, Tomas! On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondrat...@fuzzy.cz wrote: I've attempted to rerun the benchmarks tests I did a few weeks ago, but I got repeated crashes when loading the data (into a table with tsvector+gin index). Right before a crash, theres this message in the log: PANIC: not enough space in leaf page! Thanks for testing. Heikki's version of patch don't works for me too on even much more simplier examples. I can try to get it working if he answer my question about GinDataLeafPageGetPostingList* macros. The new macros in that patch version were quite botched. Here's a new attempt. Nope, still the same errors :-( PANIC: not enough space in leaf page! LOG: server process (PID 29722) was terminated by signal 6: Aborted DETAIL: Failed process was running: autovacuum: ANALYZE public.messages regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
Aside from the pg_upgrade discussion, here's an updated version of the patch, rebased over master. It also contains some other misc refactoring I've done while reading through the patch. I haven't tested this much, I may well have also broken something, but I wanted to post an update before the weekend. Thinking about the page format, I think we should start using the pd_lower/upper pointers in the data page format. For a non-leaf page, pd_upper would always point to the beginning of the special area, and pd_lower would indicate the end of PostingItems. For a leaf page, pd_lower would indicate the end of the compressed posting list, and pd_upper would point to the leaf-index at the end of the page. That matches the standard page layout in the sense that the space between pd_lower and pd_upper is free, although the data stored in the non-free areas would be quite different. That would allow us to mark full-page images with buffer_std, allowing the gap to be left out. I think that would be a more natural way to keep track of the used/unused space on the page, anyway, compared to the current maxoff/endoffset field in the special area. In the attached patch, I in fact already did that for data leaf pages, but didn't change the format of non-leaf pages yet. If we want to support pg_upgrade, we might want to refrain from changing the non-leaf format. - Heikki gin-packed-postinglists-8-heikki.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Oct 4, 2013 at 12:28 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Aside from the pg_upgrade discussion, here's an updated version of the patch, rebased over master. It also contains some other misc refactoring I've done while reading through the patch. I haven't tested this much, I may well have also broken something, but I wanted to post an update before the weekend. Thinking about the page format, I think we should start using the pd_lower/upper pointers in the data page format. For a non-leaf page, pd_upper would always point to the beginning of the special area, and pd_lower would indicate the end of PostingItems. For a leaf page, pd_lower would indicate the end of the compressed posting list, and pd_upper would point to the leaf-index at the end of the page. That matches the standard page layout in the sense that the space between pd_lower and pd_upper is free, although the data stored in the non-free areas would be quite different. That would allow us to mark full-page images with buffer_std, allowing the gap to be left out. I think that would be a more natural way to keep track of the used/unused space on the page, anyway, compared to the current maxoff/endoffset field in the special area. In the attached patch, I in fact already did that for data leaf pages, but didn't change the format of non-leaf pages yet. If we want to support pg_upgrade, we might want to refrain from changing the non-leaf format. In GinDataLeafPageGetPostingList* you use sizeof(ItemPointerData) without MAXALIGN. Is it an error or you especially use 2 extra bytes on leaf page? -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 23.09.2013 18:35, Bruce Momjian wrote: On Sun, Sep 15, 2013 at 01:14:45PM +0400, Alexander Korotkov wrote: On Sat, Jun 29, 2013 at 12:56 PM, Heikki Linnakangashlinnakan...@vmware.com wrote: There's a few open questions: 1. How are we going to handle pg_upgrade? It would be nice to be able to read the old page format, or convert on-the-fly. OTOH, if it gets too complicated, might not be worth it. The indexes are much smaller with the patch, so anyone using GIN probably wants to rebuild them anyway, sooner or later. Still, I'd like to give it a shot. We have broken pg_upgrade index compatibility in the past. Specifically, hash and GIN index binary format changed from PG 8.3 to 8.4. I handled it by invalidating the indexes and providing a post-upgrade script to REINDEX all the changed indexes. The user message is: Your installation contains hash and/or GIN indexes. These indexes have different internal formats between your old and new clusters, so they must be reindexed with the REINDEX command. The file: ... when executed by psql by the database superuser will recreate all invalid indexes; until then, none of these indexes will be used. It would be very easy to do this from a pg_upgrade perspective. However, I know there has been complaints from others about making pg_upgrade more restrictive. In this specific case, even if you write code to read the old file format, we might want to create the REINDEX script to allow _optional_ reindexing to shrink the index files. If we do require the REINDEX, --check will clearly warn the user that this will be required. It seems we've all but decided that we'll require reindexing GIN indexes in 9.4. Let's take the opportunity to change some other annoyances with the current GIN on-disk format: 1. There's no explicit page id field in the opaque struct, like there is in other index types. This is for the benefit of debugging tools like pg_filedump. 
We've managed to tell GIN pages apart from other index types by the fact that the special size of GIN pages is 8 and it's not using all the high-order bits in the last byte on the page. But an explicit page id field would be nice, so let's add that.

2. I'd like to change the way incomplete splits are handled. Currently, WAL recovery keeps track of incomplete splits, and fixes any that remain at the end of recovery. That concept is slightly broken; it's not guaranteed that after you've split a leaf page, for example, you will succeed in inserting the downlink to its parent. You might e.g. run out of disk space. To fix that, I'd like to add a flag to the page header to indicate whether the split has been completed, i.e. whether the page's downlink has been inserted into the parent, and fix them lazily on the next insert. I did a similar change to GiST back in 9.1. (Strictly speaking this doesn't require changing the on-disk format, though.)

3. I noticed that the GIN b-trees, the main key entry tree and the posting trees, use a slightly different arrangement of the downlink than our regular nbtree code does. In nbtree, the downlink for a page is the *low* key of that page, i.e. if the downlink is 10, all the items on that child page must be >= 10. But in GIN, we store the *high* key in the downlink, i.e. all the items on the child page must be <= 10. That makes inserting new downlinks at a page split slightly more complicated. For example, when splitting a page containing keys between 1-10 into 1-5 and 5-10, you need to insert a new downlink with key 10 for the new right page, and also update the existing downlink to 5. The nbtree code doesn't require updating existing entries.

Anything else?

- Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On Thu, Oct 3, 2013 at 2:43 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: It seems we've all but decided that we'll require reindexing GIN indexes in 9.4. I thought the consensus in Ottawa was strongly against that. I'm not aware that anyone has subsequently changed their position on the topic. Bruce is right to point out that we've done such things before and can therefore do it again, but just because we have the technical means to do it doesn't make it good policy. That having been said, if we do decide to break it... Let's take the opportunity to change some other annoyances with the current GIN on-disk format: ...then fixing as much as possible in one go-round is clearly a good plan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Thu, Oct 3, 2013 at 10:48 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Oct 3, 2013 at 2:43 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: It seems we've all but decided that we'll require reindexing GIN indexes in 9.4. I thought the consensus in Ottawa was strongly against that. I'm not aware that anyone has subsequently changed their position on the topic. Bruce is right to point out that we've done such things before and can therefore do it again, but just because we have the technical means to do it doesn't make it good policy. That having been said, if we do decide to break it... Let's take the opportunity to change some other annoyances with the current GIN on-disk format: ...then fixing as much as possible in one go-round is clearly a good plan.

Let's see what options we have at all. I see the following:

1) Drop support for the old GIN on-disk format. But users will have to reindex after pg_upgrade.
2) Insert kluges into GIN to support both the old and new formats. But kluges are kluges :) I don't see an elegant way to do it for now, because the formats are very different.
3) Upgrade the GIN on-disk format in pg_upgrade. However, that would rewrite almost the whole index. Is it much better than just a reindex?
4) Fork GIN2 and leave GIN as is. It would lead to a lot of duplicated code.

Any other options?

-- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On 03.10.2013 23:37, Alexander Korotkov wrote: 2) Insert kluges into GIN to support both old and new formats. So, kluges are kluges :) I don't see elegant way to do it for now, because formats are very different. Hmm. All you need is some code to read the old format, and a function to convert a page to new format before updating. It doesn't seem *that* kludgey. It's a fair amount of work, for sure, but not insurmountable. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Oct 4, 2013 at 12:41 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 03.10.2013 23:37, Alexander Korotkov wrote: 2) Insert kluges into GIN to support both old and new formats. So, kluges are kluges :) I don't see elegant way to do it for now, because formats are very different. Hmm. All you need is some code to read the old format, and a function to convert a page to new format before updating. It doesn't seem *that* kludgey. It's a fair amount of work, for sure, but not insurmountable.

My point was not so much about the amount of work as about the result. ItemPointer compression reduces the occupied space in all normal cases. It's not very realistic, but it could increase the space in the worst case. That would lead to a page split after conversion. Are we going to support such a case?

-- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On Thu, Oct 3, 2013 at 02:48:20PM -0400, Robert Haas wrote: On Thu, Oct 3, 2013 at 2:43 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: It seems we've all but decided that we'll require reindexing GIN indexes in 9.4. I thought the consensus in Ottawa was strongly against that. I'm not aware that anyone has subsequently changed their position on the topic. Bruce is right to point out that we've done such things before and can therefore do it again, but just because we have the technical means to do it doesn't make it good policy. That having been said, if we do decide to break it...

Agreed. I was stating only that this is easy for pg_upgrade. One cool thing is that the upgrade completes, and the indexes are there, but just marked as invalid until the REINDEX. One other point Alexander made is that the new GIN indexes will be smaller, so most people would want the new format in the new cluster anyway.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] GIN improvements part 1: additional information
Bruce Momjian escribió: Agreed. I was stating only that this is easy for pg_upgrade. One cool thing is that the upgrades completes, and the indexes are there, but just marked as invalid until the REINDEX. One other point Alexander made is that the new GIN indexes will be smaller so most people would want the new format in the new cluster anyway. But they're nonfunctional until after the reindex, which is bad for people who want a quick upgrade and return to operational mode immediately. If you could just keep the old indexes around, in working state, until they are REINDEX CONCURRENTLY'ed, that would be more practical than just marking them invalid. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On 03.10.2013 23:54, Alexander Korotkov wrote: ItemPointers compression reduce occupied space in all normal cases. It's not very realistic, but it could increase space in worst case. That would lead to page split after conversion. Are we going to support such case?

Hmm, that's probably rare enough that the number of such indexes in the real world where that could happen is exactly 0. A compressed item requires 7 bytes in the worst case; that is an offset of 127, and a distance to the previous item of 2^(4*7) = 268435456 blocks. With the default block size, that requires an index larger than 2TB. And that's just for one such item to appear - to actually cause a page to overflow, the page would need to be full of other items widely apart from each other that take up 6 bytes each.

So I think if you can make the conversion work with the assumption that the compressed format always fits in the old space, and throw an error if it doesn't, that's good enough. (That's for the posting trees - the posting lists attached to entry tuples are a different story.) Besides, if you convert the page when you insert to it, you might need to split it anyway. So it might not be very difficult to split if required.

IMHO the main argument for not bothering with pg_upgrade is that the gain from the patch is so great that you'll want to REINDEX after the upgrade anyway, to shrink the index. I really don't have an opinion on whether we should attempt reading the old format. On one hand, it would be really nice not to have that caveat when you pg_upgrade (oh, you have GIN indexes, you have to reindex..). On the other hand, supporting the old format is a fair amount of extra code to maintain.

- Heikki
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Oct 4, 2013 at 12:37 AM, Alexander Korotkov aekorot...@gmail.comwrote: On Thu, Oct 3, 2013 at 10:48 PM, Robert Haas robertmh...@gmail.comwrote: On Thu, Oct 3, 2013 at 2:43 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: It seems we've all but decided that we'll require reindexing GIN indexes in 9.4. I thought the consensus in Ottawa was strongly against that. I'm not aware that anyone has subsequently changed their position on the topic. Bruce is right to point out that we've done such things before and can therefore do it again, but just because we have the technical means to do it doesn't make it good policy. That having been said, if we do decide to break it... Let's take the opportunity to change some other annoyances with the current GIN on-disk format: ...then fixing as much as possible in one go-round is clearly a good plan. Let's see what options we have at all. I see following: 1) Drop support old GIN on-disk format. But users will have to reindex after pg_upgrade. 2) Insert kluges into GIN to support both old and new formats. So, kluges are kluges :) I don't see elegant way to do it for now, because formats are very different. 3) Upgrade GIN on-disk format in pg_upgrade. However, it would be rewriting almost whole index. Is it much better than just reindex? 4) Fork GIN2, leave GIN as is. It would lead to much of duplicated code. Any other options? I came to idea that I like option #4 more than option #2. If we try to make new GIN work with old page formats we have to maintain 3 use cases: 1) old GIN with old page format (because of old releases) 2) new GIN with old page format 3) new GIN with new page format If we create GIN2 we maintain only 2 use cases: 1) old GIN with old page format 2) new GIN with new page format The code of old GIN would be additional code in 9.4, but not additional code we maintain. Because we anyway maintain exactly same in old releases. The problem I see is how to migrate users to GIN2. 
We can't expect them to read the release notes, create GIN2 indexes, and drop the GIN1 indexes. A lot of users will just keep using GIN1, because they don't care :) Ideally, any new GIN index would be GIN2, and a reindex would turn GIN1 into GIN2. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Oct 4, 2013 at 02:23:33AM +0400, Alexander Korotkov wrote: I came to idea that I like option #4 more than option #2. If we try to make new GIN work with old page formats we have to maintain 3 use cases: 1) old GIN with old page format (because of old releases) 2) new GIN with old page format 3) new GIN with new page format If we create GIN2 we maintain only 2 use cases: 1) old GIN with old page format 2) new GIN with new page format The code of old GIN would be additional code in 9.4, but not additional code we maintain. Because we anyway maintain exactly same in old releases. The problem I see is how to migrate users to GIN2. We can't expect they read release notes, create GIN2 indexes and drop GIN1 indexes. A lot of users will still use GIN1, because of they don't care :) Ideally any new GIN index should be GIN2 and reindex turns GIN1 into GIN2. I am not sure I like the complexity of a GIN2, but we should give this problem some serious thought as it will affect how we deal with other on-disk index changes in the future. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Sep 25, 2013 at 5:22 PM, Peter Eisentraut pete...@gmx.net wrote: On 9/23/13 5:36 PM, Alexander Korotkov wrote: In the attached version of the patch, double lookup of the ItemPointer during insert is avoided. Overhead becomes lower, as expected.

Fails cpluspluscheck: ./src/include/access/gin_private.h: In function ‘char* ginDataPageLeafReadItemPointer(char*, ItemPointer)’: ./src/include/access/gin_private.h:797:2: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]

Thanks. Fixed in the attached version of the patch. -- With best regards, Alexander Korotkov.

Attachment: gin-packed-postinglists-7.patch.gz
Re: [HACKERS] GIN improvements part 1: additional information
On 9/23/13 5:36 PM, Alexander Korotkov wrote: In the attached version of the patch, double lookup of the ItemPointer during insert is avoided. Overhead becomes lower, as expected.

Fails cpluspluscheck: ./src/include/access/gin_private.h: In function ‘char* ginDataPageLeafReadItemPointer(char*, ItemPointer)’: ./src/include/access/gin_private.h:797:2: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
Re: [HACKERS] GIN improvements part 1: additional information
On Sun, Sep 15, 2013 at 01:14:45PM +0400, Alexander Korotkov wrote: On Sat, Jun 29, 2013 at 12:56 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: There's a few open questions: 1. How are we going to handle pg_upgrade? It would be nice to be able to read the old page format, or convert on-the-fly. OTOH, if it gets too complicated, might not be worth it. The indexes are much smaller with the patch, so anyone using GIN probably wants to rebuild them anyway, sooner or later. Still, I'd like to give it a shot.

We have broken pg_upgrade index compatibility in the past. Specifically, the hash and GIN index binary format changed from PG 8.3 to 8.4. I handled it by invalidating the indexes and providing a post-upgrade script to REINDEX all the changed indexes. The user message is:

    Your installation contains hash and/or GIN indexes. These indexes have different internal formats between your old and new clusters, so they must be reindexed with the REINDEX command. The file: ... when executed by psql by the database superuser will recreate all invalid indexes; until then, none of these indexes will be used.

It would be very easy to do this from a pg_upgrade perspective. However, I know there have been complaints from others about making pg_upgrade more restrictive. In this specific case, even if you write code to read the old file format, we might want to create the REINDEX script anyway, to allow _optional_ reindexing to shrink the index files. If we do require the REINDEX, --check will clearly warn the user that it will be required.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
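The post-upgrade script Bruce describes works because pg_upgrade marks the affected indexes invalid, and invalid indexes are visible in the pg_index catalog. A sketch of the mechanism (not the actual script pg_upgrade emits) that generates a REINDEX command for every invalid index:

```sql
-- List indexes currently marked invalid and emit a rebuild command
-- for each; run the output through psql as the superuser.
SELECT 'REINDEX INDEX ' || indexrelid::regclass || ';'
FROM pg_index
WHERE NOT indisvalid;
```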
Re: [HACKERS] GIN improvements part 1: additional information
On Mon, Sep 23, 2013 at 12:47 AM, Alexander Korotkov aekorot...@gmail.com wrote: It's probably an option to select 64 entries instead of 32. There is still some regression in update speed. However, there is also room for improving the patch: it searches item index entries twice on insert, in dataLocateLeafItem and in dataPlaceToPage. We could save the full results of dataLocateLeafItem, but that requires some rework of the gin btree interface, to store not only the offset of the item.

In the attached version of the patch, double lookup of the ItemPointer during insert is avoided. Overhead becomes lower, as expected.

 event                 | master          | 16-entries      | 32-entries      | 64-entries      | 128-entries
-----------------------+-----------------+-----------------+-----------------+-----------------+-----------------
 index_build           | 00:01:50.042658 | 00:01:54.130873 | 00:01:59.37302  | 00:01:55.959693 | 00:01:58.126407
 index_build_recovery  | 00:00:19        | 00:00:06        | 00:00:06        | 00:00:06        | 00:00:06
 index_update          | 00:05:18.215707 | 00:05:38.40231  | 00:05:30.658786 | 00:05:27.664312 | 00:05:30.815876
 index_update_recovery | 00:01:48        | 00:01:53        | 00:01:50        | 00:01:44        | 00:01:46
 search_new            | 00:25:21.481699 | 00:23:20.324152 | 00:24:02.120438 | 00:22:50.989723 | 00:23:05.703824
 search_updated        | 00:25:57.622592 | 00:26:43.531979 | 00:26:08.003085 | 00:24:36.669028 | 00:26:09.175243

 label         | size (master) | 16-entries | 32-entries | 64-entries | 128-entries
---------------+---------------+------------+------------+------------+-------------
 new           | 884514816     | 417013760  | 421240832  | 430350336  | 450994176
 after_updates | 1595252736    | 711368704  | 719380480  | 735682560  | 774275072

-- With best regards, Alexander Korotkov.

Attachment: gin-packed-postinglists-6.patch.gz