Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Mar 6, 2020 at 11:00 AM Andres Freund wrote: > > Pushed. > > Congrats! Thanks Andres! -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 2020-02-26 14:43:27 -0800, Peter Geoghegan wrote: > On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan wrote: > > Attached is v34, which has this change. My plan is to commit something > > very close to this on Wednesday morning (barring any objections). > > Pushed. Congrats!
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Feb 26, 2020 at 10:03 PM Fujii Masao wrote: > Thanks for committing this nice feature! You're welcome! > Here is one minor comment. > > + deduplicate_items > + storage parameter > > This should be > > deduplicate_items storage parameter I pushed a fix for this. Thanks -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 2020/02/27 7:43, Peter Geoghegan wrote:
> On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan wrote:
>> Attached is v34, which has this change. My plan is to commit something
>> very close to this on Wednesday morning (barring any objections).
>
> Pushed.

Thanks for committing this nice feature! Here is one minor comment.

+ deduplicate_items
+ storage parameter

This should be

deduplicate_items storage parameter

because that markup for a reloption is necessary only when a GUC parameter with the same name as the reloption exists. So, for example, you can see that it is used for vacuum_cleanup_index_scale_factor but not for the buffering reloption.

Regards,

-- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan wrote: > Attached is v34, which has this change. My plan is to commit something > very close to this on Wednesday morning (barring any objections). Pushed. I'm going to delay committing the pageinspect patch until tomorrow, since I haven't thought about that aspect of the project in a while. Seems like a good idea to go through it one more time, once it's clear that the buildfarm is stable. The buildfarm appears to be stable now, though there was an issue with a compiler warning earlier. I quickly pushed a fix for that, and can see that longfin is green/passing now. Thanks for sticking with this project, Anastasia. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Feb 20, 2020 at 10:58 AM Peter Geoghegan wrote: > I think that there is a good chance that it just won't matter. The > number of indexes that won't be able to support deduplication will be > very small in practice. The important exceptions are INCLUDE indexes > and nondeterministic collations. These exceptions make sense > intuitively, and will be documented as limitations of those other > features. I wasn't clear about the implication of what I was saying here, which is: I will make the NOTICE a DEBUG1 message, and leave everything else as-is in the initial committed version. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Feb 20, 2020 at 7:38 AM Anastasia Lubennikova wrote: > I don't think this patch really needs more nitpicking ) But when has that ever stopped it? :-) > User can discover this with a complex query to pg_index and pg_opclass. > To simplify this, we can probably wrap this into function or some field > in pg_indexes. A function isn't a real user interface, though -- it probably won't be noticed. I think that there is a good chance that it just won't matter. The number of indexes that won't be able to support deduplication will be very small in practice. The important exceptions are INCLUDE indexes and nondeterministic collations. These exceptions make sense intuitively, and will be documented as limitations of those other features. The numeric/float thing doesn't really make intuitive sense, and numeric is an important datatype. Still, numeric columns and float columns seem to rarely get indexed. That just leaves container type opclasses, like anyarray and jsonb. Nobody cares about indexing container types with a B-Tree index, with the possible exception of expression indexes on a jsonb column. I don't see a way around that, but it doesn't seem all that important. Again, applications are unlikely to have more than one or two of those. The *overall* space saving will probably be almost as good as if the limitation did not exist. > Anyway, I would wait for feedback from pre-release testers. Right -- let's delay making a final decision on it. Just like the decision to enable it by default. It will work this way in the committed version, but that isn't supposed to be the final word on it. -- Peter Geoghegan
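The numeric exception mentioned above comes down to "datum image" equality being stricter than comparator equality: two numeric values can compare as equal while having different binary representations (different display scale), so one cannot silently replace the other during deduplication. Python's Decimal preserves display scale in a loosely analogous way, which makes for a rough illustration -- this is a toy sketch, not PostgreSQL code, and equalimage_safe() is an invented name:

```python
from decimal import Decimal

# Two values that compare as equal but carry different display scale --
# their "images" differ, so substituting one for the other (as
# deduplication effectively would) could change query output.
a = Decimal("1.0")
b = Decimal("1.00")

assert a == b            # equal according to the comparator...
assert str(a) != str(b)  # ...but not byte-for-byte interchangeable

def equalimage_safe(x: Decimal, y: Decimal) -> bool:
    """True only if x could silently stand in for y (and vice versa)."""
    return str(x) == str(y)

print(equalimage_safe(a, b))  # False: deduplicating these would lose scale
```

B-Tree opclasses whose equality really is bitwise (integers, text under deterministic collations, and so on) have no such hazard, which is why deduplication support is the rule rather than the exception.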
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 19.02.2020 22:16, Peter Geoghegan wrote:
> On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova wrote:
>> Thank you for this work. I've looked through the patches and they seem
>> to be ready for commit.
>> I haven't yet read recent documentation and readme changes, so maybe
>> I'll send some more feedback tomorrow.

The only thing I found is a typo in the comment

+ int nhtids; /* Number of heap TIDs in nhtids array */

s/nhtids/htids

I don't think this patch really needs more nitpicking )

>> In my opinion, this message is too specific for default behavior. It
>> exposes internal details without explanation and may look to user like
>> something went wrong.
>
> You're probably right about that. I just wish that there was some way
> of showing the same information that was discoverable, and didn't
> require the use of pageinspect. If I make it a DEBUG1 message, then it
> cannot really be documented.

User can discover this with a complex query to pg_index and pg_opclass. To simplify this, we can probably wrap this into function or some field in pg_indexes.

Anyway, I would wait for feedback from pre-release testers.
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Feb 19, 2020 at 11:16 AM Peter Geoghegan wrote: > On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova > wrote: > > Thank you for this work. I've looked through the patches and they seem > > to be ready for commit. > > I haven't yet read recent documentation and readme changes, so maybe > > I'll send some more feedback tomorrow. > > Great. I should add: I plan to commit the patch within the next 7 days. I believe that the design of deduplication itself is solid; it has many more strengths than weaknesses. It works in a way that complements the existing approach to page splits. The optimization can easily be turned off (and easily turned back on again). contrib/amcheck can detect almost any possible form of corruption that could affect a B-Tree index that has posting list tuples. I have spent months microbenchmarking every little aspect of this patch in isolation. I've also spent a lot of time on conventional benchmarking. It seems quite possible that somebody won't like some aspect of the user interface. I am more than willing to work with other contributors on any issue in that area that comes to light. I don't see any point in waiting for other hackers to speak up before the patch is committed, though. Anastasia posted the first version of this patch in August of 2015, and there have been over 30 revisions of it since the project was revived in 2019. Everyone has been given ample opportunity to offer input. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova wrote: > Thank you for this work. I've looked through the patches and they seem > to be ready for commit. > I haven't yet read recent documentation and readme changes, so maybe > I'll send some more feedback tomorrow. Great. > Is there any specific reason, why we need separate btnameequalimage, > bpchar_equalimage and bttextequalimage functions? > As far as I see, they have the same implementation. Not really. This approach allows us to reverse the decision to enable deduplication in a point release, which is theoretically useful. OTOH, if that's so important, why not have many more support function 4 implementations (one per opclass)? I suspect that we would just disable deduplication in a hard-coded fashion if we needed to disable it due to some issue that transpired. For example, we could do this by modifying _bt_allequalimage() itself. > I would simply move it to debug level for all cases. Since from user's > perspective it doesn't differ that much from the case where > deduplication is applicable in general, but not very efficient due to > data distribution. I was more concerned about cases where the user would really like to use deduplication, but wants to make sure that it gets used. And doesn't want to install pageinspect to find out. > I also noticed that this is not consistent with ALTER index. For > example, alter index idx_n set (deduplicate_items =true); won't show any > message about deduplication. But that's a change in the user's preference. Not a change in whether or not it's safe in principle. > In my opinion, this message is too specific for default behavior. It > exposes internal details without explanation and may look to user like > something went wrong. You're probably right about that. I just wish that there was some way of showing the same information that was discoverable, and didn't require the use of pageinspect. If I make it a DEBUG1 message, then it cannot really be documented. 
-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 14.02.2020 05:57, Peter Geoghegan wrote:
> Attached is v33, which adds the last piece we need: opclass
> infrastructure that tells nbtree whether or not deduplication can be
> applied safely. This is based on work by Anastasia that was shared
> with me privately.

Thank you for this work. I've looked through the patches and they seem to be ready for commit. I haven't yet read recent documentation and readme changes, so maybe I'll send some more feedback tomorrow.

> New opclass proc
>
> In general, supporting deduplication is the rule for B-Tree opclasses,
> rather than the exception. Most can use the generic btequalimagedatum()
> routine as their support function 4, which unconditionally indicates
> that deduplication is safe. There is a new test that tries to catch
> opclasses that omitted to do this. Here is the opr_sanity.out changes
> added by the first patch:
>
> -- Almost all Btree opclasses can use the generic btequalimagedatum function
> -- as their equalimage proc (support function 4). Look for opclasses that
> -- don't do so; newly added Btree opclasses will usually be able to support
> -- deduplication with little trouble.
> SELECT amproc::regproc AS proc, opf.opfname AS opfamily_name,
>        opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype
> FROM pg_am am
> JOIN pg_opclass opc ON opc.opcmethod = am.oid
> JOIN pg_opfamily opf ON opc.opcfamily = opf.oid
> LEFT JOIN pg_amproc ON amprocfamily = opf.oid AND
>     amproclefttype = opcintype AND
>     amprocnum = 4
> WHERE am.amname = 'btree' AND
>     amproc IS DISTINCT FROM 'btequalimagedatum'::regproc
> ORDER BY amproc::regproc::text, opfamily_name, opclass_name;
>
>        proc        |  opfamily_name   |   opclass_name   |    opcintype
> -------------------+------------------+------------------+------------------
>  bpchar_equalimage | bpchar_ops       | bpchar_ops       | character
>  btnameequalimage  | text_ops         | name_ops         | name
>  bttextequalimage  | text_ops         | text_ops         | text
>  bttextequalimage  | text_ops         | varchar_ops      | text
>                    | array_ops        | array_ops        | anyarray
>                    | enum_ops         | enum_ops         | anyenum
>                    | float_ops        | float4_ops       | real
>                    | float_ops        | float8_ops       | double precision
>                    | jsonb_ops        | jsonb_ops        | jsonb
>                    | money_ops        | money_ops        | money
>                    | numeric_ops      | numeric_ops      | numeric
>                    | range_ops        | range_ops        | anyrange
>                    | record_image_ops | record_image_ops | record
>                    | record_ops       | record_ops       | record
>                    | tsquery_ops      | tsquery_ops      | tsquery
>                    | tsvector_ops     | tsvector_ops     | tsvector
> (16 rows)

Is there any specific reason, why we need separate btnameequalimage, bpchar_equalimage and bttextequalimage functions? As far as I see, they have the same implementation.

> Since using deduplication is supposed to pretty much be the norm from
> now on, it seemed like it might make sense to add a NOTICE about it
> during CREATE INDEX -- a notice letting the user know that it isn't
> being used due to a lack of opclass support:
>
> regression=# create table foo(bar numeric);
> CREATE TABLE
> regression=# create index on foo(bar);
> NOTICE:  index "foo_bar_idx" cannot use deduplication
> CREATE INDEX
>
> Note that this NOTICE isn't seen with an INCLUDE index, since that's
> expected to not support deduplication. I have a feeling that not
> everybody will like this, which is why I'm pointing it out. Thoughts?
I would simply move it to debug level for all cases, since from the user's perspective it doesn't differ that much from the case where deduplication is applicable in general, but not very efficient due to data distribution.

I also noticed that this is not consistent with ALTER INDEX. For example, alter index idx_n set (deduplicate_items = true); won't show any message about deduplication.

I've tried several combinations with an index on a numeric column:

1) postgres=# create index idx_nd on tbl (n) with (deduplicate_items = true);
NOTICE:  index "idx_nd" cannot use deduplication
CREATE INDEX

Here the message seems appropriate. I don't think we should restrict creation of the index even when the deduplicate_items parameter is set explicitly; rather, we may warn the user that it won't be efficient.

2) postgres=# create index idx_n on tbl (n) with (deduplicate_items = false);
NOTICE:  index "idx_n" cannot use deduplication
CREATE INDEX

In this case the message seems slightly strange to me. Why should we show a notice about the fact that deduplication is not possible if that is exactly what was requested?

3) postgres=# create index idx on tbl (n);
NOTICE:  index "idx" cannot use deduplication

In my opinion, this message is too specific for default behavior. It exposes internal details without explanation and may look to user like something went wrong.
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Jan 10, 2020 at 1:36 PM Peter Geoghegan wrote: > * HEIKKI: Do we only generate one posting list in one WAL record? I > would assume it's better to deduplicate everything on the page, since > we're modifying it anyway. Still thinking about this one. > * HEIKKI: Does xl_btree_vacuum WAL record store a whole copy of the > remaining posting list on an updated tuple? Wouldn't it be simpler and > more space-efficient to store just the deleted TIDs? This probably makes sense. The btreevacuumposting() code that generates "updated" index tuples (tuples that VACUUM uses to replace existing ones when some but not all of the TIDs need to be removed) was derived from GIN's ginVacuumItemPointers(). That approach works well enough, but we can do better now. It shouldn't be that hard. My preferred approach is a little different to your suggested approach of storing the deleted TIDs directly. I would like to make it work by storing an array of uint16 offsets into a posting list, one array per "updated" tuple (with one offset per deleted TID within each array). These arrays (which must include an array size indicator at the start) can appear in the xl_btree_vacuum record, at the same place the patch currently puts a raw IndexTuple. They'd be equivalent to a raw IndexTuple -- the REDO routine would reconstruct the same raw posting list tuple on its own. This approach seems simpler, and is clearly very space efficient. This approach is similar to the approach used by REDO routines to handle posting list splits. Posting list splits must call _bt_swap_posting() on the primary, while the corresponding REDO routines also call _bt_swap_posting(). For space efficient "updates", we'd have to invent a sibling utility function -- we could call it _bt_delete_posting(), and put it next to _bt_swap_posting() within nbtdedup.c. How do you feel about that approach? (And how do you feel about the existing "REDO routines call _bt_swap_posting()" business that it's based on?) 
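The offset-array idea can be made concrete with a toy model of the encoding: the WAL payload carries only uint16 offsets into the existing posting list, and the REDO side rebuilds the updated tuple from the original page contents plus those offsets. This is an illustrative Python sketch under assumed layouts -- the helper names and the payload format are invented for the example, not the actual xl_btree_vacuum record or _bt_delete_posting():

```python
import struct
from typing import List, Set, Tuple

TID = Tuple[int, int]  # (block number, offset number) -- simplified heap TID

def encode_deletions(posting: List[TID], deleted: Set[TID]) -> bytes:
    """WAL payload: a uint16 count followed by uint16 offsets into the
    posting list, one per deleted TID (illustrative layout only)."""
    offsets = [i for i, tid in enumerate(posting) if tid in deleted]
    return struct.pack(f"<H{len(offsets)}H", len(offsets), *offsets)

def redo_delete_posting(posting: List[TID], payload: bytes) -> List[TID]:
    """REDO side: reconstruct the updated posting list from the original
    tuple plus the offset array, rather than re-logging the whole tuple."""
    (n,) = struct.unpack_from("<H", payload)
    offsets = set(struct.unpack_from(f"<{n}H", payload, 2))
    return [tid for i, tid in enumerate(posting) if i not in offsets]

posting = [(1, 1), (1, 2), (2, 5), (3, 7)]
payload = encode_deletions(posting, {(1, 2), (3, 7)})
assert redo_delete_posting(posting, payload) == [(1, 1), (2, 5)]
# 6 bytes logged here, versus a whole copy of the remaining posting list.
print(len(payload))  # -> 6
```

The space advantage grows with posting list size: the payload stays at 2 bytes per deleted TID regardless of how many TIDs remain in the tuple.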
> * HEIKKI: Would it be more clear to have a separate struct for the > posting list split case? (i.e. don't reuse xl_btree_insert) I've concluded that this one probably isn't worthwhile. We'd have to carry a totally separate record on the stack within _bt_insertonpg(). If you feel strongly about it, I will reconsider. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Jan 8, 2020 at 5:56 AM Heikki Linnakangas wrote:
> On 04/01/2020 03:47, Peter Geoghegan wrote:
>> Attached is v28, which fixes bitrot from my recent commits to refactor
>> VACUUM-related code in nbtpage.c.
>
> I started to read through this gigantic patch.

Oh come on, it's not that big. :-)

> I got about 1/3 way through. I wrote minor comments directly in the
> attached patch file, search for "HEIKKI:". I wrote them as I read the
> patch from beginning to end, so it's possible that some of my questions
> are answered later in the patch. I didn't have the stamina to read
> through the whole patch yet, I'll continue later.

Thanks for the review! Anything that you've written that I do not respond to directly can be assumed to have been accepted by me. I'll start with responses to the points that you raise in your patch that need a response.

Patch comments
==============

* "Furthermore, deduplication can be turned on or off as needed, or applied" -- HEIKKI: When would it be needed?

I believe that hardly anybody will want to turn off deduplication in practice. My point here is that we're flexible -- we're not maintaining posting lists like GIN. We're just deduplicating as and when needed. We can change our preference about that any time. Turning off deduplication won't magically undo past deduplications, of course, but everything mostly works in the same way when deduplication is on or off. We're just talking about an alternative physical representation of the same logical contents.

* HEIKKI: How do LP_DEAD work on posting list tuples?

Same as before, except that it applies to all TIDs in the tuple together (will mention this in commit message, though). Note that the fact that we delay deduplication also means that we delay merging the LP_DEAD bits. And we always prefer to remove LP_DEAD items.
Finally, we refuse to do a posting list split when its LP_DEAD bit is set, so it's now possible to delete LP_DEAD bit set tuples a little early, before a page split has to be avoided -- see the new code and comments at the end of _bt_findinsertloc(). See also: my later response to your e-mail remarks on LP_DEAD bits, unique indexes, and space accounting.

* HEIKKI: When is it [deduplication] not safe?

With opclasses like btree/numeric_ops, where display scale messes things up. See this thread for more information on the infrastructure that we need for that:

https://www.postgresql.org/message-id/flat/cah2-wzn3ee49gmxb7v1vj3-ac8fwn-fr8pfwqebhe8ryrxt...@mail.gmail.com

* HEIKKI: Why is it safe to read on version 3 indexes? Because unused space is set to zeros?

Yes. Same applies to version 4 indexes that come from Postgres 12 -- users must REINDEX to call _bt_opclasses_support_dedup() and set metapage field, but we can rely on the new field being all zeroes before that happens. (It would be possible to teach pg_upgrade to set the field for compatible indexes from Postgres 12, but I don't want to bother with that. We probably cannot safely call _bt_opclasses_support_dedup() with a buffer lock held, so that seems like the only way.)

* HEIKKI: Do we need it as a separate flag, isn't it always safe with version 4 indexes, and never with version 3?

No, it isn't *always* safe with version 4 indexes, for reasons that have nothing to do with the on-disk representation (like the display scale issue, nondeterministic collations, etc). It really is a distinct condition. (Deduplication is never safe with version 3 indexes, obviously.)

It occurs to me now that we probably don't even want to make the metapage field about deduplication (though that's what it says right now). Rather, it should be about supporting a general category of optimizations that include deduplication, and might also include prefix compression in the future.
Note that whether or not we should actually apply these optimizations is always a separate question.

* "+ * Non-pivot tuples complement pivot tuples, which only have key columns." -- HEIKKI: What does it mean that they complement pivot tuples?

It means that all tuples are either pivot tuples, or are non-pivot tuples.

* "+ * safely (index storage parameter separately indicates if deduplication is" -- HEIKKI: Is there really an "index storage parameter" for that? What is that, something in the WITH clause?

Yes, there is actually an index storage parameter named "deduplication" (something in the WITH clause). This is deliberately not named "btree_deduplication", the current name of the GUC. This exists to make the optimization controllable at the index level. (Though I should probably mention the GUC first in this code comment, or not even mention the less significant storage parameter.)

* HEIKKI: How much memory does this [BTScanPosData.items array of width MaxBTreeIndexTuplesPerPage] need now? Should we consider pallocing this separately?

But BTScanPosData isn't allocated on the stack anyway.

* HEIKKI: Would it be more clear to have a separate struct for the posting list split case?
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Dec 12, 2019 at 6:21 PM Peter Geoghegan wrote: > Still waiting for some review of the first patch, to get it out of the > way. Anastasia? I plan to commit this first patch [1] in the next day or two, barring any objections. It's clear that the nbtree "pin scan" VACUUM code is totally unnecessary -- it really should have been fully removed by commit 3e4b7d87 back in 2016. [1] https://www.postgresql.org/message-id/flat/CAH2-WzkWLRDzCaxsGvA_pZoaix_2AC9S6%3D-D6JMLkQYhqrJuEg%40mail.gmail.com#daed349a71ff9d7ac726cc0e3e01a436 -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Dec 17, 2019 at 5:18 PM Bruce Momjian wrote: > On Tue, Dec 17, 2019 at 03:30:33PM -0800, Peter Geoghegan wrote: > > With many real world unique indexes, the true reason behind most or > > all B-Tree page splits is "version churn". I view these page splits as > > a permanent solution to a temporary problem -- we *permanently* > > degrade the index structure in order to deal with a *temporary* burst > > in versions that need to be stored. That's really bad. > > Yes, I was thinking why do we need to optimize duplicates in a unique > index but then remembered is a version problem. The whole idea of deduplication in unique indexes is hard to explain. It just sounds odd. Also, it works using the same infrastructure as regular deduplication, while having rather different goals. Fortunately, it seems like we don't really have to tell users about it in order for them to see a benefit -- there will be no choice for them to make there (they just get it). The regular deduplication stuff isn't confusing at all, though. It has some noticeable though small downside, so it will be documented and configurable. (I'm optimistic that it can be enabled by default, because even with high cardinality non-unique indexes the downside is rather small -- we waste some CPU cycles just before a page is split.) -- Peter Geoghegan
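The regular deduplication mechanism referred to here -- collapsing runs of tuples with equal keys into a single "posting list" tuple that stores the key once plus all of its heap TIDs -- can be modeled in a few lines. This is a toy sketch of the transformation only, not the C implementation; the real code also has to respect tuple size limits and the page's physical layout:

```python
from itertools import groupby

def deduplicate_page(items):
    """Collapse runs of equal keys into (key, [TIDs]) posting tuples,
    the way nbtree deduplication rewrites a leaf page just before it
    would otherwise have to split. Toy model: items are (key, tid)
    pairs, already in key order as they would be on a leaf page."""
    out = []
    for key, group in groupby(items, key=lambda kv: kv[0]):
        tids = [tid for _, tid in group]
        out.append((key, tids))  # the key is now stored only once
    return out

page = [("abc", 1), ("abc", 2), ("abc", 3), ("xyz", 4)]
assert deduplicate_page(page) == [("abc", [1, 2, 3]), ("xyz", [4])]
```

This also shows why the downside for high cardinality indexes is small: when every key is distinct, the pass produces one single-TID "posting tuple" per input tuple and saves nothing, costing only the CPU cycles spent discovering that.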
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Dec 17, 2019 at 03:30:33PM -0800, Peter Geoghegan wrote: > With many real world unique indexes, the true reason behind most or > all B-Tree page splits is "version churn". I view these page splits as > a permanent solution to a temporary problem -- we *permanently* > degrade the index structure in order to deal with a *temporary* burst > in versions that need to be stored. That's really bad. Yes, I was thinking why do we need to optimize duplicates in a unique index, but then remembered it is a version problem. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Dec 17, 2019 at 1:58 PM Bruce Momjian wrote: > > Attached is v26, which adds this new criteria/heuristic for unique > > indexes. We now seem to consistently get good results with unique > > indexes. > > In the past we tried to increase the number of cases where HOT updates > can happen but were unable to. Right -- the WARM project. The Z-heap project won't change the fundamentals here. It isn't going to solve the fundamental problem of requiring that the index AM create a new set of physical index tuples in at least *some* cases. A heap tuple cannot be updated in-place when even one indexed column changes -- you're not much better off than you were with the classic heapam, because indexes get bloated in a way that wouldn't happen with Oracle. (Even still, Z-heap isn't sensitive to when and how opportunistic heap pruning takes place, and doesn't have the same issue with having to fit the heap tuple on the same page or create a new HOT chain. This will make things much better with some workloads.) > Would this help with non-HOT updates? Definitely, yes. The strategy used with unique indexes is specifically designed to avoid "unnecessary" page splits altogether -- it only makes sense because of the possibility of non-HOT UPDATEs with mostly-unchanged index tuples. Thinking about what's going on here from first principles is what drove the unique index deduplication design: With many real world unique indexes, the true reason behind most or all B-Tree page splits is "version churn". I view these page splits as a permanent solution to a temporary problem -- we *permanently* degrade the index structure in order to deal with a *temporary* burst in versions that need to be stored. That's really bad. Consider a classic pgbench workload, for example. 
The smaller indexes on the smaller tables (pgbench_tellers_pkey and pgbench_branches_pkey) have leaf pages that will almost certainly be split a few minutes in, even though the UPDATEs on the underlying tables never modify indexed columns (i.e. even though HOT is as effective as it possibly could be with this unthrottled workload). Actually, even the resulting split pages will themselves usually be split again, and maybe even once more after that. We started out with leaf pages that stored just under 370 items on each leaf page (with fillfactor 90 + 8KiB BLCKSZ), and end up with leaf pages that often have less than 50 items (sometimes as few as 10). Even though the "logical contents" of the index are *totally* unchanged. This could almost be considered pathological by users. Of course, it's easy to imagine a case where it matters a lot more than classic pgbench (pgbench_tellers_pkey and pgbench_branches_pkey are always small, so it's easy to see the effect, which is why I went with that example). For example, you could have a long running transaction, which would probably have the effect of significantly bloating even the large pgbench index (pgbench_accounts_pkey) -- typically you won't see that with classic pgbench until you do something to frustrate VACUUM (and opportunistic cleanup). (I have mostly been using non-HOT UPDATEs to test the patch, though.) In theory we could go even further than this by having some kind of version store for indexes, and using this to stash old versions rather than performing a page split. Then you wouldn't have any page splits in the pgbench indexes; VACUUM would eventually be able to return the index to its "pristine" state. The trade-off with that design would be that index scans would have to access two index pages for a while (a leaf page, plus its subsidiary old version page). 
Maybe we can actually go that far in the future -- there are various database research papers that describe designs like this (the designs described within these papers do things like determine whether a "version split" or a "value split" should be performed). What we have now is an incremental improvement, that doesn't have any apparent downside with unique indexes -- the way that deduplication is triggered for unique indexes is almost certain to be a win. When deduplication isn't triggered, everything works in the same way as before -- it's "zero overhead" for unique indexes that don't benefit. The design augments existing garbage collection mechanisms, particularly the way in which we set LP_DEAD bits within _bt_check_unique(). > Do we have any benchmarks where non-HOT updates cause slowdowns that we > can test on this? AFAICT, any workload that has lots of non-HOT updates will benefit at least a little bit -- indexes will finish up smaller, there will be higher throughput, and there will be a reduction in latency for queries. With the right distribution of values, it's not that hard to mostly control bloat in an index that doubles in size without the optimization, which is much more significant. I have already reported on this [1]. I've also been able to observe increases of 15%-20% in TPS with similar workloads (with commensurate
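The leaf-page arithmetic cited earlier in this message (just under 370 items per pgbench leaf page with fillfactor 90 and 8KiB BLCKSZ) can be sanity checked roughly. The sizes below are approximations chosen for the sketch -- a plain page header, nbtree special space, a 4-byte line pointer, and a MAXALIGNed index tuple with an 8-byte header plus an 8-byte integer key -- not authoritative values from the headers:

```python
BLCKSZ = 8192
PAGE_HEADER = 24    # approx. sizeof(PageHeaderData)
SPECIAL = 16        # approx. nbtree special space (BTPageOpaqueData)
LINE_POINTER = 4    # ItemIdData
INDEX_TUPLE = 16    # 8-byte tuple header + 8-byte key, MAXALIGNed
FILLFACTOR = 0.90

usable = (BLCKSZ - PAGE_HEADER - SPECIAL) * FILLFACTOR
items = int(usable // (LINE_POINTER + INDEX_TUPLE))
print(items)  # 366 under these assumptions -- "just under 370"
```

The same arithmetic makes the bloat figure vivid: a split-heavy page holding 50 items instead of ~366 uses more than 7x the space for the same logical contents.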
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Dec 12, 2019 at 06:21:20PM -0800, Peter Geoghegan wrote: > On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan wrote: > > The new criteria/heuristic for unique indexes is very simple: If a > > unique index has an existing item that is a duplicate on the incoming > > item at the point that we might have to split the page, then apply > > deduplication. Otherwise (when the incoming item has no duplicates), > > don't apply deduplication at all -- just accept that we'll have to > > split the page. We already cache the bounds of our initial binary > > search in insert state, so we can reuse that information within > > _bt_findinsertloc() when considering deduplication in unique indexes. > > Attached is v26, which adds this new criteria/heuristic for unique > indexes. We now seem to consistently get good results with unique > indexes. In the past we tried to increase the number of cases where HOT updates can happen but were unable to. Would this help with non-HOT updates? Do we have any benchmarks where non-HOT updates cause slowdowns that we can test on this? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Nov 12, 2019 at 3:21 PM Peter Geoghegan wrote: > * Decided to go back to turning deduplication on by default with > non-unique indexes, and off by default using unique indexes. > > The unique index stuff was regressed enough with INSERT-heavy > workloads that I was put off, despite my initial enthusiasm for > enabling deduplication everywhere. I have changed my mind about this again. I now think that it would make sense to treat deduplication within unique indexes as a separate feature that cannot be disabled by the GUC at all (though we'd probably still respect the storage parameter for debugging purposes). I have found that fixing the WAL record size issue has helped remove what looked like a performance penalty for deduplication (but was actually just a general regression). Also, I have found a way of selectively applying deduplication within unique indexes that seems to have no downside, and considerable upside. The new criteria/heuristic for unique indexes is very simple: If a unique index has an existing item that is a duplicate on the incoming item at the point that we might have to split the page, then apply deduplication. Otherwise (when the incoming item has no duplicates), don't apply deduplication at all -- just accept that we'll have to split the page. We already cache the bounds of our initial binary search in insert state, so we can reuse that information within _bt_findinsertloc() when considering deduplication in unique indexes. This heuristic makes sense because deduplication within unique indexes should only target leaf pages that cannot possibly receive new values. In many cases, the only reason why almost all primary key leaf pages can ever split is because of non-HOT updates whose new HOT chain needs a new, equal entry in the primary key. 
This is the case with your standard identity column/serial primary key, for example (only the rightmost page will have a page split due to the insertion of new logical rows -- every other variety of page split must be due to new physical tuples/versions). I imagine that it is possible for a leaf page to be a "mixture" of these two basic/general tendencies, but not for long. It really doesn't matter if we occasionally fail to delay a page split where that was possible, nor does it matter if we occasionally apply deduplication when that won't delay a split for very long -- pretty soon the page will split anyway. A split ought to separate the parts of the keyspace that exhibit each tendency. In general, we're only interested in delaying page splits in unique indexes *indefinitely*, since in effect that will prevent them *entirely*. (So the goal is *significantly* different to our general goal for deduplication -- it's about buying time for VACUUM to run or whatever, rather than buying space.) This heuristic keeps the TPC-C "old order" table's PK from bloating quite noticeably, since that was the only unique index that is really affected by non-HOT UPDATEs (i.e. the UPDATE queries that touch that table happen to not be HOT-safe in general, which is not the case for any other table). It doesn't regress anything else from TPC-C, since there really isn't a benefit for other tables. More importantly, the working/draft version of the patch will often avoid a huge amount of bloat in a pgbench-style workload that has an extra index on the pgbench_accounts table, to prevent HOT updates. The accounts primary key (pgbench_accounts_pkey) hardly grows at all with the patch, but grows 2x on master. This 2x space saving seems to occur reliably, unless there is a lot of contention on individual *pages*, in which case the bloat can be delayed but not prevented. We get that 2x space saving with either uniformly distributed random updates on pgbench_accounts (i.e. 
the pgbench default), or with a skewed distribution that hashes the PRNG's value. Hashing like this simulates a workload where the skew isn't concentrated in one part of the key space (i.e. there is skew, but very popular values are scattered throughout the index evenly, rather than being concentrated together in just a few leaf pages). Can anyone think of an adversarial case that we may not do so well on with the new "only deduplicate within unique indexes when new item already has a duplicate" strategy? I'm having difficulty identifying some kind of worst case. -- Peter Geoghegan
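For readers following along, the unique-index heuristic described above amounts to something like the following sketch (illustrative Python, not the patch's actual C code; the names and the namedtuple page representation are invented for the example):

```python
from collections import namedtuple

# Toy stand-in for a leaf page item; real nbtree pages hold IndexTuples.
Item = namedtuple("Item", ["key", "tid"])

def should_dedup_unique_index(page_items, new_key, bounds):
    """Decide whether to deduplicate a unique-index leaf page that is
    about to split: only do it if the incoming key already has a
    duplicate on the page (the page is filling up with extra versions
    from non-HOT updates, not with new logical rows).

    `bounds` is the (low, high) range left behind by the initial binary
    search, reused here instead of performing a fresh search.
    """
    low, high = bounds
    return any(item.key == new_key for item in page_items[low:high])
```

If no duplicate exists within the cached bounds, the split is simply accepted, since deduplication could only delay it briefly rather than prevent it.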
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Mon, Nov 18, 2019 at 05:26:37PM -0800, Peter Geoghegan wrote: > Attached is v24. This revision doesn't fix the problem with > xl_btree_insert record bloat, but it does fix the bitrot against the > master branch that was caused by commit 50d22de9. (This patch has had > a surprisingly large number of conflicts against the master branch > recently.) Please note that I have moved this patch to next CF per this last update. Anastasia, the ball is waiting on your side of the field, as the CF entry is marked as waiting on author for some time now. -- Michael
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Moin,

On 2019-11-16 01:04, Peter Geoghegan wrote:
> On Fri, Nov 15, 2019 at 5:16 AM Robert Haas wrote:
>> Hmm. Well, maybe I'm just behind the times. But that same wikipedia article also says that deduplication works on large chunks "such as entire files or large sections of files" thus differentiating it from compression algorithms which work on the byte level, so it seems to me that what you are doing still sounds more like ad-hoc compression.
> I see your point. One reason for my avoiding the word "compression" is that other DB systems that have something similar don't use the word compression either. Actually, they don't really call it *anything*. Posting lists are simply the way that secondary indexes work. The "Modern B-Tree techniques" book/survey paper mentions the idea of using a TID list in its "3.7 Duplicate Key Values" section, not in the two related sections that follow ("Bitmap Indexes", and "Data Compression"). That doesn't seem like a very good argument, now that I've typed it out. The patch applies deduplication/compression/whatever at the point where we'd otherwise have to split the page, unlike GIN. GIN eagerly maintains posting lists (doing in-place updates for most insertions seems pretty bad to me). My argument could reasonably be made about GIN, which really does consider posting lists the natural way to store duplicate tuples. I cannot really make that argument about nbtree with this patch, though -- delaying a page split by re-encoding tuples (changing their physical representation without changing their logical contents) justifies using the word "compression" in the name.
>>> Can you suggest an alternative?
>> My instinct is to pick a name that somehow involves compression and just put enough other words in there to make it clear e.g. duplicate value compression, or something of that sort.
> Does anyone else want to weigh in on this? Anastasia? I will go along with whatever the consensus is. I'm very close to the problem we're trying to solve, which probably isn't helping me here.

I'm in favor of deduplication and not compression. Compression is a more generic term, and can involve deduplication, but it doesn't have to. (It could, for instance, just encode things in a more compact form.) Deduplication, by contrast, doesn't involve compression; it just means storing each repeated thing once, which by coincidence also amounts to using less space, as compression does. ZFS also follows this by having both deduplication (store the same blocks only once, with references) and compression (compress block contents, regardless of whether they are stored once or many times).

So my vote is for deduplication (if I understand the thread correctly, this is what the code now does: storing the exact same key not many times over, but only once, with references or a count?).

best regards,

Tels
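A toy illustration of the distinction Tels draws (hypothetical code, not from the patch): deduplication stores each key once with a list of references, while the stored bytes themselves are never re-encoded:

```python
def deduplicate(index_entries):
    """Collapse (key, tid) entries into one posting entry per key.

    Each key is stored once, together with the list of heap TIDs that
    share it -- the same idea as the patch's posting lists. No
    byte-level encoding is involved, which is why "deduplication"
    arguably fits better than "compression" here.
    """
    posting = {}
    for key, tid in index_entries:
        posting.setdefault(key, []).append(tid)
    return posting

entries = [("foo", (0, 1)), ("foo", (0, 2)), ("bar", (1, 1))]
# "foo" is stored once, referencing two heap TIDs
```

A compressor, by contrast, would rewrite the representation of the entries themselves; here every key and TID survives verbatim.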
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Nov 15, 2019 at 5:43 PM Mark Dilger wrote: > On 11/13/19 11:51 AM, Peter Geoghegan wrote: > > Can you suggest an alternative? > > Dupression This suggestion makes me feel better about "deduplication". -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 11/13/19 11:51 AM, Peter Geoghegan wrote: Can you suggest an alternative? Dupression -- Mark Dilger
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan wrote: > > I haven't measured how these changes affect WAL size yet. > > Do you have any suggestions on how to automate testing of new WAL records? > > Is there any suitable place in regression tests? > > I don't know about the regression tests (I doubt that there is a > natural place for such a test), but I came up with a rough test case. > I more or less copied the approach that you took with the index build > WAL reduction patches, though I also figured out a way of subtracting > heapam WAL overhead to get a real figure. I attach the test case -- > note that you'll need to use the "land" database with this. (This test > case might need to be improved, but it's a good start.) Today I used a test script similar to the "nbtree_wal_test.sql" test script I posted on September 11th. I am concerned about the WAL overhead for cases that don't benefit from the patch (usually because they turn off deduplication altogether). The details of the index tested were different this time, though. I used an index that had the smallest possible tuple size: 16 bytes (that's the smallest possible size on 64-bit systems, which is what almost everybody uses these days). So any index with one or two int4 columns (or one int8 column) will generally have 16 byte IndexTuples, at least when there are no NULLs in the index. In general, 16 byte wide tuples are very, very common. What I saw suggests that we will need to remove the new "postingoff" field from xl_btree_insert. (We can create a new XLog record for leaf page inserts that also need to split a posting list, without changing much else.) The way that *alignment* of WAL records affects these common 16 byte IndexTuple cases is the real problem. Adding "postingoff" to xl_btree_insert increases the WAL required for INSERT_LEAF records by two bytes (sizeof(OffsetNumber)), as you'd expect -- pg_waldump output shows that they're 66 bytes, whereas they're only 64 bytes on the master branch. 
That doesn't sound that bad, but once you consider the alignment of whole records, it's really an extra 8 bytes. That is totally unacceptable. The vast majority of nbtree WAL records are bound to be INSERT_LEAF records, so as things stand we have added (almost) 12.5% space overhead to nbtree for these common cases, that don't benefit. I haven't really looked into other types of WAL record just yet. The real world overhead that we're adding to xl_btree_vacuum records is something that I will have to look into separately. I'm already pretty sure that adding two bytes to xl_btree_split is okay, though, because they're far less numerous than xl_btree_insert records, and aren't affected by alignment in the same way (they're already several hundred bytes in almost all cases). I also noticed something positive: The overhead of xl_btree_dedup WAL records seems to be very low with indexes that have hundreds of logical tuples for each distinct integer value. We don't seem to have a problem with "deduplication thrashing". -- Peter Geoghegan
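The alignment arithmetic is easy to check. On typical 64-bit platforms, record sizes get rounded up to a multiple of 8 bytes, so a two-byte field can cost a full eight bytes once padding is counted. A quick sketch (plain Python standing in for the C MAXALIGN macro):

```python
def maxalign(size, alignment=8):
    """Round size up to the next multiple of `alignment`, in the same
    spirit as PostgreSQL's MAXALIGN macro on a 64-bit platform."""
    return (size + alignment - 1) & ~(alignment - 1)

# A 64-byte INSERT_LEAF record is already aligned; adding the two-byte
# postingoff field pushes the padded size from 64 to 72 bytes.
master_record = maxalign(64)    # stays 64
patched_record = maxalign(66)   # rounds up to 72
overhead = (patched_record - master_record) / master_record  # 0.125
```

That 8-byte growth on a 64-byte record is exactly the (almost) 12.5% overhead figure cited above.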
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Nov 15, 2019 at 5:16 AM Robert Haas wrote: > Hmm. Well, maybe I'm just behind the times. But that same wikipedia > article also says that deduplication works on large chunks "such as > entire files or large sections of files" thus differentiating it from > compression algorithms which work on the byte level, so it seems to me > that what you are doing still sounds more like ad-hoc compression. I see your point. One reason for my avoiding the word "compression" is that other DB systems that have something similar don't use the word compression either. Actually, they don't really call it *anything*. Posting lists are simply the way that secondary indexes work. The "Modern B-Tree techniques" book/survey paper mentions the idea of using a TID list in its "3.7 Duplicate Key Values" section, not in the two related sections that follow ("Bitmap Indexes", and "Data Compression"). That doesn't seem like a very good argument, now that I've typed it out. The patch applies deduplication/compression/whatever at the point where we'd otherwise have to split the page, unlike GIN. GIN eagerly maintains posting lists (doing in-place updates for most insertions seems pretty bad to me). My argument could reasonably be made about GIN, which really does consider posting lists the natural way to store duplicate tuples. I cannot really make that argument about nbtree with this patch, though -- delaying a page split by re-encoding tuples (changing their physical representation without changing their logical contents) justifies using the word "compression" in the name. > > Can you suggest an alternative? > > My instinct is to pick a name that somehow involves compression and > just put enough other words in there to make it clear e.g. duplicate > value compression, or something of that sort. Does anyone else want to weigh in on this? Anastasia? I will go along with whatever the consensus is. I'm very close to the problem we're trying to solve, which probably isn't helping me here. 
-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Nov 13, 2019 at 2:51 PM Peter Geoghegan wrote: > "Deduplication" never means that you get rid of duplicates. According > to Wikipedia's deduplication article: "Whereas compression algorithms > identify redundant data inside individual files and encodes this > redundant data more efficiently, the intent of deduplication is to > inspect large volumes of data and identify large sections – such as > entire files or large sections of files – that are identical, and > replace them with a shared copy". Hmm. Well, maybe I'm just behind the times. But that same wikipedia article also says that deduplication works on large chunks "such as entire files or large sections of files" thus differentiating it from compression algorithms which work on the byte level, so it seems to me that what you are doing still sounds more like ad-hoc compression. > Can you suggest an alternative? My instinct is to pick a name that somehow involves compression and just put enough other words in there to make it clear e.g. duplicate value compression, or something of that sort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Nov 13, 2019 at 11:33 AM Robert Haas wrote: > On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan wrote: > > * Disabled deduplication in system catalog indexes by deeming it > > generally unsafe. > > I (continue to) think that deduplication is a terrible name, because > you're not getting rid of the duplicates. You are using a compressed > representation of the duplicates. "Deduplication" never means that you get rid of duplicates. According to Wikipedia's deduplication article: "Whereas compression algorithms identify redundant data inside individual files and encodes this redundant data more efficiently, the intent of deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, and replace them with a shared copy". This seemed like it fit what this patch does. We're concerned with a specific, simple kind of redundancy. Also: * From the user's point of view, we're merging together what they'd call duplicates. They don't really think of the heap TID as part of the key. * The term "compression" suggests a decompression penalty when reading, which is not the case here. * The term "compression" confuses the feature added by the patch with TOAST compression. Now we may have two very different varieties of compression in the same index. Can you suggest an alternative? -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan wrote: > * Disabled deduplication in system catalog indexes by deeming it > generally unsafe. I (continue to) think that deduplication is a terrible name, because you're not getting rid of the duplicates. You are using a compressed representation of the duplicates. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Mon, Sep 30, 2019 at 7:39 PM Peter Geoghegan wrote: > I've found that my "regular pgbench, but with a low cardinality index > on pgbench_accounts(abalance)" benchmark works best with the specific > heuristics used in the patch, especially over many hours. I ran pgbench without the pgbench_accounts(abalance) index, and with slightly adjusted client counts -- you could say that this was a classic pgbench benchmark of v20 of the patch. Still scale 500, with single hour runs. Here are the results for each 1 hour run, with client counts of 8, 16, and 32, with two rounds of runs total:

master_1_run_8.out: "tps = 25156.689415 (including connections establishing)"
patch_1_run_8.out: "tps = 25135.472084 (including connections establishing)"
master_1_run_16.out: "tps = 30947.053714 (including connections establishing)"
patch_1_run_16.out: "tps = 31225.044305 (including connections establishing)"
master_1_run_32.out: "tps = 29550.231969 (including connections establishing)"
patch_1_run_32.out: "tps = 29425.011249 (including connections establishing)"
master_2_run_8.out: "tps = 24678.792084 (including connections establishing)"
patch_2_run_8.out: "tps = 24891.130465 (including connections establishing)"
master_2_run_16.out: "tps = 30878.930585 (including connections establishing)"
patch_2_run_16.out: "tps = 30982.306091 (including connections establishing)"
master_2_run_32.out: "tps = 29555.453436 (including connections establishing)"
patch_2_run_32.out: "tps = 29591.767136 (including connections establishing)"

This interlaced order is the same order that each 1 hour pgbench run actually ran in. The patch wasn't expected to do any better here -- it was expected to not be any slower for a workload that it cannot really help. Though the two small pgbench indexes do remain a lot smaller with the patch. While a lot of work remains to validate the performance of the patch, this looks good to me. -- Peter Geoghegan
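The "not any slower" reading holds up numerically: recomputing the relative change for each run above shows that every master-vs-patch difference is under one percent in either direction. A quick check:

```python
# tps figures copied from the pgbench runs quoted above
results = {
    # (clients, round): (master_tps, patch_tps)
    (8, 1): (25156.689415, 25135.472084),
    (16, 1): (30947.053714, 31225.044305),
    (32, 1): (29550.231969, 29425.011249),
    (8, 2): (24678.792084, 24891.130465),
    (16, 2): (30878.930585, 30982.306091),
    (32, 2): (29555.453436, 29591.767136),
}

for (clients, rnd), (master, patch) in sorted(results.items()):
    delta_pct = 100.0 * (patch - master) / master
    print(f"{clients:>2} clients, round {rnd}: {delta_pct:+.2f}%")
```

The largest swings (about +0.9% at 16 clients and -0.4% at 32 clients) are within ordinary pgbench run-to-run noise.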
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Sep 27, 2019 at 9:43 AM Anastasia Lubennikova wrote: > Attached is v19. Cool. > * By default deduplication is on for non-unique indexes and off for > unique ones. I think that it makes sense to enable deduplication by default -- even with unique indexes. It looks like deduplication can be very helpful with non-HOT updates. I have been benchmarking this using more or less standard pgbench at scale 200, with one big difference -- I also create an index on "pgbench_accounts (abalance)". This is a low cardinality index, which ends up about 3x smaller with the patch, as expected. It also makes all updates non-HOT updates, greatly increasing index bloat in the primary key of the accounts table -- this is what I found really interesting about this workload. The theory behind deduplication within unique indexes seems quite different to the cases we've focussed on so far -- that's why my working copy of the patch makes a few small changes to how _bt_dedup_one_page() works with unique indexes specifically (more on that later). With unique indexes, deduplication doesn't help by creating space -- it helps by creating *time* for garbage collection to run before the real "damage" is done -- it delays page splits. This is only truly valuable when page splits caused by non-HOT updates are delayed by so much that they're actually prevented entirely, typically because the _bt_vacuum_one_page() stuff can now happen before pages split, not after. In general, these page splits are bad because they degrade the B-Tree structure, more or less permanently (it's certainly permanent with this workload). Having a huge number of page splits *purely* because of non-HOT updates is particularly bad -- it's just awful. I believe that this is the single biggest problem with the Postgres approach to versioned storage (we know that other DB systems have no primary key page splits with this kind of workload). 
Anyway, if you run this pgbench workload without rate-limiting, so that a patched Postgres does as much work as physically possible, the accounts table primary key (pgbench_accounts_pkey) at least grows at a slower rate -- the patch clearly beats master at the start of the benchmark/test (as measured by index size). As the clients are ramped up by my testing script, and as time goes on, eventually the size of the pgbench_accounts_pkey index "catches up" with master. The patch delays page splits, but ultimately the system as a whole cannot prevent the page splits altogether, since the server is being absolutely hammered by pgbench. Actually, the index is *exactly* the same size for both the master case and the patch case when we reach this "bloat saturation point". We can delay the problem, but we cannot prevent it. But what about a more realistic workload, with rate-limiting? When I add some rate limiting, so that the TPS/throughput is at about 50% of what I got the first time around (i.e. 50% of what is possible), or 15k TPS, it's very different. _bt_dedup_one_page() can now effectively cooperate with _bt_vacuum_one_page(). Now deduplication is able to "soak up all the extra garbage tuples" for long enough to delay and ultimately *prevent* almost all page splits. pgbench_accounts_pkey starts off at 428 MB for both master and patch (CREATE INDEX makes it that size). After about an hour, the index is 447 MB with the patch. The master case ends up with a pgbench_accounts_pkey size of 854 MB, though (this is very close to 857 MB, the "saturation point" index size from before). This is a very significant improvement, obviously -- the patch has an index that is ~52% of the size seen for the same index with the master branch! Here is how I changed _bt_dedup_one_page() for unique indexes to get this result: * We limit the size of posting lists to 5 heap TIDs in the checkingunique case. 
Right now, we will actually accept a checkingunique page split before we'll merge together items that result in a posting list with more heap TIDs than that (not sure about these details at all, though). * Avoid creating a new posting list that caller will have to split immediately anyway (this is based on details of _bt_dedup_one_page() caller's newitem tuple). (Not sure how much this customization contributes to this favorable test result -- maybe it doesn't make that much difference.) The goal here is for duplicates that are close together in both time and space to get "clumped together" into their own distinct, small-ish posting list tuples with no more than 5 TIDs. This is intended to help _bt_vacuum_one_page(), which is known to be a very important mechanism for indexes like our pgbench_accounts_pkey index (LP_DEAD bits are set very frequently within _bt_check_unique()). The general idea is to balance deduplication against LP_DEAD killing, and to increase spatial/temporal locality within these smaller posting lists. If we have one huge posting list for each value, then we can't set the LP_DEAD bit on anything at all, which is very bad.
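The "no more than 5 heap TIDs per posting list in the checkingunique case" idea can be sketched as follows (illustrative Python; the real _bt_dedup_one_page() works on raw page items, and as noted above these details were not final):

```python
def chunk_duplicates(tids, max_tids=5):
    """Split the heap TIDs for one duplicated key into several small
    posting lists rather than one big one.

    Keeping posting lists small preserves granularity for the LP_DEAD
    optimization: _bt_check_unique() can still mark a whole small
    posting list dead, whereas one huge posting list could never get
    its LP_DEAD bit set while any single TID in it remained live.
    """
    return [tids[i:i + max_tids] for i in range(0, len(tids), max_tids)]

# 12 versions of one value become three posting lists of 5, 5, and 2 TIDs
tids = [(0, n) for n in range(1, 13)]
```

Versions that arrive close together in time land in the same small posting list, which is exactly the clumping the message describes.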
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 25, 2019 at 8:05 AM Anastasia Lubennikova wrote: > Attached is v18. In this version bt_dedup_one_page() is refactored so that: > - no temp page is used, all updates are applied to the original page. > - each posting tuple wal logged separately. > This also allowed to simplify btree_xlog_dedup significantly. This looks great! Even if it isn't faster than using a temp page buffer, the flexibility seems like an important advantage. We can do things like have the _bt_dedup_one_page() caller hint that deduplication should start at a particular offset number. If that doesn't work out by the time the end of the page is reached (whatever "works out" may mean), then we can just start at the beginning of the page, and work through the items we skipped over initially. > > We still haven't added an "off" switch to deduplication, which seems > > necessary. I suppose that this should look like GIN's "fastupdate" > > storage parameter. > Why is it necessary to save this information somewhere but rel->rd_options, > while we can easily access this field from _bt_findinsertloc() and > _bt_load(). Maybe, but we also need to access a flag that says it's safe to use deduplication. Obviously deduplication is not safe for datatypes like numeric and text with a nondeterministic collation. The "is deduplication safe for this index?" mechanism will probably work by doing several catalog lookups. This doesn't seem like something we want to do very often, especially with a buffer lock held -- ideally it will be somewhere that's convenient to access. Do we want to do that separately, and have a storage parameter that says "I would like to use deduplication in principle, if it's safe"? Or, do we store both pieces of information together, and forbid setting the storage parameter to on when it's known to be unsafe for the underlying opclasses used by the index? I don't know. I think that you can start working on this without knowing exactly how we'll do those catalog lookups. 
What you come up with has to work with that before the patch can be committed, though. -- Peter Geoghegan
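The two questions Peter separates could combine into a single check roughly like this (a hypothetical sketch; the actual catalog-lookup and caching mechanism was still undecided at this point, and the names here are invented):

```python
def btree_allow_dedup(storage_param_on, opclasses_deterministic):
    """Hypothetical combination of the two questions above: the user's
    preference (a storage parameter analogous to GIN's "fastupdate")
    and a hard safety check (e.g. no nondeterministic collations).

    Deduplication runs only when it is both requested and safe. The
    safety answer would ideally be cached somewhere cheap to read,
    since repeating catalog lookups with a buffer lock held is
    unappealing.
    """
    if not opclasses_deterministic:
        return False             # never safe, regardless of the parameter
    return storage_param_on      # safe: respect the user's choice
```

This also illustrates the design question in the message: the safe/unsafe flag could live alongside the storage parameter, or setting the parameter could simply be forbidden when the opclasses are known to be unsafe.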
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On 24.09.2019 3:13, Peter Geoghegan wrote:
> On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan wrote:
>> I attach version 16. This revision merges your recent work on WAL logging with my recent work on simplifying _bt_dedup_one_page(). See my e-mail from earlier today for details.
> I attach version 17. This version has changes that are focussed on further polishing certain things, including fixing some minor bugs. It seemed worth creating a new version for that. (I didn't get very far with the space utilization stuff I talked about, so no changes there.)

Attached is v18. In this version bt_dedup_one_page() is refactored so that:
- no temp page is used, all updates are applied to the original page;
- each posting tuple is WAL-logged separately.
This also allowed to simplify btree_xlog_dedup significantly.

> Another infrastructure thing that the patch needs to handle to be committable: We still haven't added an "off" switch to deduplication, which seems necessary. I suppose that this should look like GIN's "fastupdate" storage parameter. It's not obvious how to do this in a way that's easy to work with, though. Maybe we could do something like copy GIN's GinGetUseFastUpdate() macro, but the situation with nbtree is actually quite different. There are two questions for nbtree when it comes to deduplication within an index: 1) Does the user want to use deduplication, because that will help performance?, and 2) Is it safe/possible to use deduplication at all?

I'll send another version with the dedup option soon.

> I think that we should probably stash this information (deduplication is both possible and safe) in the metapage. Maybe we can copy it over to our insertion scankey, just like the "heapkeyspace" field -- that information also comes from the metapage (it's based on the nbtree version). The "heapkeyspace" field is a bit ugly, so maybe we shouldn't go further by adding something similar, but I don't see any great alternative right now.
Why is it necessary to save this information somewhere other than rel->rd_options, while we can easily access this field from _bt_findinsertloc() and _bt_load()?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..d65e2a7 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 					  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 					 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 					       OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	/*
 	 * Size Bloom filter based on estimated number of tuples in index,
 	 * while conservatively assuming that each block must contain at least
-	 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-	 * contain non-pivot tuples.  That's okay because they generally make
-	 * up no more than about 1% of all pages in the index.)
+	 * MaxPostingIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+	 * verification fingerprints posting list heap TIDs as plain non-pivot
+	 * tuples, complete with index keys.  This allows its heap scan to
+	 * behave as if posting lists do not exist.
 	 */
 	total_pages = RelationGetNumberOfBlocks(rel);
-	total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+	total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
 			  (int64) state->rel->rd_rel->reltuples);
 	/* Random seed relies on backend srandom() call to avoid repetition */
 	seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;

 		CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);

 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Mon, Sep 23, 2019 at 5:13 PM Peter Geoghegan wrote: > I attach version 17. I attach a patch that applies on top of v17. It adds support for deduplication within unique indexes. Actually, this is a snippet of code that appeared in my prototype from August 5 (we need very little extra code for this now). Unique index support kind of looked like a bad idea at the time, but things have changed a lot since then. I benchmarked this overnight using a custom pgbench-based test that used a Zipfian distribution, with a single-row SELECT and an UPDATE of pgbench_accounts per xact. I used regular/logged tables this time around, since WAL-logging is now fairly efficient. I added a separate low cardinality index on pgbench_accounts(abalance). A low cardinality index is the most interesting case for this patch, obviously, but it also serves to prevent all HOT updates, increasing bloat for both indexes. We want a realistic case that creates a lot of index bloat. This wasn't a rigorous enough benchmark to present here in full, but the results were very encouraging. With reasonable client counts for the underlying hardware, we seem to have a small increase in TPS, and a small decrease in latency. There is a regression with 128 clients, when contention is ridiculously high (this is my home server, which only has 4 cores). More importantly: * The low cardinality index is almost 3x smaller with the patch -- no surprises there. * The read latency is where latency goes up, if it goes up at all. Whatever it is that might increase latency, it doesn't look like it's deduplication itself. Yeah, deduplication is expensive, but it's not nearly as expensive as a page split. (I'm talking about the immediate cost, not the bigger picture, though the bigger picture matters even more.) * The growth in primary key size over time is the thing I find really interesting. The patch seems to really control the number of pages splits over many hours with lots of non-HOT updates. 
I think that a timeline of days or weeks could be really interesting. I am now about 75% convinced that adding deduplication to unique indexes is a good idea, at least as an option that is disabled by default. We're already doing well here, even though there has been no work on optimizing deduplication in unique indexes. Further optimizations seem quite possible, though. I'm mostly thinking about optimizing non-HOT updates by teaching nbtree some basic things about versioning with unique indexes. For example, we could remember "recently dead" duplicates of the value we are about to insert (as part of an UPDATE statement) from within _bt_check_unique(). Then, when it looks like a page split may be necessary, we can hint to _bt_dedup_one_page(): "please just deduplicate the group of duplicates starting from this offset, which are duplicates of this new item I am inserting -- do not create a posting list that I will have to split, though". We already cache the binary search bounds established within _bt_check_unique() in insertstate, so perhaps we could reuse that to make this work. The goal here is that the old/recently dead versions end up together in their own posting list (or maybe two posting lists), whereas our new/most current tuple is on its own. There is a very good chance that our transaction will commit, leaving somebody else to set the LP_DEAD bit on the posting list that contains those old versions. In short, we'd be making deduplication and opportunistic garbage collection cooperate closely. This can work both ways -- maybe we should also teach _bt_vacuum_one_page() to cooperate with _bt_dedup_one_page(). That is, if we add the mechanism I just described in the last paragraph, maybe _bt_dedup_one_page() marks the posting list that is likely to get its LP_DEAD bit set soon with a new hint bit -- the LP_REDIRECT bit. Here, LP_REDIRECT means "somebody is probably going to set the LP_DEAD bit on this posting list tuple very soon". 
That way, if nobody actually does set the LP_DEAD bit, _bt_vacuum_one_page() still has options. If it goes to the heap and finds that the latest version is visible to any possible MVCC snapshot, then it's safe to kill all the other versions, even without the LP_DEAD bit set -- this is a unique index. So, it often gets to kill lots of extra garbage that it wouldn't otherwise get to kill, preventing page splits. The cost is pretty low: the risk that the single heap page check will be a wasted effort. (Of course, we still have to visit the heap pages of things that we go on to kill, to get the XIDs to generate recovery conflicts -- the important point is that we only need to visit one heap page in _bt_vacuum_one_page(), to *decide* if it's possible to do all this -- cases that don't benefit at all also don't pay very much.)

I don't think that we need to do either of these two other things to justify committing the patch with unique index support. But, teaching nbtree a little bit about versioning
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 18, 2019 at 10:43 AM Peter Geoghegan wrote: > This also suggests that making _bt_dedup_one_page() do raw page adds > and page deletes to the page in shared_buffers (i.e. don't use a temp > buffer page) could pay off. As I went into at the start of this > e-mail, unnecessarily doing expensive things like copying large > posting lists around is a real concern. Even if it isn't truly useful > for _bt_dedup_one_page() to operate in a very incremental fashion, > incrementalism is probably still a good thing to aim for -- it seems > to make deduplication faster in all cases. I think that I forgot to mention that I am concerned that the kill_prior_tuple/LP_DEAD optimization could be applied less often because _bt_dedup_one_page() operates too aggressively. That is a big part of my general concern. Maybe I'm wrong about this -- who knows? I definitely think that LP_DEAD setting by _bt_check_unique() is generally a lot more important than LP_DEAD setting by the kill_prior_tuple optimization, and the patch won't affect unique indexes. Only very serious benchmarking can give us a clear answer, though. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Sep 17, 2019 at 9:43 AM Anastasia Lubennikova wrote: > > 3. Third, there is incremental writing of the page itself -- avoiding > > using a temp buffer. Not sure where I stand on this. > > I think it's a good idea. memmove must be much faster than copying > items tuple by tuple. > I'll send next patch by the end of the week. I think that the biggest problem is that we copy all of the tuples, including existing posting list tuples that can't be merged with anything. Even if you assume that we'll never finish early (e.g. by using logic like the "if (pagesaving >= newitemsz) deduplicate = false;" thing), this can still unnecessarily slow down deduplication. Very often, _bt_dedup_one_page() is called when 1/2 - 2/3 of the space on the page is already used by a small number of very large posting list tuples. > > The loop within _bt_dedup_one_page() is very confusing in both v13 and > > v14 -- I couldn't figure out why the accounting worked like this: > I'll look at it. I'm currently working on merging my refactored version of _bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky. (I have finished merging the other WAL-logging stuff, though -- that was easy.) The general idea is that the loop in _bt_dedup_one_page() now explicitly operates with a "base" tuple, rather than *always* saving the prev tuple from the last loop iteration. We always have a "pending posting list", which won't be written-out as a posting list if it isn't possible to merge at least one existing page item. The "base" tuple doesn't change. "pagesaving" space accounting works in a way that doesn't care about whether or not the base tuple was already a posting list -- it saves the size of the IndexTuple without any existing posting list size, and calculates the contribution to the total size of the new posting list separately (heap TIDs from the original base tuple and subsequent tuples are counted together). 
This has a number of advantages: * The loop is a lot clearer now, and seems to have slightly better space utilization because of improved accounting (with or without the "if (pagesaving >= newitemsz) deduplicate = false;" thing). * I think that we're going to need to be disciplined about which tuple is the "base" tuple for correctness reasons -- we should always use the leftmost existing tuple to form a new posting list tuple. I am concerned about rare cases where we deduplicate tuples that are equal according to _bt_keep_natts_fast()/datum_image_eq() that nonetheless have different sizes (and are not bitwise equal). There are rare cases involving TOAST compression where that is just about possible (see the temp comments I added to _bt_keep_natts_fast() in the patch). * It's clearly faster, because there is far less palloc() overhead -- the "land" unlogged table test completes in about 95.5% of the time taken by v15 (I disabled "if (pagesaving >= newitemsz) deduplicate = false;" for both versions here, to keep it simple and fair). This also suggests that making _bt_dedup_one_page() do raw page adds and page deletes to the page in shared_buffers (i.e. don't use a temp buffer page) could pay off. As I went into at the start of this e-mail, unnecessarily doing expensive things like copying large posting lists around is a real concern. Even if it isn't truly useful for _bt_dedup_one_page() to operate in a very incremental fashion, incrementalism is probably still a good thing to aim for -- it seems to make deduplication faster in all cases. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
16.09.2019 21:58, Peter Geoghegan wrote:

On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova wrote:

I tested the patch with nbtree_wal_test, and found out that the real issue is not the dedup WAL records themselves, but the full page writes that they trigger. Here are test results (config is standard, except fsync=off to speed up tests): 'FPW on' and 'FPW off' are tests on v14. NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

I think that it makes sense to focus on synthetic cases without FPWs/FPIs from checkpoints. At least for now.

With random insertions into the btree it's highly possible that deduplication will often be the first write after a checkpoint, and thus will trigger an FPW, even if only a few tuples were compressed.

<...>

I think that the problem here is that you didn't copy this old code from _bt_split() over to _bt_dedup_one_page():

    /*
     * Copy the original page's LSN into leftpage, which will become the
     * updated version of the page.  We need this because XLogInsert will
     * examine the LSN and possibly dump it in a page image.
     */
    PageSetLSN(leftpage, PageGetLSN(origpage));
    isleaf = P_ISLEAF(oopaque);

Note that this happens at the start of _bt_split() -- the temp page buffer based on origpage starts out with the same LSN as origpage. This is an important step of the WAL volume optimization used by _bt_split().

That's it. I suspected that such an enormous amount of FPW reflected some bug. That's why there is no significant difference with the log_newpage_buffer() approach. And that's why "lazy" deduplication doesn't help to decrease the amount of WAL. My point was that the problem is extra FPWs, so it doesn't matter whether we deduplicate just several entries to free enough space or all of them.

The term "lazy deduplication" is seriously overloaded here. I think that this could cause miscommunications. Let me list the possible meanings of that term here:

1.
First of all, the basic approach to deduplication is already lazy, unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid a page split. I'm 100% sure that we both think that that works well compared to an eager approach (like GIN's).

Sure.

2. Second of all, there is the need to incrementally WAL log. It looks like v14 does that well, in that it doesn't create "xlrec_dedup.n_intervals" space when it isn't truly needed. That's good.

In v12-v15 I mostly concentrated on this feature. The last version looks good to me.

3. Third, there is incremental writing of the page itself -- avoiding using a temp buffer. Not sure where I stand on this.

I think it's a good idea. memmove must be much faster than copying items tuple by tuple. I'll send the next patch by the end of the week.

4. Finally, there is the possibility that we could make deduplication incremental, in order to avoid work that won't be needed altogether -- this would probably be combined with 3. Not sure where I stand on this, either.

We should try to be careful when using these terms, as there is a very real danger of talking past each other.

Another, and more realistic, approach is to make deduplication less intensive: if the freed space is less than some threshold, fall back to not changing the page at all and not generating an xlog record.

I see that v14 uses the "dedupInterval" struct, which provides a logical description of a deduplicated set of tuples. That general approach is at least 95% of what I wanted from the _bt_dedup_one_page() WAL-logging.

Probably that was the reason why the patch became faster after I added BT_COMPRESS_THRESHOLD in early versions -- not because deduplication itself is CPU bound or something, but because the WAL load decreased.

I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression faster as things are. I am not against bringing back BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I think that it's a distraction. It may hide problems that we want to fix.
Like the PageSetLSN() problem I mentioned just now, and maybe others. We will definitely need to have page space accounting that's a bit similar to nbtsplitloc.c, to avoid the case where a leaf page is 100% full (or has 4 bytes left, or something). That happens regularly now. That must start with teaching _bt_dedup_one_page() about how much space it will free. Basing it on the number of items on the page or whatever is not going to work that well. I think that it would be possible to have something like BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the deduplication incremental, in the sense that it can give up on deduplication when it frees enough space (i.e. something like v13's 0002-* patch). I said that these two things are closely related, which is true, but it's also true that they don't overlap. Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so made a handful of specific indexes (mostly from TPC-H) significantly smaller. I never tried to debug the
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova wrote:
> Attached is v14 based on v12 (v13 changes are not merged).
>
> In this version, I fixed the bug you mentioned and also fixed nbtinsert,
> so that it doesn't save newposting in xlog record anymore.

Cool.

> I tested patch with nbtree_wal_test, and found out that the real issue is
> not the dedup WAL records themselves, but the full page writes that they
> trigger.
> Here are test results (config is standard, except fsync=off to speedup tests):
>
> 'FPW on' and 'FPW off' are tests on v14.
> NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

I think that it makes sense to focus on synthetic cases without FPWs/FPIs from checkpoints. At least for now.

> With random insertions into btree it's highly possible that deduplication
> will often be the first write after checkpoint, and thus will trigger FPW,
> even if only a few tuples were compressed.

I find that hard to believe. Deduplication only occurs when we're about to split the page. If that's almost as likely to occur as a simple insert, then we're in big trouble (maybe it's actually true, but if it is then that's the real problem). Also, fewer pages for the index naturally lead to far fewer FPIs after a checkpoint.

I used "pg_waldump -z" and "pg_waldump --stats=record" to evaluate the same case on v13. It was practically the same as the master branch, apart from the huge difference in FPIs for the XLOG rmgr. Aside from that one huge difference, there was a similar volume of the same types of WAL records in each case. Mostly leaf inserts, and far fewer internal page inserts. I suppose this isn't surprising.
It probably makes sense for the final version of the patch to increase the volume of WAL a little overall, since the savings for internal page stuff cannot make up for the cost of having to WAL log something extra (deduplication operations) on leaf pages, regardless of the size of those extra dedup WAL records (I am ignoring FPIs after a checkpoint in this analysis). So the patch is more or less certain to add *some* WAL overhead in cases that benefit, and that's okay. But, it adds way too much WAL overhead today (even in v14), for reasons that we don't understand yet, which is not okay. I may have misunderstood your approach to WAL-logging in v12. I thought that you were WAL-logging things that didn't change, which doesn't seem to be the case with v14. I thought that v12 was very similar to v11 (and my v13) in terms of how _bt_dedup_one_page() does its WAL-logging. v14 looks good, though. "pg_waldump -z" and "pg_waldump --stats=record" will break down the contributing factor of FPIs, so it should be possible to account for the overhead in the test case exactly. We can debug the problem by using pg_waldump to count the absolute number of each type of record, and the size of each type of record. (Thinks some more...) I think that the problem here is that you didn't copy this old code from _bt_split() over to _bt_dedup_one_page(): /* * Copy the original page's LSN into leftpage, which will become the * updated version of the page. We need this because XLogInsert will * examine the LSN and possibly dump it in a page image. */ PageSetLSN(leftpage, PageGetLSN(origpage)); isleaf = P_ISLEAF(oopaque); Note that this happens at the start of _bt_split() -- the temp page buffer based on origpage starts out with the same LSN as origpage. This is an important step of the WAL volume optimization used by _bt_split(). > That's why there is no significant difference with log_newpage_buffer() > approach. > And that's why "lazy" deduplication doesn't help to decrease amount of WAL. 
The term "lazy deduplication" is seriously overloaded here. I think that this could cause miscommunications. Let me list the possible meanings of that term here: 1. First of all, the basic approach to deduplication is already lazy, unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid a page split. I'm 100% sure that we both think that that works well compared to an eager approach (like GIN's). 2. Second of all, there is the need to incrementally WAL log. It looks like v14 does that well, in that it doesn't create "xlrec_dedup.n_intervals" space when it isn't truly needed. That's good. 3. Third, there is incremental writing of the page itself -- avoiding using a temp buffer. Not sure where I stand on this. 4. Finally, there is the possibility that we could make deduplication incremental, in order to avoid work that won't be needed altogether -- this would probably be combined with 3. Not sure where I stand on this, either. We should try to be careful when using these terms, as there is a very real danger of talking past each other. > Another, and more realistic approach is to make deduplication less intensive: > if freed space is less than some threshold, fall back to not changing page at > all and not generating xlog record. I see
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
13.09.2019 4:04, Peter Geoghegan wrote:

On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan wrote: I think that the new WAL record has to be created once per posting list that is generated, not once per page that is deduplicated -- that's the only way that I can see that avoids a huge increase in total WAL volume. Even if we assume that I am wrong about there being value in making deduplication incremental, it is still necessary to make the WAL-logging behave incrementally.

It would be good to hear your thoughts on this _bt_dedup_one_page() WAL volume/"write amplification" issue.

Attached is v14 based on v12 (v13 changes are not merged). In this version, I fixed the bug you mentioned and also fixed nbtinsert, so that it doesn't save newposting in the xlog record anymore.

I tested the patch with nbtree_wal_test, and found out that the real issue is not the dedup WAL records themselves, but the full page writes that they trigger. Here are test results (config is standard, except fsync=off to speed up tests). 'FPW on' and 'FPW off' are tests on v14. NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

                    | FPW on    | FPW off   | FORCE_NO_IMAGE | master
 -------------------+-----------+-----------+----------------+-----------
  time              | 09:12 min | 06:56 min | 06:24 min      | 08:10 min
  nbtree_wal_volume | 8083 MB   | 2128 MB   | 2327 MB        | 2439 MB
  index_size        | 169 MB    | 169 MB    | 169 MB         | 1118 MB

With random insertions into the btree it's highly possible that deduplication will often be the first write after a checkpoint, and thus will trigger an FPW, even if only a few tuples were compressed. That's why there is no significant difference with the log_newpage_buffer() approach. And that's why "lazy" deduplication doesn't help to decrease the amount of WAL. Also, since the index is packed way better than before, it probably benefits less from wal_compression.

One possible "fix" to decrease WAL amplification is to add the REGBUF_NO_IMAGE flag to XLogRegisterBuffer in bt_dedup_one_page().
As you can see from the test results, it really eliminates the problem of inadequate WAL volume. However, I doubt that it is a crash-safe idea.

Another, and more realistic, approach is to make deduplication less intensive: if the freed space is less than some threshold, fall back to not changing the page at all and not generating an xlog record. Probably that was the reason why the patch became faster after I added BT_COMPRESS_THRESHOLD in early versions -- not because deduplication itself is CPU bound or something, but because the WAL load decreased. So I propose to develop this idea. The question is how to choose the threshold. I wouldn't like to introduce new user settings. Any ideas?

I also noticed that the number of checkpoints differs between tests:

select checkpoints_req from pg_stat_bgwriter;

                  | FPW on | FPW off | FORCE_NO_IMAGE | master
 -----------------+--------+---------+----------------+--------
  checkpoints_req | 16     | 7       | 8              | 10

And I struggle to explain the reason for this. Do you understand what can cause the difference?

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..399743d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;

 		CHECK_FOR_INTERRUPTS();

@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);

 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 11, 2019 at 3:09 PM Peter Geoghegan wrote: > Hmm. So v12 seems to have some problems with the WAL logging for > posting list splits. With wal_debug = on and > wal_consistency_checking='all', I can get a replica to fail > consistency checking very quickly when "make installcheck" is run on > the primary I see the bug here. The problem is that we WAL-log a version of the new item that already has its heap TID changed. On the primary, the call to _bt_form_newposting() has a new item with the original heap TID, which is then rewritten before being inserted -- that's correct. But during recovery, we *start out with* a version of the new item that *already* had its heap TID swapped. So we have nowhere to get the original heap TID from during recovery. Attached patch fixes the problem in a hacky way -- it WAL-logs the original heap TID, just in case. Obviously this fix isn't usable, but it should make the problem clearer. Can you come up with a proper fix, please? I can think of one way of doing it, but I'll leave the details to you. The same issue exists in _bt_split(), so the tests will still fail with wal_consistency_checking -- it just takes a lot longer to reach a point where an inconsistent page is found, because posting list splits that occur at the same point that we need to split a page are much rarer than posting list splits that occur when we simply need to insert, without splitting the page. I suggest using wal_consistency_checking to test the fix that you come up with. As I mentioned, I regularly use it. Also note that there are further subtleties to doing this within _bt_split() -- see the FIXME comments there. Thanks -- Peter Geoghegan 0001-Save-original-new-heap-TID-in-insert-WAL-record.patch Description: Binary data
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova wrote:
> Attached is v12, which contains WAL optimizations for posting split and
> page deduplication.

Hmm. So v12 seems to have some problems with the WAL logging for posting list splits. With wal_debug = on and wal_consistency_checking='all', I can get a replica to fail consistency checking very quickly when "make installcheck" is run on the primary:

4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30423A0; LSN 0/30425A0: prev 0/3041C78; xid 506; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 20 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30425A0; LSN 0/3042F78: prev 0/30423A0; xid 506; len 4; blkref #0: rel 1663/16385/2673, blk 13 FPW - Btree/INSERT_LEAF: off 138; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3042F78; LSN 0/3043788: prev 0/30425A0; xid 506; len 4; blkref #0: rel 1663/16385/2674, blk 37 FPW - Btree/INSERT_LEAF: off 68; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043788; LSN 0/30437C0: prev 0/3042F78; xid 506; len 28 - Transaction/ABORT: 2019-09-11 15:01:06.291717-07; rels: pg_tblspc/16388/PG_13_201909071/16385/16399
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30437C0; LSN 0/3043A30: prev 0/3043788; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9 FPW - Heap/INSERT: off 9 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043A30; LSN 0/3043D08: prev 0/30437C0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2 FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043D08; LSN 0/3044948: prev 0/3043A30; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1 FPW - Btree/INSERT_LEAF: off 169; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044948; LSN 0/3044B58: prev 0/3043D08; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 21 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044B58; LSN 0/30454A0: prev 0/3044948; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8 FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30454A0; LSN 0/3045CC0: prev 0/3044B58; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 37 FPW - Btree/INSERT_LEAF: off 71; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045CC0; LSN 0/3045F48: prev 0/30454A0; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9 FPW - Heap/INSERT: off 10 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045F48; LSN 0/3046240: prev 0/3045CC0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2 FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046240; LSN 0/3046E70: prev 0/3045F48; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1 FPW - Btree/INSERT_LEAF: off 44; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046E70; LSN 0/3047090: prev 0/3046240; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 22 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3047090; LSN 0/30479E0: prev 0/3046E70; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8 FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30479E0; LSN 0/3048420: prev 0/3047090; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 38 FPW - Btree/INSERT_LEAF: off 10; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048420; LSN 0/30486B0: prev 0/30479E0; xid 507; len 3; blkref #0: rel 1663/16385/1259, blk 0 FPW - Heap/INSERT: off 6 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30486B0; LSN 0/3048C30: prev 0/3048420; xid 507; len 4; blkref #0: rel 1663/16385/2662, blk 2 FPW - Btree/INSERT_LEAF: off 119; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048C30; LSN 0/3049668: prev 0/30486B0; xid 507; len 4; blkref #0: rel 1663/16385/2663, blk 1 FPW - Btree/INSERT_LEAF: off 42; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3049668; LSN 0/304A550: prev 0/3048C30; xid 507; len 4; blkref #0: rel 1663/16385/3455, blk 1 FPW - Btree/INSERT_LEAF: off 2; in_posting_offset 1
4448/2019-09-11 15:01:06 PDT FATAL: inconsistent page found, rel 1663/16385/3455, forknum 0, blkno 1
4448/2019-09-11 15:01:06 PDT CONTEXT: WAL redo at 0/3049668 for Btree/INSERT_LEAF: off 2; in_posting_offset 1
4447/2019-09-11 15:01:06 PDT LOG: startup process (PID 4448) exited with exit code 1
4447/2019-09-11 15:01:06 PDT LOG: terminating any other active server processes
4447/2019-09-11 15:01:06 PDT LOG: database system is shut down

I regularly use this test case for the patch -- I think that I fixed a similar problem in v11, when I changed the same WAL logging, but I didn't mention it until now. I will debug this myself in a few days, though you may prefer to do it before then.

-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova wrote:
> I reviewed them and everything looks good except the idea of not
> splitting dead posting tuples.
> According to the comments on scan->ignore_killed_tuples in genam.c:107,
> it may lead to incorrect tuple order on a replica.
> I'm not sure if it leads to any real problem, though, or whether it will be
> resolved by subsequent visibility checks.

Fair enough, but I didn't do that because it's compelling on its own -- it isn't. I did it because it seemed like the best way to handle posting list splits in a version of the patch where LP_DEAD bits can be set on posting list tuples. I think that we have 3 high level options here:

1. We don't support kill_prior_tuple/LP_DEAD bit setting with posting lists at all. This is clearly the easiest approach.

2. We do what I did in v11 of the patch -- we make it so that _bt_insertonpg() and _bt_split() never have to deal with LP_DEAD posting lists that they must split in passing.

3. We add additional code to _bt_insertonpg() and _bt_split() to deal with the rare case where they must split an LP_DEAD posting list, probably by unsetting the bit or something like that. Obviously it would be wrong to leave the LP_DEAD bit set for the newly inserted heap tuple's TID that must go in a posting list that had its LP_DEAD bit set -- that would make it dead to index scans even after its xact successfully committed.

I think that you already agree that we want to have the kill_prior_tuple optimizations with posting lists, so #1 isn't really an option. That just leaves #2 and #3. Since posting list splits are already assumed to be quite rare, it seemed far simpler to take the conservative approach of forcing clean-up that removes LP_DEAD bits so that _bt_insertonpg() and _bt_split() don't have to think about it. Obviously I think it's important that we make as few changes as possible to _bt_insertonpg() and _bt_split(), in general.

I don't understand what you mean about visibility checks.
There is nothing truly special about the way in which _bt_findinsertloc() will sometimes have to kill LP_DEAD items so that _bt_insertonpg() and _bt_split() don't have to think about LP_DEAD posting lists. As far as recovery is concerned, it is just another XLOG_BTREE_DELETE record, like any other. Note that there is a second call to _bt_binsrch_insert() within _bt_findinsertloc() when it has to generate a new XLOG_BTREE_DELETE record (by calling _bt_dedup_one_page(), which calls _bt_delitems_delete() in a way that isn't dependent on the BTP_HAS_GARBAGE status bit being set). > Anyway, it's worth to add more comments in > _bt_killitems() explaining why it's safe. There is no question that the little snippet of code I added to _bt_killitems() in v11 is still too complicated. We also have to consider cases where the array overflows because the scan direction was changed (see the kill_prior_tuple comment block in btgetuple()). Yeah, it's messy. > Attached is v12, which contains WAL optimizations for posting split and > page > deduplication. Cool. > * xl_btree_split record doesn't contain posting tuple anymore, instead > it keeps > 'in_posting offset' and repeats the logic of _bt_insertonpg() as you > proposed > upthread. That looks good. > * I introduced new xlog record XLOG_BTREE_DEDUP_PAGE, which contains > info about > groups of tuples deduplicated into posting tuples. In principle, it is > possible > to fit it into some existing record, but I preferred to keep things clear. I definitely think that inventing a new WAL record was the right thing to do. > I haven't measured how these changes affect WAL size yet. > Do you have any suggestions on how to automate testing of new WAL records? > Is there any suitable place in regression tests? I don't know about the regression tests (I doubt that there is a natural place for such a test), but I came up with a rough test case. 
I more or less copied the approach that you took with the index build WAL reduction patches, though I also figured out a way of subtracting heapam WAL overhead to get a real figure. I attach the test case -- note that you'll need to use the "land" database with this. (This test case might need to be improved, but it's a good start.)

> * I also noticed that _bt_dedup_one_page() can be optimized to return early
> when no tuples were deduplicated. I wonder if we can introduce an internal
> statistic to tune deduplication? That is returning to the idea of
> BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages that
> have very few duplicates or pages that are already full of posting lists.

I think that the BT_COMPRESS_THRESHOLD idea is closely related to making _bt_dedup_one_page() behave incrementally. On my machine, v12 of the patch actually uses slightly more WAL than v11 did with the nbtree_wal_test.sql test case -- it's 6510 MB of nbtree WAL in v12 vs. 6502 MB in v11 (note that v11 benefits from WAL compression, so if I turned that off v12 would
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Aug 29, 2019 at 10:10 PM Peter Geoghegan wrote:
> I see some Valgrind errors on v9, all of which look like the following
> two sample errors I go into below.

I've found a fix for these Valgrind issues. It's a matter of making sure that _bt_truncate() sizes new pivot tuples properly, which is quite subtle:

--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2155,8 +2155,11 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	{
 		BTreeTupleClearBtIsPosting(pivot);
 		BTreeTupleSetNAtts(pivot, keepnatts);
-		pivot->t_info &= ~INDEX_SIZE_MASK;
-		pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		if (keepnatts == natts)
+		{
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
 	}

I'm varying how the new pivot tuple is sized here according to whether or not index_truncate_tuple() just does a CopyIndexTuple(). This very slightly changes the behavior of the nbtsplitloc.c stuff, but that's not a concern for me. I will post a patch with this and other tweaks next week.
-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Aug 29, 2019 at 5:07 PM Peter Geoghegan wrote:
> I agree that v9 might be ever so slightly more space efficient than v5
> was, on balance.

I see some Valgrind errors on v9, all of which look like the following two sample errors I go into below.

First one:

==11193== VALGRINDERROR-BEGIN
==11193== Unaddressable byte(s) found during client check request
==11193==    at 0x4C0E03: PageAddItemExtended (bufpage.c:332)
==11193==    by 0x20F6C3: _bt_split (nbtinsert.c:1643)
==11193==    by 0x20F6C3: _bt_insertonpg (nbtinsert.c:1206)
==11193==    by 0x21239B: _bt_doinsert (nbtinsert.c:306)
==11193==    by 0x2150EE: btinsert (nbtree.c:207)
==11193==    by 0x20D63A: index_insert (indexam.c:186)
==11193==    by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193==    by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193==    by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193==    by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193==    by 0x36C738: ExecProcNode (executor.h:240)
==11193==    by 0x36C738: ExecutePlan (execMain.c:1648)
==11193==    by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193==    by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193==    by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193==    by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193==    by 0x4CD31C: PortalRun (pquery.c:796)
==11193==    by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193==    by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193==    by 0x453650: BackendRun (postmaster.c:4446)
==11193==    by 0x453650: BackendStartup (postmaster.c:4137)
==11193==    by 0x453650: ServerLoop (postmaster.c:1704)
==11193==    by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193==    by 0x3B85A1: main (main.c:210)
==11193==  Address 0x9c11350 is 0 bytes after a recently re-allocated block of size 8,192 alloc'd
==11193==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11193==    by 0x61085A: AllocSetAlloc (aset.c:914)
==11193==    by 0x617AD8: palloc (mcxt.c:938)
==11193==    by 0x21A829: _bt_mkscankey (nbtutils.c:107)
==11193==    by 0x2118F3: _bt_doinsert (nbtinsert.c:93)
==11193==    by 0x2150EE: btinsert (nbtree.c:207)
==11193==    by 0x20D63A: index_insert (indexam.c:186)
==11193==    by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193==    by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193==    by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193==    by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193==    by 0x36C738: ExecProcNode (executor.h:240)
==11193==    by 0x36C738: ExecutePlan (execMain.c:1648)
==11193==    by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193==    by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193==    by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193==    by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193==    by 0x4CD31C: PortalRun (pquery.c:796)
==11193==    by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193==    by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193==    by 0x453650: BackendRun (postmaster.c:4446)
==11193==    by 0x453650: BackendStartup (postmaster.c:4137)
==11193==    by 0x453650: ServerLoop (postmaster.c:1704)
==11193==    by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193==
==11193== VALGRINDERROR-END
{
   Memcheck:User
   fun:PageAddItemExtended
   fun:_bt_split
   fun:_bt_insertonpg
   fun:_bt_doinsert
   fun:btinsert
   fun:index_insert
   fun:ExecInsertIndexTuples
   fun:ExecInsert
   fun:ExecModifyTable
   fun:ExecProcNodeFirst
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:ProcessQuery
   fun:PortalRunMulti
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
   fun:BackendRun
   fun:BackendStartup
   fun:ServerLoop
   fun:PostmasterMain
   fun:main
}

nbtinsert.c:1643 is the first PageAddItem() in _bt_split() -- the lefthikey call. 
Second one:

==11193== VALGRINDERROR-BEGIN
==11193== Invalid read of size 2
==11193==    at 0x20FDF5: _bt_insertonpg (nbtinsert.c:1126)
==11193==    by 0x21239B: _bt_doinsert (nbtinsert.c:306)
==11193==    by 0x2150EE: btinsert (nbtree.c:207)
==11193==    by 0x20D63A: index_insert (indexam.c:186)
==11193==    by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193==    by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193==    by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193==    by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193==    by 0x36C738: ExecProcNode (executor.h:240)
==11193==    by 0x36C738: ExecutePlan (execMain.c:1648)
==11193==    by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193==    by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193==    by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193==    by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193==    by 0x4CD31C: PortalRun (pquery.c:796)
==11193==    by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193==    by 0x4C9EE0: PostgresMain (postgres.c:4256)
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Aug 29, 2019 at 5:13 AM Anastasia Lubennikova wrote: > Your explanation helped me to understand that this approach can be > extended to > the case of insertion into posting list, that doesn't trigger posting > split, > and that nbtsplitloc indeed doesn't need to know about posting tuples > specific. > The code is much cleaner now. Fantastic! > Some individual indexes are larger, some are smaller compared to the > expected output. I agree that v9 might be ever so slightly more space efficient than v5 was, on balance. In any case v9 completely fixes the regression that I saw in the last version. I have pushed the changes to the test output for the serial tests that I privately maintain, that I gave you access to. The MGD test output also looks perfect. We may find that deduplication is a little too effective, in the sense that it packs so many tuples on to leaf pages that *concurrent* inserters will tend to get excessive page splits. We may find that it makes sense to aim for posting lists that are maybe 96% of BTMaxItemSize() -- note that BTREE_SINGLEVAL_FILLFACTOR is 96 for this reason. Concurrent inserters will tend to have heap TIDs that are slightly out of order, so we want to at least have enough space remaining on the left half of a "single value mode" split. We may end up with a design where deduplication anticipates what will be useful for nbtsplitloc.c. I still think that it's too early to start worrying about problems like this one -- I feel it will be useful to continue to focus on the code and the space utilization of the serial test cases for now. We can look at it at the same time that we think about adding back something like BT_COMPRESS_THRESHOLD. I am mentioning it now because it's probably a good time for you to start thinking about it, if you haven't already (actually, maybe I'm just describing what BT_COMPRESS_THRESHOLD was supposed to do in the first place). 
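The "96% of BTMaxItemSize()" idea above amounts to a simple size-target calculation, mirroring how BTREE_SINGLEVAL_FILLFACTOR works for page splits. The sketch below is hypothetical: the constant 2704 only approximates BTMaxItemSize() for an 8KB page, and the names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical figures: BTMaxItemSize() is roughly a third of an 8KB page
 * in the real code, and 96 mirrors BTREE_SINGLEVAL_FILLFACTOR.
 */
#define MAX_ITEM_SIZE       2704
#define POSTING_FILLFACTOR  96

/*
 * Cap posting lists at a percentage of the maximum item size, leaving a
 * little slack so that concurrent inserters with slightly out-of-order
 * heap TIDs don't immediately force a page split.
 */
static size_t
posting_size_target(void)
{
    return (MAX_ITEM_SIZE * POSTING_FILLFACTOR) / 100;
}
```

The design choice is the same one BTREE_SINGLEVAL_FILLFACTOR embodies: deliberately not packing to 100% so the left half of a "single value mode" split retains free space.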
We'll need to have a good benchmark to assess these questions, and it's not obvious what that will be. Two possible candidates are TPC-H and TPC-E. (Of course, I mean running them for real -- not using their indexes to make sure that the nbtsplitloc.c stuff works well in isolation.) Any thoughts on a conventional benchmark that allows us to understand the patch's impact on both throughput and latency? BTW, I notice that we often have indexes that are quite a lot smaller when they were created with retail insertions rather than with CREATE INDEX/REINDEX. This is not new, but the difference is much larger than it typically is without the patch. For example, the TPC-E index on trade.t_ca_id (which is named "i_t_ca_id" or "i_t_ca_id2" in my test) is 162 MB with CREATE INDEX/REINDEX, and 121 MB with retail insertions (assuming the insertions use the actual order from the test). I'm not sure what to do about this, if anything. I mean, the reason that the retail insertions do better is that they have the nbtsplitloc.c stuff, and because we don't split the page until it's 100% full and until deduplication stops helping -- we could apply several rounds of deduplication before we actually have to split the page. So the difference that we see here is both logical and surprising. How do you feel about this CREATE INDEX index-size-is-larger business? -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova wrote: > Now the algorithm is the following: > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. > > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I think that the way this works within nbtsplitloc.c is too complicated. In v5, the only thing that nbtsplitloc.c knew about deduplication was that it could be sure that suffix truncation would at least make a posting list into a single heap TID in the worst case. This consideration was mostly about suffix truncation, not deduplication, which seemed like a good thing to me. _bt_split() and _bt_findsplitloc() should know as little as possible about posting lists. Obviously it will sometimes be necessary to deal with the case where a posting list is about to become too big (i.e. it's about to go over BTMaxItemSize()), and so must be split. Less often, a page split will be needed because of one of these posting list splits. These are two complicated areas (posting list splits and page splits), and it would be a good idea to find a way to separate them as much as possible. Remember, nbtsplitloc.c works by pretending that the new item that cannot fit on the page is already on its own imaginary version of the page that *can* fit the new item, along with everything else from the original/actual page. That gets *way* too complicated when it has to deal with the fact that the new item is being merged with an existing item. Perhaps nbtsplitloc.c could also "pretend" that the new item is always a plain tuple, without knowing anything about posting lists. Almost like how it worked in v5. We always want posting lists to be as close to the BTMaxItemSize() size as possible, because that helps with space utilization. 
In v5 of the patch, this was what happened, because, in effect, we didn't try to do anything complicated with the new item. This worked well, apart from the crash safety issue. Maybe we can simulate the v5 approach, giving us the best of all worlds (good space utilization, simplicity, and crash safety). Something like this: * Posting list splits should always result in one posting list that is at or just under BTMaxItemSize() in size, plus one plain tuple to its immediate right on the page. This is similar to the more common case where we cannot add additional tuples to a posting list due to the BTMaxItemSize() restriction, and so end up with a single tuple (or a smaller posting list with the same value) to the right of a BTMaxItemSize()-sized posting list tuple. I don't see a reason to split a posting list in the middle -- we should always split to the right, leaving the posting list as large as possible. * When there is a simple posting list split, with no page split, the logic required is fairly straightforward: We rewrite the posting list in-place so that our new item goes wherever it belongs in the existing posting list on the page (we memmove() the posting list to make space for the new TID, basically). The old last/rightmost TID in the original posting list becomes a new, plain tuple. We may need a new WAL record for this, but it's not that different to a regular leaf page insert. * When this happens to result in a page split, we then have a "fake" new item -- the right half of the posting list that we split, which is always a plain item. Obviously we need to be a bit careful with the WAL logging, but the space accounting within _bt_split() and _bt_findsplitloc() can work just the same as now. nbtsplitloc.c can work like it did in v5, when the only thing it knew about posting lists was that _bt_truncate() always removes them, maybe leaving a single TID behind in the new high key. 
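The "split to the right" rule sketched in the bullets above can be illustrated with a toy model (TIDs reduced to integers, a fixed capacity standing in for BTMaxItemSize(); all names hypothetical): a new TID is memmove()'d into its sorted position in the posting list, and when the list is full, either the old rightmost TID or the new TID itself is pushed out as a plain tuple to the immediate right.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* TIDs reduced to ints; capacity stands in for BTMaxItemSize(). */
#define POSTING_MAX_TIDS 4

typedef struct PostingList
{
    int    tids[POSTING_MAX_TIDS];
    size_t ntids;               /* always sorted ascending */
} PostingList;

/*
 * Insert newtid into a sorted posting list.  If the list is already at
 * capacity, the TID that ends up rightmost is pushed out through *evicted
 * (it becomes a plain tuple immediately to the right of the posting
 * list); returns true when that happens.
 */
static bool
posting_list_insert(PostingList *plist, int newtid, int *evicted)
{
    size_t pos = 0;

    while (pos < plist->ntids && plist->tids[pos] < newtid)
        pos++;

    if (plist->ntids < POSTING_MAX_TIDS)
    {
        /* Simple case: memmove() the tail to open a gap at pos. */
        memmove(&plist->tids[pos + 1], &plist->tids[pos],
                (plist->ntids - pos) * sizeof(int));
        plist->tids[pos] = newtid;
        plist->ntids++;
        return false;
    }

    if (pos == plist->ntids)
    {
        /* New TID is rightmost: it becomes the plain tuple itself. */
        *evicted = newtid;
        return true;
    }

    /* Old rightmost TID is pushed out as a plain tuple to the right. */
    *evicted = plist->tids[plist->ntids - 1];
    memmove(&plist->tids[pos + 1], &plist->tids[pos],
            (plist->ntids - 1 - pos) * sizeof(int));
    plist->tids[pos] = newtid;
    return true;
}
```

Note how the split is always "to the right": the posting list stays as large as possible, and only a single plain tuple emerges, matching the space-accounting simplification described above.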
(Note also that it's not okay to remove the conservative assumption about at least having space for one heap TID within _bt_recsplitloc() -- that needs to be restored to its v5 state in the next version of the patch.) Because deduplication is lazy, there is little value in doing deduplication of the new item (which may or may not be the fake new item). The nbtsplitloc.c logic will "trap" duplicates on the same page today, so we can just let deduplication of the new item happen at a later time. _bt_split() can almost pretend that posting lists don't exist, and nbtsplitloc.c needs to know nothing about posting lists (apart from the way that _bt_truncate() behaves with posting lists). We "lie" to _bt_findsplitloc(), and tell it that the new item is our fake new item -- it doesn't do anything that will be broken by that lie, because it doesn't care about the actual content of posting lists. And, we can fix the "fake new item is not actually real new item" issue at one point within _bt_split(), just as we're about to WAL log. What do you think of that approach? -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
23.08.2019 7:33, Peter Geoghegan wrote:
> On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova wrote:
> > I'm going to look through the patch once more to update nbtxlog
> > comments, where needed, and answer your remarks that are still left
> > in the comments.
> Have you been using amcheck's rootdescend verification?

No, I haven't checked it with the latest version yet.

> There were many large indexes that amcheck didn't detect a problem
> with. I don't yet understand what the problem is, or why we only see
> the problem for a small number of indexes. Note that all of these
> indexes passed verification with v5, so this is some kind of
> regression. I also noticed that there were some regressions in the
> size of indexes -- indexes were not nearly as small as they were in v5
> in some cases. The overall picture was a clear regression in how
> effective deduplication is.

Do these indexes have something in common? Maybe some specific workload? Are there any error messages in the log? I'd like to pinpoint what caused the problem.

There were several major changes between v5 and v8:
- dead tuples handling added in v6;
- _bt_split changes for posting tuples in v7;
- WAL logging of posting tuple changes in v8.

I don't think the last one could break regular indexes on master. Do you see the same regression in v6 and v7?

> I think that it would save time if you had direct access to my test
> data, even though it's a bit cumbersome. You'll have to download about
> 10GB of dumps, which require plenty of disk space when restored:
> Want me to send this data and the associated tests script over to you?

Yes, I think it will help me to debug the patch faster.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova wrote:
> I'm going to look through the patch once more to update nbtxlog
> comments, where needed, and answer your remarks that are still left in
> the comments.

Have you been using amcheck's rootdescend verification? I see this problem with v8, with the TPC-H test data:

DEBUG:  finished verifying presence of 150 tuples from table "customer" with bitset 51.09% set
ERROR:  could not find tuple using search from root page in index "idx_customer_nationkey2"

I've been running my standard amcheck query with these databases, which is:

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true, rootdescend => true),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
      AND c.relpersistence != 't'
      AND c.relkind = 'i'
      AND i.indisready
      AND i.indisvalid
ORDER BY c.relpages DESC;

There were many large indexes that amcheck didn't detect a problem with. I don't yet understand what the problem is, or why we only see the problem for a small number of indexes. Note that all of these indexes passed verification with v5, so this is some kind of regression. I also noticed that there were some regressions in the size of indexes -- indexes were not nearly as small as they were in v5 in some cases. The overall picture was a clear regression in how effective deduplication is. I think that it would save time if you had direct access to my test data, even though it's a bit cumbersome. 
You'll have to download about 10GB of dumps, which require plenty of disk space when restored:

regression=# \l+
                                              List of databases
    Name    | Owner | Encoding |  Collate   |   Ctype    | Access privileges |  Size   | Tablespace |                Description
------------+-------+----------+------------+------------+-------------------+---------+------------+--------------------------------------------
 land       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 6425 MB | pg_default |
 mgd        | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 61 GB   | pg_default |
 postgres   | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 7753 kB | pg_default | default administrative connection database
 regression | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 886 MB  | pg_default |
 template0  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | unmodifiable empty database
            |       |          |            |            | pg=CTc/pg         |         |            |
 template1  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | default template for new databases
            |       |          |            |            | pg=CTc/pg         |         |            |
 tpcc       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 10 GB   | pg_default |
 tpce       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 26 GB   | pg_default |
 tpch       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 32 GB   | pg_default |
(9 rows)

I have found it very valuable to use this test data when changing nbtsplitloc.c, or anything that could affect where page splits make free space available. If this is too much data to handle conveniently, then you could skip "mgd" and almost have as much test coverage. There really does seem to be a benefit to using diverse test cases like this, because sometimes regressions only affect a small number of specific indexes for specific reasons. For example, only TPC-H has a small number of indexes that have tuples that are inserted in order, but also have many duplicates. Removing the BT_COMPRESS_THRESHOLD stuff really helped with those indexes. Want me to send this data and the associated tests script over to you?
-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
20.08.2019 4:04, Peter Geoghegan wrote:
> On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova wrote:
> > It seems that now all replace operations are crash-safe. The new
> > patch passes all regression tests, so I think it's ready for review
> > again.
> I'm looking at it now. I'm going to spend a significant amount of time
> on this tomorrow. I think that we should start to think about efficient
> WAL-logging now.

Thank you for the review. The new version v8 is attached. Compared to the previous version, this patch includes updated btree_xlog_insert() and btree_xlog_split(), so that WAL records now only contain data about the updated posting tuple and don't require full page writes. I haven't updated pg_waldump yet; that is postponed until we agree on the nbtxlog changes. Also in this patch I renamed all 'compress' keywords to 'deduplicate' and did minor cleanup of outdated comments. I'm going to look through the patch once more to update nbtxlog comments where needed, and answer your remarks that are still left in the comments.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit d73c1b8e10177dfb55ff1b1bac999f85d2a0298d
Author: Anastasia
Date:   Wed Aug 21 20:00:54 2019 +0300

    v8-0001-Deduplication-in-nbtree.patch

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..ddc511a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 	size_t		tupsize;
 	BTScanInsert skey;
 	bool		lowersizelimit;
+	ItemPointer scantid;

 	CHECK_FOR_INTERRUPTS();

@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)

 	/*
 	 * Readonly callers may optionally verify that non-pivot tuples can
-	 * each be found by an independent search that starts from the root
+	 * each be found by an independent search that starts from the root.
+	 * Note that we deliberately don't do individual searches for each
+	 * "logical" posting list tuple, since the posting list itself is
+	 * validated by other checks.
 	 */
 	if (state->rootdescend && P_ISLEAF(topaque) &&
 		!bt_rootdescend(state, itup))
 	{
 		char	   *itid,
 				   *htid;
+		ItemPointer tid = BTreeTupleGetHeapTID(itup);

 		itid = psprintf("(%u,%u)", state->targetblock, offset);
 		htid = psprintf("(%u,%u)",
-						ItemPointerGetBlockNumber(&(itup->t_tid)),
-						ItemPointerGetOffsetNumber(&(itup->t_tid)));
+						ItemPointerGetBlockNumber(tid),
+						ItemPointerGetOffsetNumber(tid));

 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("could not find tuple using search from root page in index \"%s\"",
 						RelationGetRelationName(state->rel)),
-				 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+				 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
									itid, htid,
									(uint32) (state->targetlsn >> 32),
									(uint32) state->targetlsn)));
 	}

+	/*
+	 * If tuple is actually a posting list, make sure posting list TIDs
+	 * are in order.
+	 */
+	if (BTreeTupleIsPosting(itup))
+	{
+		ItemPointerData last;
+		ItemPointer current;
+
+		ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+		for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+		{
+			current = BTreeTupleGetPostingN(itup, i);
+
+			if (ItemPointerCompare(current, &last) <= 0)
+			{
+				char	   *itid,
+						   *htid;
+
+				itid = psprintf("(%u,%u)", state->targetblock, offset);
+				htid = psprintf("(%u,%u)",
+								ItemPointerGetBlockNumberNoCheck(current),
+								ItemPointerGetOffsetNumberNoCheck(current));
+
+				ereport(ERROR,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("posting list heap TIDs out of order in index \"%s\"",
+								RelationGetRelationName(state->rel)),
+						 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
											itid, htid,
											(uint32) (state->targetlsn >> 32),
											(uint32) state->targetlsn)));
+			}
+
+			ItemPointerCopy(current, &last);
+		}
+	}
+
 	/* Build insertion scankey for current page offset */
 	skey = bt_mkscankey_pivotsearch(state->rel, itup);

@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		IndexTuple	norm;

-		norm = bt_normalize_tuple(state, itup);
-		bloom_add_element(state->filter, (unsigned char *) norm,
-						  IndexTupleSize(norm));
-		/* Be tidy */
-		if (norm != itup)
-			pfree(norm);
+		if (BTreeTupleIsPosting(itup))
+		{
+			IndexTuple	onetup;
+
+			/* Fingerprint all elements of posting tuple one by one */
+			for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+				norm = bt_normalize_tuple(state, onetup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != onetup)
+
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova wrote: > Now the algorithm is the following: > > - If bt_findinsertloc() found out that tuple belongs to existing posting > tuple's > TID interval, it sets 'in_posting_offset' variable and passes it to > _bt_insertonpg() > > - If 'in_posting_offset' is valid and origtup is valid, > merge our itup into origtup. > > It can result in one tuple neworigtup, that must replace origtup; or two > tuples: > neworigtup and newrighttup, if the result exceeds BTMaxItemSize, That sounds like the right way to do it. > - If two new tuple(s) fit into the old page, we're lucky. > call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to > atomically replace oldtup with new tuple(s) and generate xlog record. > > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. That makes sense, since _bt_split() is responsible for both splitting the page, and inserting the new item on either the left or right page, as part of the first phase of a page split. In other words, if you're adding something new to _bt_insertonpg(), you probably also need to add something new to _bt_split(). So that's what you did. > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I'm looking at it now. I'm going to spend a significant amount of time on this tomorrow. I think that we should start to think about efficient WAL-logging now. > In the meantime, I'll run more stress-tests. As you probably realize, wal_consistency_checking is a good thing to use with your tests here. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Aug 13, 2019 at 8:45 AM Anastasia Lubennikova wrote:
> > I still need to think about the exact details of alignment within
> > _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
> > could be wrong.
> Could you explain more about these cases? Now I don't understand the
> problem. Maybe there is no problem.
> Thank you for the patch. Still, I'd suggest leaving it as a possible
> future improvement, so that it doesn't distract us from the original
> feature.

I don't even think that it's useful work for the future. It's just nice to be sure that we could support unique index deduplication if it made sense. Which it doesn't. If I didn't write the patch that implements deduplication for unique indexes, I might still not realize that we need the index_compute_xid_horizon_for_tuples() stuff in certain other places. I'm not serious about it at all, except as a learning exercise/experiment.

> I added to v6 another related fix for _bt_compress_one_page(). Previous
> code implicitly deleted DEAD items without calling
> index_compute_xid_horizon_for_tuples(). New code has a check whether
> DEAD items on the page exist and removes them if any. Another possible
> solution is to copy dead items as is from the old page to the new one,
> but I think it's good to remove dead tuples as fast as possible.

I think that what you've done in v7 is probably the best way to do it. It's certainly simple, which is appropriate given that we're not really expecting to see LP_DEAD items within _bt_compress_one_page() (we just need to be prepared for them).

> > v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
> > existing item if it's an LP_DEAD item that falls in the same TID range
> > (that's _bt_compare()-wise "equal" to an existing tuple, which may or
> > may not be a posting list tuple already). I haven't made this code do
> > something like call index_compute_xid_horizon_for_tuples(), even
> > though that's needed for correctness (i.e. this new code is currently
> > broken in the same way that I mentioned unique index support is
> > broken).
> Is it possible that the DEAD tuple to delete was smaller than itup?

I'm not sure what you mean by this. I suppose that it doesn't matter, since we both prefer the alternative that you came up with anyway.

> > How do you feel about officially calling this deduplication, not
> > compression? I think that it's a more accurate name for the technique.
> I agree. Should I rename all related names of functions and variables
> in the patch?

Please rename them when convenient.
-- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
13.08.2019 18:45, Anastasia Lubennikova wrote:
>> I also added a nearby FIXME comment to _bt_insertonpg_in_posting() --
>> I don't think that the code for splitting a posting list in two is
>> currently crash-safe.
> Good catch. It seems that I need to rearrange the code. I'll send an
> updated patch this week.

Attached is v7.

In this version of the patch, I heavily refactored the code of insertion
into a posting tuple. The bt_split logic is quite complex, so I omitted a
couple of optimizations. They are mentioned in TODO comments.

Now the algorithm is the following:

- If _bt_findinsertloc() found out that the tuple belongs to an existing
  posting tuple's TID interval, it sets the 'in_posting_offset' variable
  and passes it to _bt_insertonpg().

- If 'in_posting_offset' is valid and origtup is valid, merge our itup
  into origtup. It can result in one tuple, neworigtup, that must
  replace origtup; or two tuples, neworigtup and newrighttup, if the
  result exceeds BTMaxItemSize.

- If the new tuple(s) fit into the old page, we're lucky: call
  _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
  atomically replace oldtup with the new tuple(s) and generate an xlog
  record.

- In case a page split is needed, pass both tuples to _bt_split().
  _bt_findsplitloc() is now aware of the upcoming replacement of origtup
  with neworigtup, so it uses the correct item size where needed.

It seems that now all replace operations are crash-safe. The new patch
passes all regression tests, so I think it's ready for review again. In
the meantime, I'll run more stress-tests.
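The merge-then-maybe-split step described above can be sketched in miniature. This is a hypothetical, self-contained model -- plain `(block, offset)` pairs stand in for ItemPointerData, and a simple item-count cap stands in for BTMaxItemSize -- not the patch's actual code:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset). */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

int
tid_cmp(Tid a, Tid b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Merge newtid into a sorted posting list in place.  If the merged list
 * would exceed maxitems, move the upper half into "right" -- the
 * neworigtup/newrighttup outcome described in the message.  Returns the
 * number of TIDs remaining in "posting"; *nright is 0 when no split was
 * needed.  "posting" must have room for *nposting + 1 entries.
 */
int
posting_merge(Tid *posting, int *nposting, int maxitems,
              Tid newtid, Tid *right, int *nright)
{
    int pos = 0;

    /* Find the insertion point that keeps TIDs in ascending order. */
    while (pos < *nposting && tid_cmp(posting[pos], newtid) < 0)
        pos++;
    memmove(&posting[pos + 1], &posting[pos],
            (size_t) (*nposting - pos) * sizeof(Tid));
    posting[pos] = newtid;
    (*nposting)++;

    *nright = 0;
    if (*nposting > maxitems)
    {
        int nleft = *nposting / 2;

        *nright = *nposting - nleft;
        memcpy(right, &posting[nleft], (size_t) *nright * sizeof(Tid));
        *nposting = nleft;
    }
    return *nposting;
}
```

The real code additionally has to reconstruct whole index tuples, size them in bytes rather than item counts, and WAL-log the replacement atomically; this only models the TID bookkeeping.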
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..504bca2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/*
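The invariant enforced by the amcheck hunk above can be shown as a tiny self-contained sketch. The types and names here are hypothetical stand-ins for ItemPointerData and the nbtree posting-list accessors, not PostgreSQL code:

```c
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset). */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

int
tid_cmp(Tid a, Tid b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Returns 1 when every TID in the posting list is strictly greater than
 * its predecessor -- the invariant the verify_nbtree.c check enforces --
 * and 0 as soon as a duplicate or out-of-order TID is found.
 */
int
posting_tids_in_order(const Tid *posting, int nposting)
{
    for (int i = 1; i < nposting; i++)
    {
        if (tid_cmp(posting[i], posting[i - 1]) <= 0)
            return 0;
    }
    return 1;
}
```

Note the `<= 0` comparison: duplicate TIDs within one posting list are treated as corruption too, since a heap TID can appear at most once per index.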
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
06.08.2019 4:28, Peter Geoghegan wrote:
> Attached is v5, which is based on your v4. The three main differences
> between this and v4 are:
>
> * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
> July 24 e-mail. We can always add something like this back during
> performance validation of the patch. Right now, having no
> BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
> certain important cases, which seems more important than the
> uncertain/speculative downside.

Fair enough. I think we can measure performance and make a decision once
the patch stabilizes.

> * We now have experimental support for unique indexes. This is broken
> out into its own patch.
>
> * We now handle LP_DEAD items in a special way within
> _bt_insertonpg_in_posting(). As you pointed out already, we do need to
> think about LP_DEAD items directly, rather than assuming that they
> cannot be on the page that _bt_insertonpg_in_posting() must process.
> More on that later.
>
> > If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting
> > tuple can be larger. It may happen if keysize <= 4 byte. In this
> > situation original tuples must have been aligned to size 16 bytes
> > each, and resulting tuple is at most 24 bytes (6+2+4+6+6). So this
> > case is also safe.
>
> I still need to think about the exact details of alignment within
> _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
> could be wrong.

Could you explain more about these cases? Now I don't understand the
problem.

> > The main reason why I decided to avoid applying compression to
> > unique indexes is the performance of microvacuum. It is not applied
> > to items inside a posting tuple. And I expect it to be important for
> > unique indexes, which ideally contain only a few live values.
>
> I found that the performance of my experimental patch with unique
> index was significantly worse. It looks like this is a bad idea, as
> you predicted, though we may still want to do deduplication/compression
> with NULL values in unique indexes. I did learn a few things from
> implementing unique index support, though.
>
> BTW, there is a subtle bug in how my unique index patch does
> WAL-logging -- see my comments within
> index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
> replication isn't used. I don't think that we're going to use this
> experimental patch at all, so I didn't bother fixing the bug.

Thank you for the patch.
Still, I'd suggest to leave it as a possible future improvement, so that
it doesn't distract us from the original feature.

> > if (ItemIdIsDead(itemId))
> >     continue;
> >
> > In the previous review Rafia asked about "some reason". Trying to
> > figure out if this situation is possible, I changed this line to
> > Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in
> > a performance test. Unfortunately, I was not able to reproduce it.
>
> I found it easy enough to see LP_DEAD items within
> _bt_insertonpg_in_posting() when running pgbench with the extra unique
> index patch. To give you a simple example of how this can happen,
> consider the comments about BTP_HAS_GARBAGE within
> _bt_delitems_vacuum(). That probably isn't the only way it can happen,
> either. ISTM that we need to be prepared for LP_DEAD items during
> deduplication, rather than trying to prevent deduplication from ever
> having to see an LP_DEAD item.

I added to v6 another related fix for _bt_compress_one_page(). Previous
code implicitly deleted DEAD items without calling
index_compute_xid_horizon_for_tuples(). New code checks whether DEAD
items exist on the page and removes them if any. Another possible
solution is to copy dead items as-is from the old page to the new one,
but I think it's good to remove dead tuples as fast as possible.

> v5 makes _bt_insertonpg_in_posting() prepared to overwrite an existing
> item if it's an LP_DEAD item that falls in the same TID range (that's
> _bt_compare()-wise "equal" to an existing tuple, which may or may not
> be a posting list tuple already). I haven't made this code do
> something like call index_compute_xid_horizon_for_tuples(), even
> though that's needed for correctness (i.e. this new code is currently
> broken in the same way that I mentioned unique index support is
> broken).

Is it possible that the DEAD tuple to delete was smaller than itup?

> I also added a nearby FIXME comment to _bt_insertonpg_in_posting() --
> I don't think that the code for splitting a posting list in two is
> currently crash-safe.

Good catch. It seems that I need to rearrange the code. I'll send an
updated patch this week.

> How do you feel about officially calling this deduplication, not
> compression? I think that it's a more accurate name for the technique.

I agree. Should I rename all related names of functions and variables in
the patch?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 9ac37503c71f7623413a2e406d81f5c9a4b02742
Author: Anastasia
Date:   Tue Aug 13 17:00:41 2019 +0300
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Jul 31, 2019 at 9:23 AM Anastasia Lubennikova wrote: > > * Included my own pageinspect hack to visualize the minimum TIDs in > > posting lists. It's broken out into a separate patch file. The code is > > very rough, but it might help someone else, so I thought I'd include > > it. > Cool, I think we should add it to the final patchset, > probably, as separate function by analogy with tuple_data_split. Good idea. Attached is v5, which is based on your v4. The three main differences between this and v4 are: * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my July 24 e-mail. We can always add something like this back during performance validation of the patch. Right now, having no BT_COMPRESS_THRESHOLD limit definitely improves space utilization for certain important cases, which seems more important than the uncertain/speculative downside. * We now have experimental support for unique indexes. This is broken out into its own patch. * We now handle LP_DEAD items in a special way within _bt_insertonpg_in_posting(). As you pointed out already, we do need to think about LP_DEAD items directly, rather than assuming that they cannot be on the page that _bt_insertonpg_in_posting() must process. More on that later. > If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple > can be > larger. It may happen if keysize <= 4 byte. > In this situation original tuples must have been aligned to size 16 > bytes each, > and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is > also safe. I still need to think about the exact details of alignment within _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I could be wrong. > I changed DEBUG message to ERROR in v4 and it passes all regression tests. > I doubt that it covers all corner cases, so I'll try to add more special > tests. It also passes my tests, FWIW. > Hmm, I can't get the problem. 
> In current implementation each posting tuple is smaller than BTMaxItemSize, > so no split can lead to having tuple of larger size. That sounds correct, then. > No, we don't need them both. I don't mind combining them into one macro. > Actually, we never needed BTreeTupleGetMinTID(), > since its functionality is covered by BTreeTupleGetHeapTID. I've removed BTreeTupleGetMinTID() in v5. I think it's fine to just have a comment next to BTreeTupleGetHeapTID(), and another comment next to BTreeTupleGetMaxTID(). > The main reason why I decided to avoid applying compression to unique > indexes > is the performance of microvacuum. It is not applied to items inside a > posting > tuple. And I expect it to be important for unique indexes, which ideally > contain only a few live values. I found that the performance of my experimental patch with unique index was significantly worse. It looks like this is a bad idea, as you predicted, though we may still want to do deduplication/compression with NULL values in unique indexes. I did learn a few things from implementing unique index support, though. BTW, there is a subtle bug in how my unique index patch does WAL-logging -- see my comments within index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if replication isn't used. I don't think that we're going to use this experimental patch at all, so I didn't bother fixing the bug. > if (ItemIdIsDead(itemId)) > continue; > > In the previous review Rafia asked about "some reason". > Trying to figure out if this situation possible, I changed this line to > Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a > performance > test. Unfortunately, I was not able to reproduce it. I found it easy enough to see LP_DEAD items within _bt_insertonpg_in_posting() when running pgbench with the extra unique index patch. To give you a simple example of how this can happen, consider the comments about BTP_HAS_GARBAGE within _bt_delitems_vacuum(). 
That probably isn't the only way it can happen, either. ISTM that we
need to be prepared for LP_DEAD items during deduplication, rather than
trying to prevent deduplication from ever having to see an LP_DEAD item.

v5 makes _bt_insertonpg_in_posting() prepared to overwrite an existing
item if it's an LP_DEAD item that falls in the same TID range (that's
_bt_compare()-wise "equal" to an existing tuple, which may or may not be
a posting list tuple already). I haven't made this code do something
like call index_compute_xid_horizon_for_tuples(), even though that's
needed for correctness (i.e. this new code is currently broken in the
same way that I mentioned unique index support is broken).

I also added a nearby FIXME comment to _bt_insertonpg_in_posting() -- I
don't think that the code for splitting a posting list in two is
currently crash-safe.

How do you feel about officially calling this deduplication, not
compression? I think that it's a more accurate name for the technique.

-- 
Peter Geoghegan

v5-0001-Compression-deduplication-in-nbtree.patch
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
24.07.2019 4:22, Peter Geoghegan wrote:
> Attached is a revised version of your v2 that fixes this issue -- I'll
> call this v3. In general, my goal for the revision was to make sure
> that all of my old tests from the v12 work passed, and to make sure
> that amcheck can detect almost any possible problem. I tested the
> amcheck changes by corrupting random state in a test index using
> pg_hexedit, then making sure that amcheck actually complained in each
> case. I also fixed one or two bugs in passing, including the bug that
> caused an assertion failure in _bt_truncate(). That was down to a
> subtle off-by-one issue within _bt_insertonpg_in_posting(). Overall, I
> didn't make that many changes to your v2. There are probably some
> things about the patch that I still don't understand, or things that I
> have misunderstood.

Thank you for this review and fixes.

> * Changed the custom binary search code within _bt_compare_posting()
> to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
> of any reason not to do it that way?

It's ok to update it. There was no particular reason, just my habit.

> * Added quite a few "FIXME"/"XXX" comments at various points, to
> indicate where I have general concerns that need more discussion.

+ * FIXME: The calls to BTreeGetNthTupleOfPosting() allocate memory,

If we only need to check TIDs, we don't need BTreeGetNthTupleOfPosting();
we can use BTreeTupleGetPostingN() instead and iterate over TIDs, not
tuples. Fixed in version 4.

> * Included my own pageinspect hack to visualize the minimum TIDs in
> posting lists. It's broken out into a separate patch file. The code is
> very rough, but it might help someone else, so I thought I'd include
> it.

Cool, I think we should add it to the final patchset, probably as a
separate function, by analogy with tuple_data_split.
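For readers unfamiliar with the `_bt_binsrch()` style being referred to, here is a hypothetical, self-contained sketch of a lower-bound binary search over a sorted posting list. The `Tid` type and function names are stand-ins, not the patch's `_bt_compare_posting()`:

```c
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset). */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

int
tid_cmp(Tid a, Tid b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Lower-bound binary search over a sorted posting list, in the low/high
 * style of _bt_binsrch(): returns the index of the first TID >= key,
 * or nposting when every TID is smaller than key.
 */
int
posting_binsrch(const Tid *posting, int nposting, Tid key)
{
    int low = 0;
    int high = nposting;

    while (low < high)
    {
        int mid = low + (high - low) / 2;

        if (tid_cmp(posting[mid], key) < 0)
            low = mid + 1;      /* key is strictly above the midpoint */
        else
            high = mid;         /* midpoint is a candidate answer */
    }
    return low;
}
```

The half-open `[low, high)` invariant is what makes this shape easy to verify, which is presumably why matching the existing `_bt_binsrch()` convention was preferred over a custom loop.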
> I also have some new concerns about the code in the patch that I will
> point out now (though only as something to think about a solution on
> -- I am unsure myself):
>
> * It's a bad sign that compression involves calls to PageAddItem()
> that are allowed to fail (we just give up on compression when that
> happens). For one thing, all existing calls to PageAddItem() in
> Postgres are never expected to fail -- if they do fail we get a "can't
> happen" error that suggests corruption. It was a good idea to take
> this approach to get the patch to work, and to prove the general idea,
> but we now need to fully work out all the details about the use of
> space. This includes complicated new questions around how alignment is
> supposed to work.

The main reason to implement this gentle error handling is that
deduplication could cause storage overhead, which leads to running out
of space on the page.

First of all, it is a legacy of the previous versions, where
BTreeFormPostingTuple was not able to form a non-posting tuple even when
the number of posting items is 1.

Another case that was on my mind is the situation where we have 2
tuples:

    t_tid | t_info | key
  + t_tid | t_info | key

and the compressed result is:

    t_tid | t_info | key | t_tid | t_tid

If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting
tuple can be larger. It may happen if keysize <= 4 bytes. In this
situation the original tuples must each have been aligned to 16 bytes,
and the resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is
also safe.

I changed the DEBUG message to ERROR in v4 and it passes all regression
tests. I doubt that it covers all corner cases, so I'll try to add more
special tests.
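The size arithmetic above can be worked through explicitly. This sketch assumes the figures from the message -- a 6-byte t_tid, 2-byte t_info, and an 8-byte MAXALIGN quantum; the helper names are hypothetical, not nbtree's:

```c
#include <stddef.h>

/* Round a length up to the next 8-byte boundary, like MAXALIGN() with
 * MAXIMUM_ALIGNOF == 8. */
size_t
maxalign(size_t len)
{
    return (len + 7) & ~(size_t) 7;
}

/* Two separate index tuples with the same key, each aligned on its own:
 * 2 * MAXALIGN(t_tid + t_info + key). */
size_t
plain_pair_size(size_t keysize)
{
    return 2 * maxalign(6 + 2 + keysize);
}

/* One posting tuple holding both heap TIDs, aligned once:
 * MAXALIGN(t_tid + t_info + key + 2 * t_tid). */
size_t
posting_pair_size(size_t keysize)
{
    return maxalign(6 + 2 + keysize + 2 * 6);
}
```

With a 4-byte key each plain tuple occupies MAXALIGN(12) = 16 bytes, so the pair costs 32 bytes, while the posting tuple costs MAXALIGN(24) = 24 bytes: larger than either single tuple (the overhead being discussed), but still smaller than the pair it replaces, which is why the message concludes the case is safe.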
> Alignment in nbtree is already complicated today -- you're supposed to
> MAXALIGN() everything in nbtree, so that the MAXALIGN() within
> bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
> length (note that heapam can have tuples whose lp_len isn't aligned,
> so nbtree could do it differently if it proved useful). Code within
> nbtsplitloc.c fully understands the space requirements for the
> bufpage.c routines, and is very careful about it. (The bufpage.c
> details are supposed to be totally hidden from code like
> nbtsplitloc.c, but I guess that that ideal isn't quite possible in
> reality. Code comments don't really explain the situation today.)
>
> I'm not sure what it would look like for this patch to be as precise
> about free space as nbtsplitloc.c already is, even though that seems
> desirable (I just know that it would mean you would trust
> PageAddItem() to work in all cases). The patch is different to what we
> already have today in that it tries to add *less than* a single
> MAXALIGN() quantum at a time in some places (when a posting list needs
> to grow by one item). The devil is in the details.
>
> * As you know, the current approach to WAL logging is very
> inefficient. I think that this is subtly related to the last item
> (i.e. the one about alignment).
>
> I have done basic performance tests using unlogged tables. The patch
> seems to either make big INSERT
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, 25 Jul 2019 at 05:49, Peter Geoghegan wrote:
>
> On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan wrote:
> > There seems to be a kind of "synergy" between the nbtsplitloc.c
> > handling of pages that have lots of duplicates and posting list
> > compression. It seems as if the former mechanism "sets up the bowling
> > pins", while the latter mechanism "knocks them down", which is really
> > cool. We should try to gain a better understanding of how that works,
> > because it's possible that it could be even more effective in some
> > cases.
>
> I found another important way in which this synergy can fail to take
> place, which I can fix.
>
> By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
> from my test suite become much smaller, while most are not affected.
> These indexes were not helped too much by the patch before. For
> example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
> of duplicates of a single value (that's how it appears after an
> initial TPC-E bulk load), as are a couple of other TPC-E indexes.
> TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
> idx_lineitem_orderkey index becomes ~15% smaller.
>
> I believe that this happened because rightmost page splits were an
> inefficient case for compression. But rightmost page split heavy
> indexes with lots of duplicates are not that uncommon. Think of any
> index with many NULL values, for example.
>
> I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
> not sure what the idea is behind it. My sense is that we're likely to
> benefit by delaying page splits, no matter what. Though I am still
> looking at it purely from a space utilization point of view, at least
> for now.

Minor comment fix, pointes-->pointer, plus, are we really doing the
half, or is it just splitting into two.

+/*
+ * Split posting tuple into two halves.
+ *
+ * Left tuple contains all item pointes less than the new one and
+ * right tuple contains new item pointer and all to the right.
+ *
+ * TODO Probably we can come up with more clever algorithm.
+ */

Some remains of 'he'.

+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what he expects
+ */

Everything reads just fine without 'us'.

+/*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */

-- 
Regards,
Rafia Sabih
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan wrote: > There seems to be a kind of "synergy" between the nbtsplitloc.c > handling of pages that have lots of duplicates and posting list > compression. It seems as if the former mechanism "sets up the bowling > pins", while the latter mechanism "knocks them down", which is really > cool. We should try to gain a better understanding of how that works, > because it's possible that it could be even more effective in some > cases. I found another important way in which this synergy can fail to take place, which I can fix. By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes from my test suite become much smaller, while most are not affected. These indexes were not helped too much by the patch before. For example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full of duplicates of a single value (that's how it appears after an initial TPC-E bulk load), as are a couple of other TPC-E indexes. TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its idx_lineitem_orderkey index becomes ~15% smaller. I believe that this happened because rightmost page splits were an inefficient case for compression. But rightmost page split heavy indexes with lots of duplicates are not that uncommon. Think of any index with many NULL values, for example. I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm not sure what the idea is behind it. My sense is that we're likely to benefit by delaying page splits, no matter what. Though I am still looking at it purely from a space utilization point of view, at least for now. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Jul 23, 2019 at 6:22 PM Peter Geoghegan wrote: > Attached is a revised version of your v2 that fixes this issue -- I'll > call this v3. Remember that index that I said was 5.5x smaller with the patch applied, following retail insertions (a single big INSERT ... SELECT ...)? Well, it's 6.5x faster with this small additional patch applied on top of the v3 I posted yesterday. Many of the indexes in my test suite are about ~20% smaller __in addition to__ very big size reductions. Some are even ~30% smaller than they were with v3 of the patch. For example, the fair use implementation of TPC-H that my test data comes from has an index on the "orders" o_orderdate column, named idx_orders_orderdate, which is made ~30% smaller by the addition of this simple patch (once again, this is following a single big INSERT ... SELECT ...). This change makes idx_orders_orderdate ~3.3x smaller than it is with master/Postgres 12, in case you were wondering. This new patch teaches nbtsplitloc.c to subtract posting list overhead when sizing the new high key for the left half of a candidate split point, since we know for sure that _bt_truncate() will at least manage to truncate away that much from the new high key, even in the worst case. Since posting lists are often very large, this can make a big difference. This is actually just a bugfix, not a new idea -- I merely made nbtsplitloc.c understand how truncation works with posting lists. There seems to be a kind of "synergy" between the nbtsplitloc.c handling of pages that have lots of duplicates and posting list compression. It seems as if the former mechanism "sets up the bowling pins", while the latter mechanism "knocks them down", which is really cool. We should try to gain a better understanding of how that works, because it's possible that it could be even more effective in some cases. -- Peter Geoghegan 0003-Account-for-posting-list-overhead-during-splits.patch Description: Binary data
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Jul 19, 2019 at 7:24 PM Peter Geoghegan wrote: > Hmm. So, the attached test case fails amcheck verification for me with > the latest version of the patch: Attached is a revised version of your v2 that fixes this issue -- I'll call this v3. In general, my goal for the revision was to make sure that all of my old tests from the v12 work passed, and to make sure that amcheck can detect almost any possible problem. I tested the amcheck changes by corrupting random state in a test index using pg_hexedit, then making sure that amcheck actually complained in each case. I also fixed one or two bugs in passing, including the bug that caused an assertion failure in _bt_truncate(). That was down to a subtle off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't make that many changes to your v2. There are probably some things about the patch that I still don't understand, or things that I have misunderstood. Other changes: * We now support system catalog indexes. There is no reason not to support them. * Removed unnecessary code from _bt_buildadd(). * Added my own new DEBUG4 trace to _bt_insertonpg_in_posting(), which I used to fix that bug I mentioned. I agree that we should keep the DEBUG4 traces around until the overall design settles down. I found the ones that you added helpful, too. * Added quite a few new assertions. For example, we need to still support !heapkeyspace (pre Postgres 12) nbtree indexes, but we cannot let them use compression -- new defensive assertions were added to make this break loudly. * Changed the custom binary search code within _bt_compare_posting() to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know of any reason not to do it that way? * Added quite a few "FIXME"/"XXX" comments at various points, to indicate where I have general concerns that need more discussion. * Included my own pageinspect hack to visualize the minimum TIDs in posting lists. It's broken out into a separate patch file. 
The code is very rough, but it might help someone else, so I thought I'd include it. I also have some new concerns about the code in the patch that I will point out now (though only as something to think about a solution on -- I am unsure myself): * It's a bad sign that compression involves calls to PageAddItem() that are allowed to fail (we just give up on compression when that happens). For one thing, all existing calls to PageAddItem() in Postgres are never expected to fail -- if they do fail we get a "can't happen" error that suggests corruption. It was a good idea to take this approach to get the patch to work, and to prove the general idea, but we now need to fully work out all the details about the use of space. This includes complicated new questions around how alignment is supposed to work. Alignment in nbtree is already complicated today -- you're supposed to MAXALIGN() everything in nbtree, so that the MAXALIGN() within bufpage.c routines cannot be different to the lp_len/IndexTupleSize() length (note that heapam can have tuples whose lp_len isn't aligned, so nbtree could do it differently if it proved useful). Code within nbtsplitloc.c fully understands the space requirements for the bufpage.c routines, and is very careful about it. (The bufpage.c details are supposed to be totally hidden from code like nbtsplitloc.c, but I guess that that ideal isn't quite possible in reality. Code comments don't really explain the situation today.) I'm not sure what it would look like for this patch to be as precise about free space as nbtsplitloc.c already is, even though that seems desirable (I just know that it would mean you would trust PageAddItem() to work in all cases). The patch is different to what we already have today in that it tries to add *less than* a single MAXALIGN() quantum at a time in some places (when a posting list needs to grow by one item). The devil is in the details. * As you know, the current approach to WAL logging is very inefficient. 
It's okay for now, but we'll need a fine-grained approach for the patch to be commitable. I think that this is subtly related to the last item (i.e. the one about alignment). I have done basic performance tests using unlogged tables. The patch seems to either make big INSERT queries run as fast or faster than before when inserting into unlogged tables, which is a very good start. * Since we can now split a posting list in two, we may also have to reconsider BTMaxItemSize, or some similar mechanism that worries about extreme cases where it becomes impossible to split because even two pages are not enough to fit everything. Think of what happens when there is a tuple with a single large datum, that gets split in two (the tuple is split, not the page), with each half receiving its own copy of the datum. I haven't proven to myself that this is broken, but that may just be because I haven't spent any time on it. OTOH, maybe you already have it right, in which case it seems like it should be
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Jul 19, 2019 at 12:32 PM Peter Geoghegan wrote: > On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova > wrote: > > Patch 0002 (must be applied on top of 0001) implements preserving of > > correct TID order > > inside posting list when inserting new tuples. > > This version passes all regression tests including amcheck test. > > I also used following script to test insertion into the posting list: > > Nice! Hmm. So, the attached test case fails amcheck verification for me with the latest version of the patch: $ psql -f amcheck-compress-test.sql DROP TABLE CREATE TABLE CREATE INDEX CREATE EXTENSION INSERT 0 2001 psql:amcheck-compress-test.sql:6: ERROR: down-link lower bound invariant violated for index "idx_desc_nl" DETAIL: Parent block=3 child index tid=(2,2) parent page lsn=10/F87A3438. Note that this test only has an INSERT statement. You have to use bt_index_parent_check() to see the problem -- bt_index_check() will not detect the problem. -- Peter Geoghegan amcheck-compress-test.sql Description: Binary data
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova wrote:
> Patch 0002 (must be applied on top of 0001) implements preserving of
> correct TID order inside posting list when inserting new tuples.
> This version passes all regression tests including amcheck test.
> I also used following script to test insertion into the posting list:

Nice!

> I suppose it is not the final version of the patch yet,
> so I left some debug messages and TODO comments to ease review.

I'm fine with leaving them in. I have sometimes distributed a separate patch with debug messages, but now that I think about it, that probably wasn't a good use of time. You will probably want to remove at least some of the debug messages during performance testing. I'm thinking of code that appears in very tight inner loops, such as the _bt_compare() code.

> Please, in your review, pay particular attention to usage of
> BTreeTupleGetHeapTID. For posting tuples it returns the first tid
> from posting list like BTreeTupleGetMinTID, but maybe some callers
> are not ready for that and want BTreeTupleGetMaxTID instead.
> Incorrect usage of these macros may cause some subtle bugs,
> which are probably not covered by tests. So, please double-check it.

One testing strategy that I plan to use for the patch is to deliberately corrupt a compressed index in a subtle way using pg_hexedit, and then see if amcheck detects the problem. For example, I may swap the order of two TIDs in the middle of a posting list, which is something that is unlikely to produce wrong answers to queries, and won't even be detected by the "heapallindexed" check, but is still wrong. If we can detect very subtle, adversarial corruption like this, then we can detect any real-world problem. Once we have confidence in amcheck's ability to detect problems with posting lists in general, we can use it in many different contexts without much thought.

For example, we'll probably need to do long running benchmarks to validate the performance of the patch. It's easy to add amcheck testing at the end of each run. Every benchmark is now also a correctness/stress test, for free.

> Next week I'm going to check performance and try to find specific
> scenarios where this feature can lead to degradation and measure it,
> to understand if we need to make this deduplication optional.

Sounds good, though I think it might be a bit too early to decide whether or not it needs to be enabled by default. For one thing, the approach to WAL-logging within _bt_compress_one_page() is probably fairly inefficient, which may be a problem for certain workloads. It's okay to leave it that way for now, because it is not relevant to the core design of the patch. I'm sure that _bt_compress_one_page() can be carefully optimized when the time comes.

My current focus is not on the raw performance itself. For now, I am focussed on making sure that the compression works well, and that the resulting indexes "look nice" in general.

FWIW, the first few versions of my v12 work on nbtree didn't actually make *anything* go faster. It took a couple of months to fix the more important regressions, and a few more months to fix all of them. I think that the work on this patch may develop in a similar way. I am willing to accept regressions in the unoptimized code during development because it seems likely that you have the right idea about the data structure itself, which is the one thing that I *really* care about. Once you get that right, the remaining problems are very likely to either be fixable with further work on optimizing specific code, or a price that users will mostly be happy to pay to get the benefits.

--
Peter Geoghegan
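The TID-order invariant that the corruption test above targets can be modeled in a few lines. This is a toy Python sketch, not PostgreSQL code; representing a heap TID as a (block, offset) pair is an assumption made only for illustration:

```python
# Toy model of the amcheck invariant discussed above: heap TIDs inside
# a posting list must appear in strictly increasing order. Swapping two
# middle TIDs (the hex-editor corruption described in the message) is
# exactly what this check catches.

def tid_key(tid):
    """Order heap TIDs as (block number, offset number), like ItemPointerCompare."""
    block, offset = tid
    return (block, offset)

def posting_list_is_valid(tids):
    """Return True iff every TID is strictly greater than its predecessor."""
    return all(tid_key(a) < tid_key(b) for a, b in zip(tids, tids[1:]))

# A well-formed posting list for one duplicate value:
good = [(2, 1), (2, 7), (3, 4)]
# The same list with two TIDs swapped -- subtle corruption that would
# not produce wrong query answers, but is still wrong:
bad = [(2, 1), (3, 4), (2, 7)]

assert posting_list_is_valid(good)
assert not posting_list_is_valid(bad)
```

The point of the sketch is only that the invariant is cheap to verify once the posting-list representation is known, which is why such corruption is a good adversarial test for amcheck.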
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
17.07.2019 19:36, Anastasia Lubennikova:
> There is one major issue left - preserving TID order in posting lists.
> For a start, I added a sort into BTreeFormPostingTuple function.
> It turned out to be not very helpful, because we cannot check this
> invariant lazily.
> Now I work on patching _bt_binsrch_insert() and _bt_insertonpg()
> to implement insertion into the middle of the posting list.
> I'll send a new version this week.

Patch 0002 (must be applied on top of 0001) implements preserving of correct TID order inside the posting list when inserting new tuples. This version passes all regression tests, including the amcheck test.

I also used the following script to test insertion into the posting list:

set client_min_messages to debug4;
drop table tbl;
create table tbl (i1 int, i2 int);
insert into tbl select 1, i from generate_series(0,1000) as i;
insert into tbl select 1, i from generate_series(0,1000) as i;
create index idx on tbl (i1);
delete from tbl where i2 < 500;
vacuum tbl;
insert into tbl select 1, i from generate_series(1001, 1500) as i;

The last insert triggers several insertions that can be seen in debug messages.

I suppose it is not the final version of the patch yet, so I left some debug messages and TODO comments to ease review.

Please, in your review, pay particular attention to the usage of BTreeTupleGetHeapTID. For posting tuples it returns the first TID from the posting list, like BTreeTupleGetMinTID, but maybe some callers are not ready for that and want BTreeTupleGetMaxTID instead. Incorrect usage of these macros may cause some subtle bugs, which are probably not covered by tests. So, please double-check it.

Next week I'm going to check performance and try to find specific scenarios where this feature can lead to degradation and measure it, to understand if we need to make this deduplication optional.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9126c18..2b05b1e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1033,12 +1033,34 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;

-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+				int			i;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}

 		/*
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..26ddf32 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);

 /*
  * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 	}
 	else
 	{
@@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel,
 		}

 		/*
+		 * Before considering moving right, try to compress the page
+		 */
+		_bt_compress_one_page(rel, insertstate->buf, heapRel);
+
+		/*
 		 * Nope, so check conditions (b) and (c) enumerated above
 		 *
 		 * The earlier _bt_check_unique() call may well have established a
@@ -2286,3 +2300,241 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page,
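The _bt_binsrch_insert()/_bt_insertonpg() work this patch describes has to splice a new duplicate's heap TID into the middle of an existing posting list whenever the new TID falls inside the list's min/max range. A toy Python sketch of that splice, under the same illustrative (block, offset) TID representation (none of these names are PostgreSQL's):

```python
# Sketch of inserting a new duplicate's TID into an existing posting
# list while preserving TID order. bisect works directly on tuples
# because Python compares them lexicographically, i.e. block number
# first, then offset -- the same order as heap TIDs.

import bisect

def posting_list_insert(tids, new_tid):
    """Return a new posting list with new_tid spliced in at its TID-order position."""
    pos = bisect.bisect_left(tids, new_tid)
    return tids[:pos] + [new_tid] + tids[pos:]

posting = [(2, 1), (2, 7), (3, 4)]
# (2, 5) falls strictly inside the list's min/max TID range, so it
# must land in the middle, not be appended:
assert posting_list_insert(posting, (2, 5)) == [(2, 1), (2, 5), (2, 7), (3, 4)]
```

The real code has the additional complication that the expanded tuple may no longer fit on the page, which is where the page-compression and split logic above comes in.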
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
11.07.2019 21:19, Peter Geoghegan wrote:
> On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih wrote:
> > Hi,

Peter, Rafia, thanks for the review. New version is attached.

> > + elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
> > IndexTupleSize %zu free %zu",
> > + compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
> > +
> > and other such DEBUG4 statements are meant to be removed, right...?
>
> I hope so too.

Yes, these messages are only for debugging. I haven't deleted them, since this is still work in progress and it's handy to be able to print inner details. Maybe I should also write a patch for pageinspect.

> > /*
> >  * If we have only 10 uncompressed items on the full page, it probably
> >  * won't worth to compress them.
> >  */
> > if (maxoff - n_posting_on_page < 10)
> >     return;
> >
> > Is this a magic number...?
>
> I think that this should be a constant or something.

Fixed. Now this is a constant in nbtree.h. I'm not 100% sure about the value. When the code stabilizes, we can benchmark it and find the optimal value.

> > /*
> >  * We do not expect to meet any DEAD items, since this function is
> >  * called right after _bt_vacuum_one_page(). If for some reason we
> >  * found dead item, don't compress it, to allow upcoming microvacuum
> >  * or vacuum clean it up.
> >  */
> > if (ItemIdIsDead(itemId))
> >     continue;
> >
> > This makes me wonder about those 'some' reasons.
>
> I think that this is just defensive. Note that _bt_vacuum_one_page()
> is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag
> is set for the page.

You are right, right now it is impossible to meet dead items in this function. Though that could change in the future if, for example, _bt_vacuum_one_page() starts to behave lazily. So this is just a sanity check. Maybe it's worth moving it into an Assert.

> > Caller is responsible for checking BTreeTupleIsPosting to ensure that
> > + * he will get what he expects
> >
> > This can be re-framed to make the caller more gender neutral.
>
> Agreed. I also don't like anthropomorphizing code like this.

Fixed.

> > Other than that, I am curious about the plans for its backward compatibility.
>
> Me too. There is something about a new version 5 in comments in
> nbtree.h, but the version number isn't changed. I think that we may be
> able to get away with not increasing the B-Tree version from 4 to 5,
> actually. Deduplication is performed lazily when it looks like we
> might have to split the page, so there isn't any expectation that
> tuples will either be compressed or uncompressed in any context.

The current implementation is backward compatible. To distinguish posting tuples, it only adds one new flag combination. This combination was never possible before. The comment about version 5 has been deleted.

I also added a patch for amcheck.

There is one major issue left - preserving TID order in posting lists. For a start, I added a sort into the BTreeFormPostingTuple function. It turned out to be not very helpful, because we cannot check this invariant lazily. Now I am working on patching _bt_binsrch_insert() and _bt_insertonpg() to implement insertion into the middle of the posting list. I'll send a new version this week.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9126c18..2b05b1e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1033,12 +1033,34 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;

-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+				int			i;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}

 		/*
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..26ddf32 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 10:42 AM Peter Geoghegan wrote:
> > I think unique indexes may benefit from deduplication not only because
> > of NULL values. Non-HOT updates produce duplicates of non-NULL values
> > in unique indexes. And those duplicates can take significant space.
>
> I agree that we should definitely have an open mind about unique
> indexes, even with non-NULL values. If we can prevent a page split by
> deduplicating the contents of a unique index page, then we'll probably
> win. Why not try? This will need to be tested.

I thought about this some more. I believe that the LP_DEAD bit setting within _bt_check_unique() is generally more important than the more complicated kill_prior_tuple mechanism for setting LP_DEAD bits, even though the _bt_check_unique() thing can only be used with unique indexes. Also, I have often thought that we don't do enough to take advantage of the special characteristics of unique indexes -- they really are quite different. I believe that other database systems do this in various ways. Maybe we should too.

Unique indexes are special because there can only ever be zero or one tuples of the same value that are visible to any possible MVCC snapshot. Within the index AM, there is little difference between an UPDATE by a transaction and a DELETE + INSERT of the same value by a transaction. If there are 3 or 5 duplicates within a unique index, then there is a strong chance that VACUUM could reclaim some of them, given the chance. It is worth going to a little effort to find out.

In a traditional serial/bigserial primary key, the key space that is typically "owned" by the left half of a rightmost page split describes a range of about ~366 items, with few or no gaps for other values that didn't exist at the time of the split (i.e. the two pivot tuples on each side cover a range that is equal to the number of items itself). If the page ever splits again, the chances of it being due to non-HOT updates is perhaps 100%. Maybe VACUUM just didn't get around to the index in time, or maybe there is a long running xact, or whatever. If we can delay page splits in indexes like this, then we could easily prevent them from *ever* happening.

Our first line of defense against page splits within unique indexes will probably always be LP_DEAD bits set within _bt_check_unique(), because it costs so little -- same as today. We could also add a second line of defense: deduplication -- same as with non-unique indexes with the patch. But we can even add a third line of defense on top of those two: more aggressive reclaiming of posting list space, by going to the heap to check the visibility status of earlier posting list entries. We can do this optimistically when there is no LP_DEAD bit set, based on heuristics.

The high level principle here is that we can justify going to a small amount of extra effort for the chance to avoid a page split, and maybe even more than a small amount. Our chances of reversing the split by merging pages later on are almost zero. The two halves of the split will probably each get dirtied again and again in the future if we cannot avoid it, plus we have to dirty the parent page, and the old sibling page (to update its left link). In general, a page split is already really expensive.

We could do something like amortize the cost of accessing the heap a second time for tuples that we won't have considered setting the LP_DEAD bit on within _bt_check_unique() by trying the *same* heap page a *second* time where possible (distinct values are likely to be nearby on the same page). I think that an approach like this could work quite well for many workloads. You only pay a cost (visiting the heap an extra time) when it looks like you'll get a benefit (not splitting the page).

As you know, Andres already changed nbtree to get an XID for conflict purposes on the primary by visiting the heap a second time (see commit 558a9165e08), when we need to actually reclaim LP_DEAD space. I anticipated that we could extend this to do more clever/eager/lazy cleanup of additional items before that went in, which is a closely related idea. See:

https://www.postgresql.org/message-id/flat/CAH2-Wznx8ZEuXu7BMr6cVpJ26G8OSqdVo6Lx_e3HSOOAU86YoQ%40mail.gmail.com#46ffd6f32a60e086042a117f2bfd7df7

I know that this is a bit hand-wavy; the details certainly need to be worked out. However, it is not so different to the "ghost bit" design that other systems use with their non-unique indexes (though this idea applies specifically to unique indexes in our case). The main difference is that we're going to the heap rather than to UNDO, because that's where we store our visibility information. That doesn't seem like such a big difference -- we are also reasonably confident that we'll find that the TID is dead, even without LP_DEAD bits being set, because we only do the extra stuff with unique indexes. And, we do it lazily.

--
Peter Geoghegan
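The amortization idea in the message above -- visit each heap page at most once when probing the visibility of posting-list entries -- boils down to grouping TIDs by heap block before going to the heap. A toy Python sketch of just that grouping step; the structures and names are illustrative, not PostgreSQL's:

```python
# Group heap TIDs by heap block number, so that a hypothetical
# visibility-probing pass touches each heap page at most once.
# Duplicates in a unique index are likely to be near each other on the
# same heap page, which is what makes this grouping pay off.

from collections import defaultdict

def group_tids_by_heap_block(tids):
    """Map each heap block number to the list of offsets to probe on that page."""
    by_block = defaultdict(list)
    for block, offset in tids:
        by_block[block].append(offset)
    return dict(by_block)

tids = [(7, 1), (7, 4), (8, 2), (7, 9)]
# Two heap page visits instead of four:
assert group_tids_by_heap_block(tids) == {7: [1, 4, 9], 8: [2]}
```

The real heuristic would also have to decide *when* the extra heap visits are worth it (i.e. only when a page split otherwise looks imminent), which this sketch deliberately leaves out.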
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih wrote:
> I tried this patch and found the improvements impressive. However,
> when I tried with multi-column indexes it wasn't giving any
> improvement, is it the known limitation of the patch?

It'll only deduplicate full duplicates. It works with multi-column indexes, provided the entire set of values is duplicated -- not just a prefix. Prefix compression is possible, but it's more complicated. It seems to generally require the DBA to specify a prefix length, expressed as a number of prefix columns.

> I am surprised to find that such a patch is on radar since quite some
> years now and not yet committed.

The v12 work on nbtree (making heap TID a tiebreaker column) seems to have made the general approach a lot more effective. Compression is performed lazily, not eagerly, which seems to work a lot better.

> + elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
> IndexTupleSize %zu free %zu",
> + compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
> +
> and other such DEBUG4 statements are meant to be removed, right...?

I hope so too.

> /*
>  * If we have only 10 uncompressed items on the full page, it probably
>  * won't worth to compress them.
>  */
> if (maxoff - n_posting_on_page < 10)
>     return;
>
> Is this a magic number...?

I think that this should be a constant or something.

> /*
>  * We do not expect to meet any DEAD items, since this function is
>  * called right after _bt_vacuum_one_page(). If for some reason we
>  * found dead item, don't compress it, to allow upcoming microvacuum
>  * or vacuum clean it up.
>  */
> if (ItemIdIsDead(itemId))
>     continue;
>
> This makes me wonder about those 'some' reasons.

I think that this is just defensive. Note that _bt_vacuum_one_page() is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag is set for the page.

> Caller is responsible for checking BTreeTupleIsPosting to ensure that
> + * he will get what he expects
>
> This can be re-framed to make the caller more gender neutral.

Agreed. I also don't like anthropomorphizing code like this.

> Other than that, I am curious about the plans for its backward compatibility.

Me too. There is something about a new version 5 in comments in nbtree.h, but the version number isn't changed. I think that we may be able to get away with not increasing the B-Tree version from 4 to 5, actually. Deduplication is performed lazily when it looks like we might have to split the page, so there isn't any expectation that tuples will either be compressed or uncompressed in any context.

--
Peter Geoghegan
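The "full duplicates only" rule explained above -- merge index tuples only when *all* key columns are equal, never just a prefix -- can be sketched in a few lines. Toy Python model with (key_columns, tid) pairs; the representation is illustrative, not the patch's on-disk format:

```python
# Model of the deduplication rule: adjacent index tuples merge into one
# posting tuple only when their whole key (every column) is equal.
# Input must already be in index order, as it would be on a B-tree page.

from itertools import groupby

def deduplicate(sorted_tuples):
    """Merge runs of identical keys into (key, posting_list) pairs."""
    return [(key, [tid for _, tid in grp])
            for key, grp in groupby(sorted_tuples, key=lambda t: t[0])]

tuples = [(("a", 1), (2, 1)),
          (("a", 1), (2, 2)),
          (("a", 2), (2, 3))]
# The two ("a", 1) rows merge; ("a", 2) shares only a prefix with them,
# so it stays a separate tuple -- no prefix compression:
assert deduplicate(tuples) == [(("a", 1), [(2, 1), (2, 2)]),
                               (("a", 2), [(2, 3)])]
```

This is also why a multi-column index with distinct trailing columns sees no benefit, matching Rafia's observation.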
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 8:09 AM Alexander Korotkov wrote:
> BTW, I think deduplication could cause some small performance
> degradation in some particular cases, because page-level locks became
> more coarse grained once pages hold more tuples. However, this
> doesn't seem like something we should much care about. Providing an
> option to turn deduplication off looks enough for me.

There was an issue like this with my v12 work on nbtree, with the TPC-C indexes. They were always ~40% smaller, but there was a regression when TPC-C was used with a small number of warehouses, when the data could easily fit in memory (which is not allowed by the TPC-C spec, in effect). TPC-C is very write-heavy, which combined with everything else causes this problem. I wasn't doing anything too fancy there -- the regression seemed to happen simply because the index was smaller, not because of the overhead of doing page splits differently or anything like that (there were far fewer splits).

I expect there to be some regression for workloads like this. I am willing to accept that provided it's not too noticeable, and doesn't have an impact on other workloads. I am optimistic about it.

> Regarding bitmap indexes itself, I think our BRIN could provide them.
> However, it would be useful to have opclass parameters to make them
> tunable.

I thought that we might implement them in nbtree myself. But we don't need to decide now.

--
Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 8:02 AM Alexander Korotkov wrote:
> Could you please elaborate more on preserving the logical contents? I
> can understand it as following: "B-Tree should have the same structure
> and invariants as if each TID in posting list be a separate tuple".

That's exactly what I mean.

> So, if we imagine each TID to become separate tuple it would be the
> same B-tree, which just can magically sometimes store more tuples in
> page. Is my understanding correct?

Yes.

> But outside code will still notice changes as soon as it directly
> accesses B-tree pages (like contrib/amcheck does). Do you mean we
> need an API for accessing logical B-tree tuples or something?

Well, contrib/amcheck isn't really outside code. But amcheck's "rootdescend" option will still need to be able to supply a heap TID as just another column, and get back zero or one logical tuples from the index. This is important because retail index tuple deletion needs to be able to think about logical tuples in the same way. I also think that it might be useful for the planner to expect to get back duplicates in heap TID order in the future (or in reverse order in the case of a backwards scan). Query execution and VACUUM code outside of nbtree should be able to pretend that there is no such thing as a posting list.

The main thing that the patch is missing that is needed to "preserve logical contents" is the ability to update/expand an *existing* posting list due to a retail insertion of a new duplicate that happens to be within the range of that existing posting list. This will usually be a non-HOT update that doesn't change the value for the row in the index -- that must change the posting list, even when there is available space on the page without recompressing. We must still occasionally be eager, like GIN always is, though in practice we'll almost always add to posting lists in a lazy fashion, when it looks like we might have to split the page -- the lazy approach seems to perform best.

> I think in order to deduplicate "equal but distinct" values we need at
> least to give up with index only scans. Because we have no
> restriction that equal according to B-tree opclass values are same for
> other operations and/or user output.

We can either prevent index-only scans in the case of affected indexes, or prevent compression, or give the user a choice. I'm not too worried about how that will work for users just yet.

> Do I understand correctly that current patch may produce posting lists
> of the same value with overlapping ranges of TIDs? If so, it's
> definitely wrong.

Yes, it can, since the assertion fails. It looks like the assertion itself was changed to match what I expect, so I assume that this bug will be fixed in the next version of the patch. It fails with a fairly big index on text for me.

> > * Maybe we could do compression with unique indexes when inserting
> > values with NULLs? Note that we now treat an insertion of a tuple with
> > NULLs into a unique index as if it wasn't even a unique index -- see
> > the "checkingunique" optimization at the beginning of _bt_doinsert().
> > Having many NULL values in a unique index is probably fairly common.
>
> I think unique indexes may benefit from deduplication not only because
> of NULL values. Non-HOT updates produce duplicates of non-NULL values
> in unique indexes. And those duplicates can take significant space.

I agree that we should definitely have an open mind about unique indexes, even with non-NULL values. If we can prevent a page split by deduplicating the contents of a unique index page, then we'll probably win. Why not try? This will need to be tested.

--
Peter Geoghegan
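"Preserving the logical contents" as discussed above means code outside nbtree can treat a posting tuple with N TIDs as N separate logical tuples. A toy Python sketch of that expansion; the (key, [tids]) physical representation is an assumption for illustration only:

```python
# Expand "physical" page items into the "logical" tuples that query
# execution, VACUUM, and amcheck's rootdescend verification should be
# able to reason about: one (key, tid) pair per posting-list entry.

def expand_logical_tuples(physical_items):
    """Flatten (key, [tids]) physical items into (key, tid) logical tuples."""
    return [(key, tid)
            for key, tids in physical_items
            for tid in tids]

page = [("apple", [(1, 1), (1, 2)]),
        ("pear", [(2, 5)])]
assert expand_logical_tuples(page) == [("apple", (1, 1)),
                                       ("apple", (1, 2)),
                                       ("pear", (2, 5))]
```

Under this view, a rootdescend-style lookup that supplies both a key and a heap TID should find exactly zero or one logical tuples, regardless of how the physical posting lists are laid out.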
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 7:30 AM Bruce Momjian wrote:
> Wow, I never thought of that. The only things I know we lock until
> transaction end are rows we update (against concurrent updates), and
> additions to unique indexes. By definition, indexes with many
> duplicates are not unique, so that doesn't apply.

Right. Another advantage of their approach is that you can make queries like this work:

UPDATE tab SET unique_col = unique_col + 1

This will not throw a unique violation error on most/all other DB systems when the updated column (in this case "unique_col") has a unique constraint/is the primary key. This behavior is actually required by the SQL standard. An SQL statement is supposed to be all-or-nothing, which Postgres doesn't quite manage here.

The section "6.6 Interdependencies of Transactional Storage" from the paper "Architecture of a Database System" provides additional background information (I should have suggested reading both 6.6 and 6.7 together).

--
Peter Geoghegan
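The statement-atomicity point above can be made concrete with a toy model: checking uniqueness row by row trips over a transient duplicate that the statement's final state never contains. This is a simplified Python illustration, not how any real database implements constraint checking:

```python
# "UPDATE tab SET unique_col = unique_col + 1" on values {1, 2, 3}:
# a per-row uniqueness check fails on the transient duplicate "2",
# while checking only the statement's end state succeeds, which is the
# all-or-nothing behavior the SQL standard asks for.

def update_increment(rows, deferred):
    """Increment every value; enforce uniqueness per row or at statement end."""
    seen = set(rows)
    out = []
    for value in rows:
        seen.discard(value)          # the old value is being replaced
        new = value + 1
        if not deferred and new in seen:
            raise ValueError("duplicate key value: %d" % new)
        seen.add(new)
        out.append(new)
    if deferred and len(set(out)) != len(out):
        raise ValueError("duplicate key value at end of statement")
    return out

# End-of-statement check: the final state {2, 3, 4} is unique, so it works.
assert update_increment([1, 2, 3], deferred=True) == [2, 3, 4]
# Per-row check: incrementing 1 to 2 collides with the not-yet-updated 2.
try:
    update_increment([1, 2, 3], deferred=False)
except ValueError as err:
    print(err)  # duplicate key value: 2
```

In Postgres terms, a deferrable unique constraint gets the standard's behavior at some extra cost, while the default immediate check behaves like the per-row case.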
Fwd: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Sun, 7 Jul 2019 at 01:08, Peter Geoghegan wrote:
> * Maybe we could do compression with unique indexes when inserting
> values with NULLs? Note that we now treat an insertion of a tuple with

+1

I tried this patch and found the improvements impressive. However, when I tried it with multi-column indexes it wasn't giving any improvement -- is that a known limitation of the patch?

I am surprised to find that such a patch has been on the radar for quite some years now and is not yet committed.

Going through the patch, here are a few comments from me,

/* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+

and other such DEBUG4 statements are meant to be removed, right...? Just because I didn't find any other such statements in this API and there are many in this patch, so I am not sure how much they are needed.

/*
 * If we have only 10 uncompressed items on the full page, it probably
 * won't worth to compress them.
 */
if (maxoff - n_posting_on_page < 10)
    return;

Is this a magic number...?

/*
 * We do not expect to meet any DEAD items, since this function is
 * called right after _bt_vacuum_one_page(). If for some reason we
 * found dead item, don't compress it, to allow upcoming microvacuum
 * or vacuum clean it up.
 */
if (ItemIdIsDead(itemId))
    continue;

This makes me wonder about those 'some' reasons.

Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects

This can be re-framed to make the caller more gender neutral.

Other than that, I am curious about the plans for its backward compatibility.

--
Regards,
Rafia Sabih
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 11, 2019 at 7:53 AM Peter Geoghegan wrote:
> Anyway, I think that *hundreds* or even *thousands* of rows are
> effectively locked all at once when a bitmap index needs to be updated
> in these other systems -- and I mean a heavyweight lock that lasts
> until the xact commits or aborts, like a Postgres row lock. As I said,
> this is necessary simply because the transaction might need to roll
> back. Of course, your patch never needs to do anything like that --
> the only risk is that buffer lock contention will be increased. Maybe
> VACUUM isn't so bad after all!
>
> Doing deduplication adaptively and automatically in nbtree seems like
> it might play to the strengths of Postgres, while also ameliorating
> its weaknesses. As the same paper goes on to say, it's actually quite
> unusual that PostgreSQL has *transactional* full text search built in
> (using GIN), and offers transactional, high concurrency spatial
> indexing (using GiST). Actually, this is an additional advantage of
> our "pure" approach to MVCC -- we can add new high concurrency,
> transactional access methods relatively easily.

Good finding, thank you!

BTW, I think deduplication could cause some small performance degradation in some particular cases, because page-level locks become more coarse grained once pages hold more tuples. However, this doesn't seem like something we should much care about. Providing an option to turn deduplication off seems enough to me.

Regarding bitmap indexes themselves, I think our BRIN could provide them. However, it would be useful to have opclass parameters to make them tunable.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Hi Peter, Thank you very much for your attention to this patch. Let me comment some points of your message. On Sun, Jul 7, 2019 at 2:09 AM Peter Geoghegan wrote: > On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova > wrote: > > The new version of the patch is attached. > > This version is even simpler than the previous one, > > thanks to the recent btree design changes and all the feedback I received. > > I consider it ready for review and testing. > > I took a closer look at this patch, and have some general thoughts on > its design, and specific feedback on the implementation. > > Preserving the *logical contents* of B-Tree indexes that use > compression is very important -- that should not change in a way that > outside code can notice. The heap TID itself should count as logical > contents here, since we want to be able to implement retail index > tuple deletion in the future. Even without retail index tuple > deletion, amcheck's "rootdescend" verification assumes that it can > find one specific tuple (which could now just be one specific "logical > tuple") using specific key values from the heap, including the heap > tuple's heap TID. This requirement makes things a bit harder for your > patch, because you have to deal with one or two edge-cases that you > currently don't handle: insertion of new duplicates that fall inside > the min/max range of some existing posting list. That should be rare > enough in practice, so the performance penalty won't be too bad. This > probably means that code within _bt_findinsertloc() and/or > _bt_binsrch_insert() will need to think about a logical tuple as a > distinct thing from a physical tuple, though that won't be necessary > in most places. Could you please elaborate more on preserving the logical contents? I can understand it as following: "B-Tree should have the same structure and invariants as if each TID in posting list be a separate tuple". 
So, if we imagine each TID becoming a separate tuple, it would be the same B-tree, which just can sometimes magically store more tuples per page. Is my understanding correct? But outside code will still notice changes as soon as it directly accesses B-tree pages (like contrib/amcheck does). Do you mean we need an API for accessing logical B-tree tuples or something? > The need to "preserve the logical contents" also means that the patch > will need to recognize when indexes are not safe as a target for > compression/deduplication (maybe we should call this feature > deduplication, so it's clear how it differs from TOAST?). For > example, if we have a case-insensitive ICU collation, then it is not > okay to treat an opclass-equal pair of text strings that use the > collation as having the same value when considering merging the two > into one. You don't actually do that in the patch, but you also don't > try to deal with the fact that such a pair of strings are equal, and > so must have their final positions determined by the heap TID column > (deduplication within _bt_compress_one_page() must respect that). > Possibly equal-but-distinct values seems like a problem that's not > worth truly fixing, but it will be necessary to store metadata about > whether or not we're willing to do deduplication in the meta page, > based on operator class and collation details. That seems like a > restriction that we're just going to have to accept, though I'm not > too worried about exactly what that will look like right now. We can > work it out later. I think that in order to deduplicate "equal but distinct" values we would at least have to give up on index-only scans, because there is no guarantee that values that are equal according to a B-tree opclass are the same for other operations and/or user output. > I think that the need to be careful about the logical contents of > indexes already causes bugs, even with "safe for compression" indexes. 
> For example, I can sometimes see an assertion failure within _bt_truncate(), at the point where we check if heap TID values are safe:
>
> /*
>  * Lehman and Yao require that the downlink to the right page, which is to
>  * be inserted into the parent page in the second phase of a page split be
>  * a strict lower bound on items on the right page, and a non-strict upper
>  * bound for items on the left page. Assert that heap TIDs follow these
>  * invariants, since a heap TID value is apparently needed as a
>  * tiebreaker.
>  */
> #ifndef DEBUG_NO_TRUNCATE
> Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
>                           BTreeTupleGetMinTID(firstright)) < 0);
> ...
>
> This bug is not that easy to see, but it will happen with a big index, even without updates or deletes. I think that this happens because compression can allow the "logical tuples" to be in the wrong heap TID order when there are multiple posting lists for the same value. As I said, I think that it's necessary to see a posting list as being comprised of multiple
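The physical-versus-logical distinction being discussed can be made concrete with a toy sketch. Everything below is illustrative only — simplified stand-in structs, not the patch's actual IndexTuple or posting-list layout:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins only -- not the patch's on-disk structures */
typedef struct PhysicalTuple
{
	int		key;		/* stands in for the indexed key values */
	size_t	ntids;		/* number of TIDs in the posting list */
	int		tids[4];	/* stands in for ItemPointerData posting_list[] */
} PhysicalTuple;

typedef struct LogicalTuple
{
	int		key;
	int		tid;
} LogicalTuple;

/*
 * Expand physical (possibly posting-list) tuples into the logical
 * (key, TID) tuples they represent; returns the logical tuple count.
 */
static size_t
expand_logical(const PhysicalTuple *phys, size_t nphys, LogicalTuple *out)
{
	size_t	n = 0;

	for (size_t i = 0; i < nphys; i++)
		for (size_t j = 0; j < phys[i].ntids; j++)
		{
			out[n].key = phys[i].key;
			out[n].tid = phys[i].tids[j];
			n++;
		}
	return n;
}
```

Under this model, the index is "the same B-tree" whether or not deduplication ran: expanding every page's physical tuples must yield the identical ordered sequence of logical (key, TID) pairs.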
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Jul 10, 2019 at 09:53:04PM -0700, Peter Geoghegan wrote: > Anyway, I think that *hundreds* or even *thousands* of rows are > effectively locked all at once when a bitmap index needs to be updated > in these other systems -- and I mean a heavyweight lock that lasts > until the xact commits or aborts, like a Postgres row lock. As I said, > this is necessary simply because the transaction might need to roll > back. Of course, your patch never needs to do anything like that -- > the only risk is that buffer lock contention will be increased. Maybe > VACUUM isn't so bad after all! > > Doing deduplication adaptively and automatically in nbtree seems like > it might play to the strengths of Postgres, while also ameliorating > its weaknesses. As the same paper goes on to say, it's actually quite > unusual that PostgreSQL has *transactional* full text search built in > (using GIN), and offers transactional, high concurrency spatial > indexing (using GiST). Actually, this is an additional advantage of > our "pure" approach to MVCC -- we can add new high concurrency, > transactional access methods relatively easily. Wow, I never thought of that. The only things I know we lock until transaction end are rows we update (against concurrent updates), and additions to unique indexes. By definition, indexes with many duplicates are not unique, so that doesn't apply. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Sat, Jul 6, 2019 at 4:08 PM Peter Geoghegan wrote: > I took a closer look at this patch, and have some general thoughts on > its design, and specific feedback on the implementation. I have some high level concerns about how the patch might increase contention, which could make queries slower. Apparently that is a real problem in other systems that use MVCC when their bitmap index feature is used -- they are never really supposed to be used with OLTP apps. This patch makes nbtree behave rather a lot like a bitmap index. That's not exactly true, because you're not storing a bitmap or compressing the TID lists, but they're definitely quite similar. It's easy to imagine a hybrid approach, that starts with a B-Tree with deduplication/TID lists, and eventually becomes a bitmap index as more duplicates are added [1]. It doesn't seem like it would be practical for these other MVCC database systems to have standard B-Tree secondary indexes that compress duplicates gracefully in the way that you propose to with this patch, because lock contention would presumably be a big problem for the same reason as it is with their bitmap indexes (whatever the true reason actually is). Is it really possible to have something that's adaptive, offering the best of both worlds? Having dug into it some more, I think that the answer for us might actually be "yes, we can have it both ways". Other database systems that are also based on MVCC still probably use a limited form of index locking, even in READ COMMITTED mode, though this isn't very widely known. They need this for unique indexes, but they also need it for transaction rollback, to remove old entries from the index when the transaction must abort. The section "6.7 Standard Practice" from the paper "Architecture of a Database System" [2] goes into this, saying: "All production databases today support ACID transactions. As a rule, they use write-ahead logging for durability, and two-phase locking for concurrency control. 
An exception is PostgreSQL, which uses multiversion concurrency control throughout." I suggest reading "6.7 Standard Practice" in full. Anyway, I think that *hundreds* or even *thousands* of rows are effectively locked all at once when a bitmap index needs to be updated in these other systems -- and I mean a heavyweight lock that lasts until the xact commits or aborts, like a Postgres row lock. As I said, this is necessary simply because the transaction might need to roll back. Of course, your patch never needs to do anything like that -- the only risk is that buffer lock contention will be increased. Maybe VACUUM isn't so bad after all! Doing deduplication adaptively and automatically in nbtree seems like it might play to the strengths of Postgres, while also ameliorating its weaknesses. As the same paper goes on to say, it's actually quite unusual that PostgreSQL has *transactional* full text search built in (using GIN), and offers transactional, high concurrency spatial indexing (using GiST). Actually, this is an additional advantage of our "pure" approach to MVCC -- we can add new high concurrency, transactional access methods relatively easily. [1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.3159&rep=rep1&type=pdf [2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova wrote: > The new version of the patch is attached. > This version is even simpler than the previous one, > thanks to the recent btree design changes and all the feedback I received. > I consider it ready for review and testing. I took a closer look at this patch, and have some general thoughts on its design, and specific feedback on the implementation. Preserving the *logical contents* of B-Tree indexes that use compression is very important -- that should not change in a way that outside code can notice. The heap TID itself should count as logical contents here, since we want to be able to implement retail index tuple deletion in the future. Even without retail index tuple deletion, amcheck's "rootdescend" verification assumes that it can find one specific tuple (which could now just be one specific "logical tuple") using specific key values from the heap, including the heap tuple's heap TID. This requirement makes things a bit harder for your patch, because you have to deal with one or two edge-cases that you currently don't handle: insertion of new duplicates that fall inside the min/max range of some existing posting list. That should be rare enough in practice, so the performance penalty won't be too bad. This probably means that code within _bt_findinsertloc() and/or _bt_binsrch_insert() will need to think about a logical tuple as a distinct thing from a physical tuple, though that won't be necessary in most places. The need to "preserve the logical contents" also means that the patch will need to recognize when indexes are not safe as a target for compression/deduplication (maybe we should call this feature deduplication, so it's clear how it differs from TOAST?). For example, if we have a case-insensitive ICU collation, then it is not okay to treat an opclass-equal pair of text strings that use the collation as having the same value when considering merging the two into one. 
You don't actually do that in the patch, but you also don't try to deal with the fact that such a pair of strings are equal, and so must have their final positions determined by the heap TID column (deduplication within _bt_compress_one_page() must respect that). Possibly equal-but-distinct values seems like a problem that's not worth truly fixing, but it will be necessary to store metadata about whether or not we're willing to do deduplication in the meta page, based on operator class and collation details. That seems like a restriction that we're just going to have to accept, though I'm not too worried about exactly what that will look like right now. We can work it out later. I think that the need to be careful about the logical contents of indexes already causes bugs, even with "safe for compression" indexes. For example, I can sometimes see an assertion failure within _bt_truncate(), at the point where we check if heap TID values are safe:

/*
 * Lehman and Yao require that the downlink to the right page, which is to
 * be inserted into the parent page in the second phase of a page split be
 * a strict lower bound on items on the right page, and a non-strict upper
 * bound for items on the left page. Assert that heap TIDs follow these
 * invariants, since a heap TID value is apparently needed as a
 * tiebreaker.
 */
#ifndef DEBUG_NO_TRUNCATE
Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
                          BTreeTupleGetMinTID(firstright)) < 0);
...

This bug is not that easy to see, but it will happen with a big index, even without updates or deletes. I think that this happens because compression can allow the "logical tuples" to be in the wrong heap TID order when there are multiple posting lists for the same value. As I said, I think that it's necessary to see a posting list as being comprised of multiple logical tuples in the context of inserting new tuples, even when you're not performing compression or splitting the page. 
I also see that amcheck's bt_index_parent_check() function fails, though bt_index_check() does not fail when I don't use any of its extra verification options. (You haven't updated amcheck, but I don't think that you need to update it for these basic checks to continue to work.)

Other feedback on specific things:

* A good way to assess whether or not the "logical tuple versus physical tuple" thing works is to make sure that amcheck's "rootdescend" verification works with a variety of indexes. As I said, it has the same requirements for nbtree as retail index tuple deletion will.

* _bt_findinsertloc() should not call _bt_compress_one_page() for !heapkeyspace (version 3) indexes -- the second call to _bt_compress_one_page() should be removed.

* Why can't compression be used on system catalog indexes? I understand that they are not a compelling case, but we tend to do things the same way with catalog tables and indexes unless there is a very good reason not to (e.g. HOT, suffix
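The heap-TID invariant behind the _bt_truncate() assertion quoted earlier — the largest TID on the left half of a split must compare strictly less than the smallest TID on the right half — can be sketched with a toy TID type. This is a simplified model for illustration; the struct and function names are not PostgreSQL's:

```c
#include <assert.h>

/* Toy stand-in for ItemPointerData: (block, offset), ordered lexicographically */
typedef struct TID
{
	int		block;
	int		offset;
} TID;

/* Returns -1, 0, or 1, like ItemPointerCompare() */
static int
tid_compare(TID a, TID b)
{
	if (a.block != b.block)
		return (a.block < b.block) ? -1 : 1;
	if (a.offset != b.offset)
		return (a.offset < b.offset) ? -1 : 1;
	return 0;
}

/*
 * For equal keys, a page split is only valid when the max TID on the
 * left page sorts strictly before the min TID on the right page.
 */
static int
split_invariant_holds(TID lastleft_max, TID firstright_min)
{
	return tid_compare(lastleft_max, firstright_min) < 0;
}
```

A posting list whose TID range straddles the split point would violate this check, which is why a posting list has to be treated as its component logical tuples when choosing a split point.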
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 4, 2019 at 05:06:09PM -0700, Peter Geoghegan wrote: > This result is very impressive. We'll need to revisit what the right > trade-off is for the compression scheme, which Heikki had some > thoughts on when we left off 3 years ago, but that should be a lot > easier now. I am very encouraged by the fact that this relatively > simple approach already works quite nicely. It's also great to see > that bulk insertions with lots of compression are very clearly faster > with this latest revision of your patch, unlike earlier versions from > 2016 that made those cases slower (though I haven't tested indexes > that don't really use compression). I think that this is because you > now do the compression lazily, at the point where it looks like we may > need to split the page. Previous versions of the patch had to perform > compression eagerly, just like GIN, which is not really appropriate > for nbtree. I am also encouraged and am happy we can finally move this duplicate optimization forward. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 4, 2019 at 10:38 AM Peter Geoghegan wrote: > I tried this on my own "UK land registry" test data [1], which was > originally used for the v12 nbtree work. My test case has a low > cardinality, multi-column text index. I chose this test case because > it was convenient for me. > > On v12/master, the index is 1100MB. Whereas with your patch, it ends > up being 196MB -- over 5.5x smaller! I also see a huge and consistent space saving for TPC-H. All 9 indexes are significantly smaller. The lineitem orderkey index is "just" 1/3 smaller, which is the smallest improvement among TPC-H indexes in my index bloat test case. The two largest indexes after the initial bulk load are *much* smaller: the lineitem parts supplier index is ~2.7x smaller, while the lineitem ship date index is a massive ~4.2x smaller. Also, the orders customer key index is ~2.8x smaller, and the order date index is ~2.43x smaller. Note that the test involved retail insertions, not CREATE INDEX. I haven't seen any regression in the size of any index so far, including when the number of internal pages is all that we measure. Actually, there seem to be cases where there is a noticeably larger reduction in internal pages than in leaf pages, probably because of interactions with suffix truncation. This result is very impressive. We'll need to revisit what the right trade-off is for the compression scheme, which Heikki had some thoughts on when we left off 3 years ago, but that should be a lot easier now. I am very encouraged by the fact that this relatively simple approach already works quite nicely. It's also great to see that bulk insertions with lots of compression are very clearly faster with this latest revision of your patch, unlike earlier versions from 2016 that made those cases slower (though I haven't tested indexes that don't really use compression). I think that this is because you now do the compression lazily, at the point where it looks like we may need to split the page. 
Previous versions of the patch had to perform compression eagerly, just like GIN, which is not really appropriate for nbtree. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova wrote: > i - number of distinct values in the index. > So i=1 means that all rows have the same key, > and i=1000 means that all keys are different.
>
> i / old size (MB) / new size (MB)
> 1 / 215 / 88
> 1000 / 215 / 90
> 10 / 215 / 71
> 1000 / 214 / 214
>
> For more, see the attached diagram with test results.

I tried this on my own "UK land registry" test data [1], which was originally used for the v12 nbtree work. My test case has a low cardinality, multi-column text index. I chose this test case because it was convenient for me. On v12/master, the index is 1100MB. Whereas with your patch, it ends up being 196MB -- over 5.5x smaller! I also tried it out with the "Mouse genome informatics" database [2], which was already improved considerably by the v12 work on duplicates. This is helped tremendously by your patch. It's not quite 5.5x across the board, of course. There are 187 indexes (on 28 tables), and almost all of the indexes are smaller. Actually, *most* of the indexes are *much* smaller. Very often 50% smaller. I don't have time to do an in-depth analysis of these results today, but clearly the patch is very effective on real world data. I think that we tend to underestimate just how common indexes with a huge number of duplicates are. [1] https://postgr.es/m/cah2-wzn_nayk4pr0hrwo0stwhmxjp5qyu+x8vppt030xpqr...@mail.gmail.com [2] http://www.informatics.jax.org/software.shtml -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
The new version of the patch is attached. This version is even simpler than the previous one, thanks to the recent btree design changes and all the feedback I received. I consider it ready for review and testing.

[feature overview]

This patch implements deduplication of btree non-pivot tuples on leaf pages in a manner similar to GIN index "posting lists". A non-pivot posting tuple has the following format:

t_tid | t_info | key values | posting_list[]

where the t_tid and t_info fields are used to store meta info about the tuple's posting list. The posting list is an array of ItemPointerData. Currently, compression is applied to all indexes except system indexes, unique indexes, and indexes with included columns. On insertion, compression is applied not to each tuple, but to the whole page before a split: if the target page is full, we try to compress it.

[benchmark results]

idx ON tbl(c1); the index contains 1000 integer values. i - number of distinct values in the index, so i=1 means that all rows have the same key, and i=1000 means that all keys are different.

i / old size (MB) / new size (MB)
1 / 215 / 88
1000 / 215 / 90
10 / 215 / 71
1000 / 214 / 214

For more, see the attached diagram with test results.

[future work]

Many things can be improved in this feature. Personally, I'd prefer to keep this patch as small as possible and work on other improvements after the basic part is committed, though I understand that some of these may be considered essential for this patch to be approved.

1. Implement a split of the posting tuples on a page split.
2. Implement microvacuum of posting tuples.
3. Add a flag into pg_index, which allows enabling/disabling compression for a particular index.
4. Implement posting list compression. 
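The page-level pass described in the overview (merge runs of equal keys into posting tuples only when a page would otherwise split) can be sketched over flat arrays. Everything here is illustrative — toy types with a fixed-capacity posting list, not the patch's actual _bt_compress_one_page():

```c
#include <assert.h>
#include <stddef.h>

/* Toy posting tuple: one key stored once, followed by its heap TIDs */
typedef struct PostingTuple
{
	int		key;		/* stands in for "key values" */
	size_t	ntids;		/* entries used in tids[] */
	int		tids[8];	/* stands in for ItemPointerData posting_list[] */
} PostingTuple;

/*
 * Deduplicate sorted (key, tid) pairs into posting tuples: each run of
 * equal keys collapses into one tuple, up to the toy capacity of 8 TIDs.
 * Returns the number of posting tuples produced.
 */
static size_t
deduplicate(const int *keys, const int *tids, size_t n, PostingTuple *out)
{
	size_t	nout = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (nout > 0 && out[nout - 1].key == keys[i] &&
			out[nout - 1].ntids < 8)
		{
			/* same key as previous tuple: append TID to its posting list */
			out[nout - 1].tids[out[nout - 1].ntids++] = tids[i];
		}
		else
		{
			/* new key (or posting list full): start a fresh tuple */
			out[nout].key = keys[i];
			out[nout].ntids = 1;
			out[nout].tids[0] = tids[i];
			nout++;
		}
	}
	return nout;
}
```

Because the input is already in (key, TID) order, the TIDs inside each posting list stay sorted, which is what lets the B-tree keep its usual invariants after deduplication.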
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..fce499b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 	}
 	else
 	{
@@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel,
 		}
 
 		/*
+		 * Before considering moving right, try to compress the page
+		 */
+		_bt_compress_one_page(rel, insertstate->buf, heapRel);
+
+		/*
 		 * Nope, so check conditions (b) and (c) enumerated above
 		 *
 		 * The earlier _bt_check_unique() call may well have established a
@@ -2286,3 +2300,232 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+		 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+	{
+		elog(DEBUG4, "insert_itupprev_to_page. failed");
+
+		/*
+		 * this may happen if tuple is bigger than freespace
+		 * fallback to uncompressed page case
+		 */
+		if (compressState->ntuples > 0)
+			pfree(to_insert);
+		return false;
+	}