Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 03/04/2014 01:58 PM, Amit Kapila wrote:
> On Mon, Mar 3, 2014 at 7:57 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>> On 02/16/2014 01:51 PM, Amit Kapila wrote:
>>> On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>>
>> Thanks. I have to agree with Robert though that using the pglz encoding
>> when we're just checking for a common prefix/suffix is a pretty crappy way
>> of going about it [1]. As the patch stands, it includes the NULL bitmap
>> when checking for a common prefix. That's probably not a good idea,
>> because it defeats the prefix detection in the common case that you update
>> a field from NULL to not-NULL or vice versa. Attached is a rewritten
>> version, which does the prefix/suffix tests directly in heapam.c, and adds
>> the prefix/suffix lengths directly as fields in the WAL record. If you
>> could take one more look at this version, to check if I've missed
>> anything.
>
> I had verified the patch and found a few minor points: ...

Fixed those.

> One question:
>
> + rdata[1].data = (char *) &xlrec;
>
> Earlier it seems to store the record header as the first segment rdata[0];
> what's the reason for changing it?

I found the code easier to read that way. The order of rdata entries used to be:

0: xl_heap_update struct
1: full-page reference to oldbuf (no data)
2: xl_heap_header_len struct for the new tuple
3-7: logical decoding stuff

The prefix/suffix fields made that order a bit awkward, IMHO. They are logically part of the header, even though they're not part of the struct (they are documented in comments inside the struct), so they ought to stay together with the xl_heap_update struct. Another option would've been to move them after the xl_heap_header_len struct.

Note that this doesn't affect the on-disk format of the WAL record, because the moved rdata entry is just a full-page reference, with no payload of its own.

> I have verified the patch by doing crash recovery for the scenarios below,
> and it worked fine:
> a. no change in old and new tuple
> b. all changed in new tuple
> c. half changed (update half of the values to NULLs) in new tuple
> d. only prefix same in new tuple
> e. only suffix same in new tuple
> f. prefix/suffix same, other column values changed in new tuple

Thanks!

> Conclusion is that the patch shows good WAL reduction and performance
> improvement for favourable cases without CPU overhead for non-favourable
> cases.

Ok, great. Committed!

I left out the regression tests. It was good to have them while developing this, but I don't think there's a lot of value in including them permanently in the regression suite. Low-level things like the alignment-sensitive test are fragile, and can easily stop testing the thing they're supposed to test, depending on the platform and future changes in the code. And the current algorithm doesn't care much about alignment anyway.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Mar 12, 2014 at 5:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> Ok, great. Committed!

Awesome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Mar 4, 2014 at 4:13 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> Agreed. Amit, do you have the test setup at hand, can you check the
> performance of this one more time?

Are you expecting more performance numbers than I have posted? Is there anything more left for the patch which you are expecting?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 03/03/2014 04:57 PM, Andres Freund wrote:
> On 2014-03-03 16:27:05 +0200, Heikki Linnakangas wrote:
>> Attached is a rewritten version, which does the prefix/suffix tests
>> directly in heapam.c, and adds the prefix/suffix lengths directly as
>> fields in the WAL record. If you could take one more look at this version,
>> to check if I've missed anything.
>
> Have you rerun the benchmarks?

No.

> I'd guess the CPU overhead of this version is lower than earlier versions,

That's what I would expect too.

> but seeing it tested won't be a bad idea.

Agreed. Amit, do you have the test setup at hand, can you check the performance of this one more time?

Also, I removed the GUC and table level options, on the assumption that this is cheap enough even when it's not helping that we don't need to make it configurable.

>> This ought to be tested with the new logical decoding stuff as it modified
>> the WAL update record format which the logical decoding stuff also relies
>> on, but I don't know anything about that.
>
> Hm, I think all it needs to do is disable delta encoding if need_tuple_data
> (which is dependent on wal_level=logical).

That's a pity, but we can live with it. If we did this at a higher level and checked which columns have been modified, we could include just the modified fields in the record, which should be enough for logical decoding. It might be even more useful for logical decoding to know exactly which fields were changed.

- Heikki
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 2014-03-04 12:43:48 +0200, Heikki Linnakangas wrote:
>>> This ought to be tested with the new logical decoding stuff as it modified
>>> the WAL update record format which the logical decoding stuff also relies
>>> on, but I don't know anything about that.
>>
>> Hm, I think all it needs to do is disable delta encoding if
>> need_tuple_data (which is dependent on wal_level=logical).
>
> That's a pity, but we can live with it.

Agreed. This is hardly the first optimization that only works for some wal_levels.

> If we did this at a higher level and checked which columns have been
> modified, we could include just the modified fields in the record, which
> should be enough for logical decoding. It might be even more useful for
> logical decoding to know exactly which fields were changed.

Yea, I argued that way elsewhere in this thread. I do think we're going to need per-column info for further features in the near future. It's a bit absurd that we're computing various sets of changed columns (HOT, key, identity) plus the pre/postfix with this patchset.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Mon, Mar 3, 2014 at 7:57 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> On 02/16/2014 01:51 PM, Amit Kapila wrote:
>> On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>
> Thanks. I have to agree with Robert though that using the pglz encoding
> when we're just checking for a common prefix/suffix is a pretty crappy way
> of going about it [1]. As the patch stands, it includes the NULL bitmap
> when checking for a common prefix. That's probably not a good idea, because
> it defeats the prefix detection in the common case that you update a field
> from NULL to not-NULL or vice versa. Attached is a rewritten version, which
> does the prefix/suffix tests directly in heapam.c, and adds the
> prefix/suffix lengths directly as fields in the WAL record. If you could
> take one more look at this version, to check if I've missed anything.

I had verified the patch and found a few minor points:

1.
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+                  HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+                  HeapTuple newtup);

Declarations for the above functions are not required now.

2.
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)

Here, I think we can change the comment to avoid the word EWT (Encoded WAL tuple), as we have now changed the compression mechanism and it's not used anywhere else.

One question:

+ rdata[1].data = (char *) &xlrec;

Earlier it seems to store the record header as the first segment rdata[0]; what's the reason for changing it?

I have verified the patch by doing crash recovery for the scenarios below, and it worked fine:
a. no change in old and new tuple
b. all changed in new tuple
c. half changed (update half of the values to NULLs) in new tuple
d. only prefix same in new tuple
e. only suffix same in new tuple
f. prefix/suffix same, other column values changed in new tuple

Performance Data
----------------
Non-default settings:
autovacuum = off
checkpoint_segments = 256
checkpoint_timeout = 15min
full_page_writes = off

Unpatched

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     573506704 | 9.56587505340576
 one short and one long field, no change |     575351216 | 9.97713398933411
 one short and one long field, no change |     573501848 | 9.76377606391907
 hundred tiny fields, all changed        |     364894056 | 13.3053929805756
 hundred tiny fields, all changed        |     364891536 | 13.3533811569214
 hundred tiny fields, all changed        |     364889264 | 13.3041989803314
 hundred tiny fields, half changed       |     365411920 | 14.1831648349762
 hundred tiny fields, half changed       |     365918216 | 13.6393811702728
 hundred tiny fields, half changed       |     366456552 | 13.6420011520386
 hundred tiny fields, half nulled        |     300705288 | 12.8859741687775
 hundred tiny fields, half nulled        |     301665624 | 12.6988201141357
 hundred tiny fields, half nulled        |     300700504 | 13.3536100387573
 9 short and 1 long, short changed       |     396983080 | 8.83671307563782
 9 short and 1 long, short changed       |     396987976 | 9.23769211769104
 9 short and 1 long, short changed       |     396984080 | 9.45178604125977

wal-update-prefix-suffix-5.patch

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     156278832 | 6.69434094429016
 one short and one long field, no change |     156277352 | 6.70855903625488
 one short and one long field, no change |     156280040 | 6.70657396316528
 hundred tiny fields, all changed        |     364895152 | 13.6677348613739
 hundred tiny fields, all changed        |     364892256 | 12.7107839584351
 hundred tiny fields, all changed        |     364890424 | 13.7760601043701
 hundred tiny fields, half changed       |     365970360 | 13.1902158260345
 hundred tiny fields, half changed       |     364895120 | 13.5730090141296
 hundred tiny fields, half changed       |     367031168 | 13.7023210525513
 hundred tiny fields, half nulled        |     204418576 | 12.1997199058533
 hundred tiny fields, half nulled        |     204422880 | 11.4583330154419
 hundred tiny fields, half nulled        |     204417464 | 12.0228970050812
 9 short and 1 long, short changed       |     220466016 | 8.14843511581421
 9 short and 1 long, short changed       |     220471168 | 8.03712797164917
 9 short and 1 long, short changed       |     220464464 | 8.55907511711121
(15 rows)

Conclusion is that the patch shows good WAL reduction and performance improvement for favourable cases without CPU overhead for non-favourable cases.
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 02/16/2014 01:51 PM, Amit Kapila wrote:
> On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>> I'm pretty sure the overhead of that would be negligible, so we could
>> always enable it. There are certainly a lot of scenarios where
>> prefix/suffix detection alone wouldn't help, but so what. Attached is a
>> quick patch for that, if you want to test it.
>
> I have updated the patch to correct a few problems, addressed review
> comments by Andres, and done some optimizations to improve CPU overhead for
> the worst case.

Thanks. I have to agree with Robert though that using the pglz encoding when we're just checking for a common prefix/suffix is a pretty crappy way of going about it [1].

As the patch stands, it includes the NULL bitmap when checking for a common prefix. That's probably not a good idea, because it defeats the prefix detection in the common case that you update a field from NULL to not-NULL or vice versa.

Attached is a rewritten version, which does the prefix/suffix tests directly in heapam.c, and adds the prefix/suffix lengths directly as fields in the WAL record. If you could take one more look at this version, to check if I've missed anything.

This ought to be tested with the new logical decoding stuff as it modified the WAL update record format which the logical decoding stuff also relies on, but I don't know anything about that.
[1] http://www.postgresql.org/message-id/ca+tgmozstdqdku7dhcljchvmbrh1_yfouse+fuxesevnc4j...@mail.gmail.com

- Heikki

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index de4befa..cf23db5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6594,10 +6594,15 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	xl_heap_header_len xlhdr;
 	xl_heap_header_len xlhdr_idx;
 	uint8		info;
+	uint16		prefix_suffix[2];
+	uint16		prefixlen = 0,
+				suffixlen = 0;
 	XLogRecPtr	recptr;
-	XLogRecData rdata[7];
+	XLogRecData rdata[9];
 	Page		page = BufferGetPage(newbuf);
 	bool		need_tuple_data = RelationIsLogicallyLogged(reln);
+	int			nr;
+	Buffer		newbufref;
 
 	/* Caller should not call me on a non-WAL-logged relation */
 	Assert(RelationNeedsWAL(reln));
@@ -6607,6 +6612,54 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	else
 		info = XLOG_HEAP_UPDATE;
 
+	/*
+	 * If the old and new tuple are on the same page, we only need to log
+	 * the parts of the new tuple that were changed. That saves on the amount
+	 * of WAL we need to write. Currently, we just count any unchanged bytes
+	 * in the beginning and end of the tuples. That's quick to check, and
+	 * perfectly covers the common case that only one field is updated.
+	 *
+	 * We could do this even if the old and new tuple are on different pages,
+	 * but only if we don't make a full-page image of the old page, which is
+	 * difficult to know in advance. Also, if the old tuple is corrupt for
+	 * some reason, it would allow propagating the corruption to the new
+	 * page, so it seems best to avoid. Under the general assumption that for
+	 * long runs most updates tend to create the new tuple version on the
+	 * same page, there should not be significant impact on WAL reduction or
+	 * performance.
+	 *
+	 * Skip this if we're taking a full-page image of the new page, as we
+	 * don't include the new tuple in the WAL record in that case.
+	 */
+	if (oldbuf == newbuf &&
+		(need_tuple_data || !XLogCheckBufferNeedsBackup(newbuf)))
+	{
+		char	   *oldp = (char *) oldtup->t_data + oldtup->t_data->t_hoff;
+		char	   *newp = (char *) newtup->t_data + newtup->t_data->t_hoff;
+		int			oldlen = oldtup->t_len - oldtup->t_data->t_hoff;
+		int			newlen = newtup->t_len - newtup->t_data->t_hoff;
+
+		/* Check for common prefix between old and new tuple */
+		for (prefixlen = 0; prefixlen < Min(oldlen, newlen); prefixlen++)
+		{
+			if (newp[prefixlen] != oldp[prefixlen])
+				break;
+		}
+
+		/*
+		 * Storing the length of the prefix takes 2 bytes, so we need to save
+		 * at least 3 bytes or there's no point.
+		 */
+		if (prefixlen < 3)
+			prefixlen = 0;
+
+		/* Same for suffix */
+		for (suffixlen = 0; suffixlen < Min(oldlen, newlen) - prefixlen; suffixlen++)
+		{
+			if (newp[newlen - suffixlen - 1] != oldp[oldlen - suffixlen - 1])
+				break;
+		}
+		if (suffixlen < 3)
+			suffixlen = 0;
+	}
+
 	xlrec.target.node = reln->rd_node;
 	xlrec.target.tid = oldtup->t_self;
 	xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
@@ -6619,41 +6672,119 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	xlrec.newtid = newtup->t_self;
 	if (new_all_visible_cleared)
 		xlrec.flags |= XLOG_HEAP_NEW_ALL_VISIBLE_CLEARED;
+	if (prefixlen > 0)
+		xlrec.flags |= XLOG_HEAP_PREFIX_FROM_OLD;
+	if (suffixlen > 0)
+		xlrec.flags |= XLOG_HEAP_SUFFIX_FROM_OLD;
 
-	rdata[0].data = (char *) &xlrec;
-	rdata[0].len = SizeOfHeapUpdate;
-	rdata[0].buffer = InvalidBuffer;
+	/* If new tuple is the single and first tuple on page... */
+	if (ItemPointerGetOffsetNumber(&(newtup->t_self)) ==
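[Editor's note: the prefix/suffix scan in the hunk above can be pulled out into a standalone sketch for readers who want to experiment with it. The function name and the local Min macro below are illustrative, not symbols from the patch; it operates on plain byte buffers rather than heap tuples.]

```c
#include <assert.h>

#define Min(x, y) ((x) < (y) ? (x) : (y))

/*
 * Standalone sketch of the prefix/suffix detection in log_heap_update().
 * Reports how many leading and trailing bytes the old and new tuple data
 * share, applying the same "2-byte length field needs a 3+ byte win" cutoff.
 */
static void
find_common_affixes(const char *oldp, int oldlen,
					const char *newp, int newlen,
					int *prefixlen, int *suffixlen)
{
	int			pre;
	int			suf;

	/* Count matching bytes from the front */
	for (pre = 0; pre < Min(oldlen, newlen); pre++)
	{
		if (newp[pre] != oldp[pre])
			break;
	}
	/* Storing the prefix length costs 2 bytes, so require a 3+ byte win */
	if (pre < 3)
		pre = 0;

	/* Count matching bytes from the back, never overlapping the prefix */
	for (suf = 0; suf < Min(oldlen, newlen) - pre; suf++)
	{
		if (newp[newlen - suf - 1] != oldp[oldlen - suf - 1])
			break;
	}
	if (suf < 3)
		suf = 0;

	*prefixlen = pre;
	*suffixlen = suf;
}
```

With old data "aaaaXXXXbbbb" and new data "aaaaYYYYbbbb" this reports a 4-byte prefix and a 4-byte suffix, so only the middle 4 bytes of the new tuple would need to go into the WAL record.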
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 2014-03-03 16:27:05 +0200, Heikki Linnakangas wrote:
> Thanks. I have to agree with Robert though that using the pglz encoding
> when we're just checking for a common prefix/suffix is a pretty crappy way
> of going about it [1]. As the patch stands, it includes the NULL bitmap
> when checking for a common prefix. That's probably not a good idea, because
> it defeats the prefix detection in the common case that you update a field
> from NULL to not-NULL or vice versa. Attached is a rewritten version, which
> does the prefix/suffix tests directly in heapam.c, and adds the
> prefix/suffix lengths directly as fields in the WAL record. If you could
> take one more look at this version, to check if I've missed anything.

Have you rerun the benchmarks? I'd guess the CPU overhead of this version is lower than earlier versions, but seeing it tested won't be a bad idea.

> This ought to be tested with the new logical decoding stuff as it modified
> the WAL update record format which the logical decoding stuff also relies
> on, but I don't know anything about that.

Hm, I think all it needs to do is disable delta encoding if need_tuple_data (which is dependent on wal_level=logical).

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 2014-03-03 10:35:03 -0500, Robert Haas wrote:
> On Mon, Mar 3, 2014 at 9:57 AM, Andres Freund and...@2ndquadrant.com wrote:
>> Hm, I think all it needs to do is disable delta encoding if
>> need_tuple_data (which is dependent on wal_level=logical).
>
> Why does it need to do that? The logical decoding stuff should be able to
> reverse out the delta encoding.

Against what should it perform the delta? Unless I misunderstand how the patch works, it computes the delta against the old tuple in the heap page?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Mon, Mar 3, 2014 at 9:57 AM, Andres Freund and...@2ndquadrant.com wrote:
> Hm, I think all it needs to do is disable delta encoding if need_tuple_data
> (which is dependent on wal_level=logical).

Why does it need to do that? The logical decoding stuff should be able to reverse out the delta encoding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Mon, Mar 3, 2014 at 10:38 AM, Andres Freund and...@2ndquadrant.com wrote:
> On 2014-03-03 10:35:03 -0500, Robert Haas wrote:
>> On Mon, Mar 3, 2014 at 9:57 AM, Andres Freund and...@2ndquadrant.com wrote:
>>> Hm, I think all it needs to do is disable delta encoding if
>>> need_tuple_data (which is dependent on wal_level=logical).
>>
>> Why does it need to do that? The logical decoding stuff should be able to
>> reverse out the delta encoding.
>
> Against what should it perform the delta? Unless I misunderstand how the
> patch works, it computes the delta against the old tuple in the heap page?

Oh, maybe I need more caffeine.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
Hi,

Some quick review comments:

On 2014-02-13 18:14:54 +0530, Amit Kapila wrote:
> +	/*
> +	 * EWT can be generated for all new tuple versions created by Update
> +	 * operation. Currently we do it when both the old and new tuple versions
> +	 * are on same page, because during recovery if the page containing old
> +	 * tuple is corrupt, it should not cascade that corruption to other pages.
> +	 * Under the general assumption that for long runs most updates tend to
> +	 * create new tuple version on same page, there should not be significant
> +	 * impact on WAL reduction or performance.
> +	 *
> +	 * We should not generate EWT when we need to backup the whole block in
> +	 * WAL as in that case there is no saving by reduced WAL size.
> +	 */
> +
> +	if (RelationIsEnabledForWalCompression(reln) &&
> +		(oldbuf == newbuf) &&
> +		!XLogCheckBufferNeedsBackup(newbuf))
> +	{
> +		uint32		enclen;

You should note that the check for RelationIsEnabledForWalCompression() here is racy and that that's OK because the worst that can happen is a uselessly generated delta.

> 	xlrec.target.node = reln->rd_node;
> 	xlrec.target.tid = oldtup->t_self;
> 	xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
> @@ -6619,6 +6657,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
> 	xlrec.newtid = newtup->t_self;
> 	if (new_all_visible_cleared)
> 		xlrec.flags |= XLOG_HEAP_NEW_ALL_VISIBLE_CLEARED;
> +	if (compressed)
> +		xlrec.flags |= XLOG_HEAP_DELTA_ENCODED;

I think this also needs to unset XLOG_HEAP_CONTAINS_NEW_TUPLE and be conditional on !need_tuple_data.

> +/*
> + * Determine whether the buffer referenced has to be backed up. Since we don't
> + * yet have the insert lock, fullPageWrites and forcePageWrites could change
> + * later, but will not cause any problem because this function is used only to
> + * identify whether EWT is required for update.
> + */
> +bool
> +XLogCheckBufferNeedsBackup(Buffer buffer)
> +{

Should note very, very boldly that this can only be used in contexts where a race is acceptable.

> diff --git a/src/backend/utils/adt/pg_rbcompress.c b/src/backend/utils/adt/pg_rbcompress.c
> new file mode 100644
> index 000..877ccd7
> --- /dev/null
> +++ b/src/backend/utils/adt/pg_rbcompress.c
> @@ -0,0 +1,355 @@
> +/* ----------
> + * pg_rbcompress.c -
> + *
> + *		This is a delta encoding scheme specific to PostgreSQL and designed
> + *		to compress similar tuples. It can be used as is or extended for
> + *		other purposes in PostgreSQL if required.
> + *
> + *		Currently, this just checks for a common prefix and/or suffix, but
> + *		the output format is similar to the LZ format used in pg_lzcompress.c.
> + *
> + * Copyright (c) 1999-2014, PostgreSQL Global Development Group
> + *
> + * src/backend/utils/adt/pg_rbcompress.c
> + * ----------
> + */

This needs significantly more explanation about the algorithm and the reasoning behind it.

> +static const PGRB_Strategy strategy_default_data = {
> +	32,			/* Data chunks less than 32 bytes are not
> +				 * compressed */
> +	INT_MAX,	/* No upper limit on what we'll try to
> +				 * compress */
> +	35,			/* Require 25% compression rate, or not worth
> +				 * it */
> +};

"compression rate" looks like a mismatch between comment and code.

> +/* ----------
> + * pgrb_out_ctrl -
> + *
> + *		Outputs the last and allocates a new control byte if needed.
> + * ----------
> + */
> +#define pgrb_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
> +do { \
> +	if ((__ctrl & 0xff) == 0) \
> +	{ \
> +		*(__ctrlp) = __ctrlb; \
> +		__ctrlp = (__buf)++; \
> +		__ctrlb = 0; \
> +		__ctrl = 1; \
> +	} \
> +} while (0)
> +

Double underscore variables are reserved for the compiler.
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Sat, Feb 15, 2014 at 8:21 PM, Andres Freund and...@2ndquadrant.com wrote:
> Hi,
>
> Some quick review comments:

Thanks for the review. I shall handle/reply to the comments with the updated version, in which I am planning to fix a bug (right now preparing a test to reproduce it) in this code.

Bug: the tag can handle a maximum length of 273 bytes, but this patch is not considering it.

> I have to admit, I have serious doubts about this approach. I have a very
> hard time believing this won't cause performance regression in many common
> cases...

Actually, till now I was majorly focusing on the worst case (i.e. at the boundary of the compression ratio), thinking that most others will do good. However I shall produce data for much more common cases as well. Please let me know if you have anything specific in mind where this will not work well.

> More importantly I don't think doing the compression on this level is that
> interesting. I know Heikki argued for it, but I think extending the bitmap
> that's computed for HOT to cover all columns and doing this on a column
> level sounds much more sensible to me.

Previously we have tried to do it at column boundaries, but the main problem turned out to be in worst cases where we spend time in extracting values from tuples based on column boundaries and later find that the data is not compressible.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
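[Editor's note: the 273-byte limit Amit mentions comes from the pglz-style tag layout that pg_rbcompress borrows from pg_lzcompress.c: a match tag is normally two bytes, holding a 4-bit length nibble and a 12-bit offset. Nibble values 0..14 encode match lengths 3..17 inline; the value 15 flags a third byte that stores (length - 18). The sketch below is illustrative arithmetic, not code from the patch.]

```c
#include <assert.h>

/*
 * How many bytes a single pglz-style match tag occupies: two bytes for
 * lengths 3..17, three bytes (with a length-extension byte) for 18..273.
 */
static int
tag_bytes_for_match(int len)
{
	return (len > 17) ? 3 : 2;	/* extended tags carry an extra length byte */
}

/*
 * The longest match one tag can represent: length nibble 15 means
 * "18 plus the value of the extension byte", and that byte maxes at 255.
 */
static int
max_match_len(void)
{
	return 18 + 255;			/* = 273 */
}
```

A longer run of matching bytes therefore has to be split into multiple tags, which is presumably the case the patch was not yet handling.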
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 2014-02-15 21:01:07 +0530, Amit Kapila wrote:
>> More importantly I don't think doing the compression on this level is that
>> interesting. I know Heikki argued for it, but I think extending the bitmap
>> that's computed for HOT to cover all columns and doing this on a column
>> level sounds much more sensible to me.
>
> Previously we have tried to do it at column boundaries, but the main
> problem turned out to be in worst cases where we spend time in extracting
> values from tuples based on column boundaries and later find that the data
> is not compressible.

I think that hugely depends on how you implement it. I think you'd need to have a loop traversing over both tuples at the same time on the level of heap_deform_tuple(). If you'd use the result to get rid of HeapSatisfiesHOTandKeyUpdate() at the same time, I am pretty sure you wouldn't see very high overhead.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Feb 13, 2014 at 10:20:46AM +0530, Amit Kapila wrote:
>> Why not *only* prefix/suffix?
>
> To represent a prefix/suffix match, we at least need a way to tell the
> offset and len of the matched bytes, and then how much is the length of the
> unmatched bytes we have copied. I agree that a simpler format could be
> devised if we just want to do prefix/suffix match, but that would require
> much more testing during recovery to ensure everything is fine. The
> advantage with the LZ format is that we don't need to bother about
> decoding; it will work without much change in the LZ decode routine.

Based on the numbers I think prefix/suffix-only needs to be explored. Consider if you just change one field of a row --- prefix/suffix would find all the matching parts. If you change the first and last fields, you get no compression at all, but your prefix/suffix test isn't going to get that either.

As I understand it, the only place prefix/suffix with LZ compression is a win over prefix/suffix-only is when you change two middle fields, and there are common fields unchanged between them. If we are looking at 11% CPU overhead for that, it isn't worth it.

--
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ Everyone has their own god. +
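[Editor's note: Bruce's argument can be made concrete with a toy example. The "row layout" strings below are invented purely for illustration, not real heap tuples: a change to a single middle field still leaves a long shared prefix and suffix, while changing the first and last fields leaves almost nothing for a prefix/suffix scheme to reuse — and, as he notes, LZ would not fare better on that case either.]

```c
#include <assert.h>

/*
 * Toy model: returns how many bytes of the new row a prefix/suffix-only
 * scheme could omit from WAL (shared leading bytes plus shared trailing
 * bytes, non-overlapping). No minimum-savings cutoff is applied here.
 */
static int
affix_savings(const char *oldp, const char *newp, int len)
{
	int			pre = 0;
	int			suf = 0;

	while (pre < len && oldp[pre] == newp[pre])
		pre++;
	while (suf < len - pre && oldp[len - suf - 1] == newp[len - suf - 1])
		suf++;
	return pre + suf;
}
```

Changing only the middle field of "id=7|name=bob|age=30" to "id=7|name=sue|age=30" saves 17 of 20 bytes; changing the first and last fields instead ("id=8|name=bob|age=31") saves only 3.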
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 12, 2014 at 10:02:32AM +0530, Amit Kapila wrote: By issue, I assume you mean to say, which compression algorithm is best for this patch. For this patch, currently we have 2 algorithm's for which results have been posted. As far as I understand Heikki is pretty sure that the latest algorithm (compression using prefix-suffix match in old and new tuple) used for this patch is better than the other algorithm in terms of CPU gain or overhead. The performance data taken by me for the worst case for this algorithm shows there is a CPU overhead for this algorithm as well. OTOH the another algorithm (compression using old tuple as history) can be a bigger win in terms I/O reduction in more number of cases. In short, it is still not decided which algorithm to choose and whether it can be enabled by default or it is better to have table level switch to enable/disable it. So I think the decision to be taken here is about below points: 1. Are we okay with I/O reduction at the expense of CPU for *worst* cases and I/O reduction without impacting CPU (better overall tps) for *favourable* cases? 2. If we are not okay with worst case behaviour, then can we provide a table-level switch, so that it can be decided by user? 3. If none of above, then is there any other way to mitigate the worst case behaviour or shall we just reject this patch and move on. Given a choice to me, I would like to go with option-2, because I think for most cases UPDATE statement will have same data for old and new tuples except for some part of tuple (generally column's having large text data are not modified), so we will be end up mostly in favourable cases and surely for worst cases we don't want user to suffer from CPU overhead, so a table-level switch is also required. I think 99.9% of users are never going to adjust this so we had better choose something we are happy to enable for effectively everyone. In my reading, prefix/suffix seemed safe for everyone. 
We can always revisit this if we think of something better later, as WAL format changes are not a problem for pg_upgrade. I also think making it user-tunable is so hard for users to know when to adjust as to be almost not worth the user interface complexity it adds. I suggest we go with always-on prefix/suffix mode, then add some check so the worst case is avoided by just giving up on compression. As I said previously, I think compressing the page images is the next big win in this area. I think here one might argue that for some users it is not feasible to decide whether their tuples data for UPDATE is going to be similar or completely different and they are not at all ready for any risk for CPU overhead, but they would be happy to see I/O reduction in which case it is difficult to decide what should be the value of table-level switch. Here I think the only answer is nothing is free in this world, so either make sure about the application's behaviour for UPDATE statement before going to production or just don't enable this switch and be happy with the current behaviour. Again, can't set do a minimal attempt at prefix/suffix compression so there is no measurable overhead? On the other side there will be users who will be pretty certain about their usage of UPDATE statement or atleast are ready to evaluate their application if they can get such a huge gain, so it would be quite useful feature for such users. can we move move forward with the full-page compression patch? In my opinion, it is not certain that whatever compression algorithm got decided for this patch (if any) can be directly used for full-page compression, some ideas could be used or may be the algorithm could be tweaked a bit to make it usable for full-page compression. Thanks, I understand that now. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. 
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 12, 2014 at 8:19 PM, Bruce Momjian br...@momjian.us wrote: On Wed, Feb 12, 2014 at 10:02:32AM +0530, Amit Kapila wrote: I think 99.9% of users are never going to adjust this, so we had better choose something we are happy to enable for effectively everyone. In my reading, prefix/suffix seemed safe for everyone. We can always revisit this if we think of something better later, as WAL format changes are not a problem for pg_upgrade. Agreed. I also think that with a user-tunable it is so hard for users to know when to adjust it that it is almost not worth the user-interface complexity it adds. I suggest we go with always-on prefix/suffix mode, then add some check so the worst case is avoided by just giving up on compression. As I said previously, I think compressing the page images is the next big win in this area. I think here one might argue that for some users it is not feasible to decide whether their tuple data for UPDATE is going to be similar or completely different, and they are not at all ready for any risk of CPU overhead, but they would be happy to see I/O reduction; in that case it is difficult to decide what the value of the table-level switch should be. Here I think the only answer is that nothing is free in this world, so either make sure about the application's behaviour for UPDATE statements before going to production, or just don't enable this switch and be happy with the current behaviour. Again, can't we do a minimal attempt at prefix/suffix compression so there is no measurable overhead? Yes, currently it is set at 25%, which means there must be at least a 25% match in the prefix and suffix before we consider the tuple for compression; that check is pretty fast and has almost no overhead. But the worst case here is the other way around, i.e. when the string has a 25% match in prefix-suffix, but after that there is no match, or at least no match in the next few bytes.
For example, consider the below two cases: Case-1 old tuple aaa new tuple bbb bbba Here there is a suffix match for 25% of the string, but after that there is no match, so we have to copy the remaining 75% of the bytes as-is, byte by byte. With somewhat longer tuples (800 bytes), the performance data taken by me shows around ~11% CPU overhead. Now, as this is a fabricated test just to see how much extra CPU the worst scenario consumes, in reality users might not see this, at least with synchronous commit on, because there is always some I/O involved at the end of a transaction (unless there is an error in between or the user rolls back the transaction, the chances of which are very low). The first thing that comes to mind after seeing the above scenario is: why not increase the minimum limit of 25%, since comparing prefix and suffix has almost negligible overhead? I have tried increasing it to 35% or more, but in that case it starts failing from the other side, e.g. for cases where there is a 34% match we still return without compressing. Here, one of the improvements which can be done is that after the prefix-suffix match, instead of doing a byte-by-byte copy as per the LZ format, we could directly copy the entire remaining part of the tuple; but that would require a format other than LZ, which is also not too difficult to do. The question is whether we really need such a change to handle the above kind of worst case. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
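The probe being discussed (scan for a shared prefix and suffix, then give up unless the match clears a minimum percentage) can be sketched roughly as follows. This is illustrative C only, with assumed names such as `prefix_suffix_probe` and the `need_rate` handling; it is not the actual heapam.c or pgrb code.

```c
#include <stddef.h>
#include <stdbool.h>

/*
 * Sketch of the prefix/suffix probe discussed above.  Returns true when
 * the combined prefix+suffix match covers at least need_rate percent of
 * the old tuple, the threshold below which compression is not attempted.
 */
static bool
prefix_suffix_probe(const char *oldt, size_t oldlen,
                    const char *newt, size_t newlen,
                    int need_rate,
                    size_t *prefixlen, size_t *suffixlen)
{
    size_t  minlen = oldlen < newlen ? oldlen : newlen;
    size_t  p = 0,
            s = 0;

    /* scan forward for the common prefix */
    while (p < minlen && oldt[p] == newt[p])
        p++;

    /* scan backward for the common suffix, never overlapping the prefix */
    while (s < minlen - p &&
           oldt[oldlen - s - 1] == newt[newlen - s - 1])
        s++;

    *prefixlen = p;
    *suffixlen = s;
    return p + s >= (oldlen * need_rate) / 100;
}
```

The scan itself is a pair of tight byte-comparison loops, which is why the thread describes the check as "pretty fast and almost no overhead"; the cost discussed above comes only after the probe succeeds.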
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Feb 13, 2014 at 1:20 AM, Amit Kapila amit.kapil...@gmail.com wrote: Here one of the improvements which can be done is that after prefix-suffix match, instead of going byte-by-byte copy as per LZ format we can directly copy all the remaining part of tuple but I think that would require us to use some different format than LZ which is also not too difficult to do, but the question is do we really need such a change to handle the above kind of worst case. Why use LZ at all? Why not *only* prefix/suffix?
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Feb 13, 2014 at 10:07 AM, Claudio Freire klaussfre...@gmail.com wrote: On Thu, Feb 13, 2014 at 1:20 AM, Amit Kapila amit.kapil...@gmail.com wrote: Here one of the improvements which can be done is that after the prefix-suffix match, instead of doing a byte-by-byte copy as per the LZ format, we could directly copy the entire remaining part of the tuple; but that would require a format other than LZ, which is also not too difficult to do. The question is whether we really need such a change to handle the above kind of worst case. Why use LZ at all? We are just using the LZ *format* to represent the compressed string. I have copied some text from pg_lzcompress.c to explain what exactly we are using: the first byte after the header tells what to do for the next 8 items. We call this the control byte. An unset bit in the control byte means that one uncompressed byte follows, which is copied from input to output. A set bit in the control byte means that a tag of 2-3 bytes follows. A tag contains the information needed to copy some bytes, which are already in the output buffer, to the current location in the output. Why not *only* prefix/suffix? To represent a prefix/suffix match, we at least need a way to record the offset and length of the matched bytes, and then the length of the unmatched bytes we have copied. I agree that a simpler format could be devised if we just want to do prefix-suffix matching, but that would require much more testing during recovery to ensure everything is fine; the advantage of the LZ format is that we don't need to bother about decoding, as it works without much change in the LZ decode routine. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
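To make the control-byte scheme concrete, here is a minimal decoder for a simplified version of that format, handling 2-byte tags only (the real pglz also has a 3-byte extended tag for long matches). The function name and exact tag layout here are illustrative, not the actual pg_lzcompress.c code, and the decoder assumes a well-formed input stream.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified LZ decoder sketch.  A clear bit in the control byte means
 * "copy one literal byte from the input"; a set bit means "copy len
 * bytes starting off bytes back in the OUTPUT buffer" (which is how a
 * match can be longer than the offset: it replays already-decoded data).
 */
static size_t
lz_decode(const uint8_t *src, size_t srclen, uint8_t *dst)
{
    size_t  di = 0,
            si = 0;

    while (si < srclen)
    {
        uint8_t ctrl = src[si++];
        int     bit;

        for (bit = 0; bit < 8 && si < srclen; bit++)
        {
            if (ctrl & 1)
            {
                /*
                 * 2-byte tag: low nibble of byte 0 is length - 3, high
                 * nibble plus byte 1 form the 12-bit back-offset.
                 */
                int len = (src[si] & 0x0f) + 3;
                int off = ((src[si] & 0xf0) << 4) | src[si + 1];

                si += 2;
                while (len--)
                {
                    dst[di] = dst[di - off];
                    di++;
                }
            }
            else
                dst[di++] = src[si++];  /* literal byte */
            ctrl >>= 1;
        }
    }
    return di;
}
```

For example, "abcabcabc" can be encoded as three literals followed by a single tag meaning "go back 3 bytes and copy 6", all governed by one control byte.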
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 11, 2014 at 11:37 AM, Bruce Momjian br...@momjian.us wrote: Yes, that was my point. I thought the compression of full-page images was a huge win and that compression was pretty straightforward, except for the compression algorithm. If the compression algorithm issue is resolved, can we move forward with the full-page compression patch? Discussion of the full-page compression patch properly belongs on that thread rather than this one. However, based on what we've discovered so far here, I won't be very surprised if that patch turns out to have serious problems with CPU consumption. The evidence from this thread suggests that making even relatively lame attempts at compression is extremely costly in terms of CPU overhead. Now, the issues with straight-up compression are somewhat different than for delta compression and, in particular, it's easier to bail out of straight-up compression sooner if things aren't working out. But even with all that, I expect it to be not too difficult to find cases where some compression is achieved but with a dramatic increase in runtime on CPU-bound workloads. Which is basically the same problem this patch has. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Mon, Feb 10, 2014 at 10:02 AM, Amit Kapila amit.kapil...@gmail.com wrote: I think if we want to change the LZ format, it will be a bit more work, and verification of decoding has to be done much more strenuously. I don't think it'll be that big of a deal. And anyway, the evidence here suggests that we still need more speed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 10:57:57AM -0800, Peter Geoghegan wrote: On Wed, Feb 5, 2014 at 12:50 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I think there's zero overlap. They're completely complementary features. It's not like normal WAL records have an irrelevant volume. Correct. Compressing a full-page image happens on the first update after a checkpoint, and the diff between old and new tuple is not used in that case. Uh, I really just meant that one thing that might overlap is considerations around the choice of compression algorithm. I think that there was some useful discussion of that on the other thread as well. Yes, that was my point. I thought the compression of full-page images was a huge win and that compression was pretty straightforward, except for the compression algorithm. If the compression algorithm issue is resolved, can we move forward with the full-page compression patch? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god.
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 11, 2014 at 10:07 PM, Bruce Momjian br...@momjian.us wrote: On Wed, Feb 5, 2014 at 10:57:57AM -0800, Peter Geoghegan wrote: On Wed, Feb 5, 2014 at 12:50 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I think there's zero overlap. They're completely complementary features. It's not like normal WAL records have an irrelevant volume. Correct. Compressing a full-page image happens on the first update after a checkpoint, and the diff between old and new tuple is not used in that case. Uh, I really just meant that one thing that might overlap is considerations around the choice of compression algorithm. I think that there was some useful discussion of that on the other thread as well. Yes, that was my point. I thought the compression of full-page images was a huge win and that compression was pretty straightforward, except for the compression algorithm. If the compression algorithm issue is resolved, By issue, I assume you mean which compression algorithm is best for this patch. For this patch, we currently have two algorithms for which results have been posted. As far as I understand, Heikki is pretty sure that the latest algorithm used for this patch (compression using a prefix-suffix match between the old and new tuple) is better than the other algorithm in terms of CPU gain or overhead. The performance data I took for the worst case shows there is a CPU overhead for this algorithm as well. OTOH the other algorithm (compression using the old tuple as history) can be a bigger win in terms of I/O reduction in more cases. In short, it is still not decided which algorithm to choose, and whether it can be enabled by default or whether it is better to have a table-level switch to enable/disable it. So I think the decision to be taken here is about the points below: 1. Are we okay with I/O reduction at the expense of CPU for *worst* cases and I/O reduction without impacting CPU (better overall tps) for *favourable* cases? 2.
If we are not okay with the worst-case behaviour, can we provide a table-level switch, so that it can be decided by the user? 3. If neither of the above, is there any other way to mitigate the worst-case behaviour, or shall we just reject this patch and move on? Given a choice, I would like to go with option 2, because I think for most UPDATE statements the old and new tuples will contain the same data except for some part of the tuple (generally columns holding large text data are not modified), so we will mostly end up in favourable cases; and for the worst cases we surely don't want users to suffer the CPU overhead, so a table-level switch is also required. I think here one might argue that for some users it is not feasible to decide whether their tuple data for UPDATE is going to be similar or completely different, and they are not at all ready for any risk of CPU overhead, but they would be happy to see I/O reduction; in that case it is difficult to decide what the value of the table-level switch should be. Here I think the only answer is that nothing is free in this world, so either make sure about the application's behaviour for UPDATE statements before going to production, or just don't enable this switch and be happy with the current behaviour. On the other side, there will be users who are pretty certain about their usage of UPDATE statements, or at least are ready to evaluate their application if they can get such a huge gain, so it would be quite a useful feature for such users. Can we move forward with the full-page compression patch? In my opinion, it is not certain that whatever compression algorithm gets decided for this patch (if any) can be directly used for full-page compression; some ideas could be used, or maybe the algorithm could be tweaked a bit to make it usable for full-page compression. With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Feb 6, 2014 at 5:57 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Feb 6, 2014 at 9:13 AM, Amit Kapila amit.kapil...@gmail.com wrote: Considering the above change as correct, I have tried to see the worst-case overhead for this patch by constructing a new tuple such that after a 25% or so suffix/prefix match there is a small change in the tuple, keeping the rest of the tuple the same as the old one, and it shows overhead for this patch as well. Updated test script is attached.

Unpatched

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     348843824 | 5.56866788864136
 ten long fields, 8 bytes changed |     348844800 | 5.84434294700623
 ten long fields, 8 bytes changed |         35050 | 5.92329406738281
(3 rows)

wal-update-prefix-suffix-encode-1.patch

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     348845624 | 6.92243480682373
 ten long fields, 8 bytes changed |     348847000 | 8.35828399658203
 ten long fields, 8 bytes changed |     350204752 | 7.61826491355896
(3 rows)

One minor point: can we avoid emitting the prefix tag if prefixlen is 0?

+ /* output prefix as a tag */
+ pgrb_out_tag(ctrlp, ctrlb, ctrl, bp, prefixlen, hlen);

I think generating the out tag for suffix/prefix has one bug, i.e. it doesn't consider the max match length of 273 bytes (PGLZ_MAX_MATCH) which is mandatory for the LZ format. One more point about this patch: in function pgrb_delta_encode(), is it mandatory to return at the end in the below check:

if (result_size > result_max)
    return false;

I mean to say that before starting to copy literal bytes we have already ensured that the compression is greater than 25%, so maybe we can avoid this check. I have tried taking the data with this check removed and found that it reduces overhead and improves WAL reduction as well.
The data is as below (compare this with the data in the above mail for the unpatched version):

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     300705552 | 6.51416897773743
 ten long fields, 8 bytes changed |     300703816 | 6.85267090797424
 ten long fields, 8 bytes changed |     300701840 | 7.15832996368408
(3 rows)

If we want to go with this approach, then I think apart from the above points there is no major change required (maybe some comments, function names etc. can be improved). But if we really just want to do prefix/suffix compression, this is a crappy and expensive way to do it. We needn't force everything through the pglz tag format just because we elide a common prefix or suffix. Here, are you bothered about the below code, where the patch does a byte-by-byte copy after the prefix/suffix match?

/* output bytes between prefix and suffix as literals */
dp = &source[prefixlen];
dend = &source[slen - suffixlen];
while (dp < dend)
{
    pgrb_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
    dp++;                       /* Do not do this ++ in the line above! */
}

I think if we want to change the LZ format, it will be a bit more work, and verification of decoding has to be done much more strenuously. Note - During the performance tests, I have focussed mainly on the worst case, because we already know that this idea is good for the best and average cases. However, if we decide that this is better and good to proceed with, I can take the data for the other cases as well. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Feb 6, 2014 at 9:13 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Wed, Feb 5, 2014 at 8:50 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/05/2014 04:48 PM, Amit Kapila wrote: I have done one test where there is a large suffix match, but not large enough that it can compress more than 75% of the string; the CPU overhead with wal-update-prefix-suffix-encode-1.patch is not much, but there is no I/O reduction either. Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently I got that backwards in the patch... So if I understand the code correctly, the new check should be

if (prefixlen + suffixlen < (slen * need_rate) / 100)
    return false;

rather than

if (slen - prefixlen - suffixlen < (slen * need_rate) / 100)
    return false;

Considering the above change as correct, I have tried to see the worst-case overhead for this patch by constructing a new tuple such that after a 25% or so suffix/prefix match there is a small change in the tuple, keeping the rest of the tuple the same as the old one, and it shows overhead for this patch as well. Updated test script is attached.

Unpatched

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     348843824 | 5.56866788864136
 ten long fields, 8 bytes changed |     348844800 | 5.84434294700623
 ten long fields, 8 bytes changed |         35050 | 5.92329406738281
(3 rows)

wal-update-prefix-suffix-encode-1.patch

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     348845624 | 6.92243480682373
 ten long fields, 8 bytes changed |     348847000 | 8.35828399658203
 ten long fields, 8 bytes changed |     350204752 | 7.61826491355896
(3 rows)

One minor point: can we avoid emitting the prefix tag if prefixlen is 0?

+ /* output prefix as a tag */
+ pgrb_out_tag(ctrlp, ctrlb, ctrl, bp, prefixlen, hlen);

With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com wal-update-testsuite.sh Description: Bourne shell script
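The inverted check Heikki describes is easy to see with small numbers. Below is a sketch of both predicates with hypothetical helper names; `need_rate` is the minimum percentage of the old tuple that the combined prefix/suffix match must cover for compression to proceed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Original (inverted) test: bails out when the UNMATCHED middle is
 * small, i.e. exactly when compression would pay off the most. */
static bool
buggy_threshold(int32_t slen, int32_t prefixlen, int32_t suffixlen,
                int need_rate)
{
    return !(slen - prefixlen - suffixlen < (slen * need_rate) / 100);
}

/* Corrected test: bails out when the MATCHED portion of the old tuple
 * falls below need_rate percent. */
static bool
fixed_threshold(int32_t slen, int32_t prefixlen, int32_t suffixlen,
                int need_rate)
{
    return !(prefixlen + suffixlen < (slen * need_rate) / 100);
}
```

With slen = 100, prefixlen = 90 and suffixlen = 5 (an excellent candidate), the original form refuses to compress while the corrected form accepts it; conversely, the original form accepts a tuple with almost nothing in common.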
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 8:56 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Wed, Feb 5, 2014 at 5:13 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/05/2014 07:54 AM, Amit Kapila wrote: That's not the worst case, by far. First, note that the skipping while scanning the new tuple is only performed in the first loop. That means that as soon as you have a single match, you fall back to hashing every byte. So for the worst case, put one 4-byte field as the first column, and don't update it. Also, I suspect the runtimes in your test were dominated by I/O. When I scale down the number of rows involved so that the whole test fits in RAM, I get much bigger differences with and without the patch. You might also want to turn off full_page_writes, to make the effect clear with less data. So with this test, the overhead is very significant. With the skipping logic, another kind of worst case is that you have a lot of similarity between the old and new tuple, but you miss it because you skip. This is exactly the reason why I have not kept the skipping logic in the second pass (loop), but I think maybe it would have been better to keep it, just not as aggressive as in the first pass. I have tried merging pass-1 and pass-2 while keeping the skipping logic the same, and it has reduced the overhead to a good extent, but not completely for the new case you have added. This change is to check whether it can reduce the overhead; if we want to proceed, maybe we can limit the skip factor so that the chance of skipping some matched data is reduced.
A new version of the patch is attached with this mail.

Unpatched

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |     348842856 | 6.93688106536865
 ten long fields, all changed |     348843672 | 7.53063702583313
 ten long fields, all changed |     352662344 | 7.76640701293945
(3 rows)

pgrb_delta_encoding_v8.patch

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, but one changed |     348848144 | 9.22694897651672
 ten long fields, but one changed |     348841376 | 9.11818099021912
 ten long fields, but one changed |     352963488 | 8.37875485420227
(3 rows)

pgrb_delta_encoding_v9.patch

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, but one changed |     350166320 | 8.84561610221863
 ten long fields, but one changed |     348840728 | 8.45299792289734
 ten long fields, but one changed |     348846656 | 8.34846496582031
(3 rows)

It appears to me that it could be a good idea to merge both patches (prefix-suffix encoding + delta encoding) in such a way that if we get reasonable compression (50% or so) with prefix-suffix alone, we return without doing delta encoding, and if the compression is less than that, we do delta encoding for the rest of the tuple. The reason I think this will be good is that by doing only prefix-suffix we might miss many cases where good compression is possible. If you think it is a viable way, then I can merge both patches and check the results. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com pgrb_delta_encoding_v9.patch Description: Binary data
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 02/04/2014 11:58 PM, Andres Freund wrote: On February 4, 2014 10:50:10 PM CET, Peter Geoghegan p...@heroku.com wrote: On Tue, Feb 4, 2014 at 11:11 AM, Andres Freund and...@2ndquadrant.com wrote: Does this feature relate to compression of WAL page images at all? No. So the obvious question is: where, if anywhere, do the two efforts (this patch, and Fujii's patch) overlap? Does Fujii have any concerns about this patch as it relates to his effort to compress FPIs? I think there's zero overlap. They're completely complementary features. It's not like normal WAL records have an irrelevant volume. Correct. Compressing a full-page image happens on the first update after a checkpoint, and the diff between old and new tuple is not used in that case. Compressing full-page images makes a difference if you're doing random updates across a large table, so that you only update each buffer 1-2 times. This patch will have no effect in that case. And when you update the same page many times between checkpoints, the full-page image is insignificant, and this patch has a big effect. - Heikki
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 02/05/2014 07:54 AM, Amit Kapila wrote: On Tue, Feb 4, 2014 at 11:58 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila amit.kapil...@gmail.com wrote: Now there is approximately 1.4~5% CPU gain for the "hundred tiny fields, half nulled" case. Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch. I have tried to stress the worst case more, as you are thinking of removing the table-level switch, and found that even if we increase the data by approx. 8 times (ten long fields, all changed, each field containing 80 bytes of data), the CPU overhead is still 5%, which clearly shows that the overhead doesn't increase much even if the length of unmatched data is increased by a much larger factor. So the data for the worst case adds more weight to your statement (remove the table-level switch); however, there is no harm in keeping the table-level option with the default as 'true', and if some users are really sure the updates in their system will have nothing in common, they can set this new option to 'false'. Below is the data for the new case "ten long fields, all changed" added in the attached script file: That's not the worst case, by far. First, note that the skipping while scanning the new tuple is only performed in the first loop. That means that as soon as you have a single match, you fall back to hashing every byte. So for the worst case, put one 4-byte field as the first column, and don't update it. Also, I suspect the runtimes in your test were dominated by I/O. When I scale down the number of rows involved so that the whole test fits in RAM, I get much bigger differences with and without the patch. You might also want to turn off full_page_writes, to make the effect clear with less data. So, I came up with the attached worst-case test, modified from your latest test suite.
unpatched:

               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     343385312 | 2.20806908607483
 ten long fields, all but one changed |     336263592 | 2.18997097015381
 ten long fields, all but one changed |     336264504 | 2.17843413352966
(3 rows)

pgrb_delta_encoding_v8.patch:

               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     338356944 | 3.33501315116882
 ten long fields, all but one changed |     344059272 | 3.37364101409912
 ten long fields, all but one changed |     336257840 | 3.36244201660156
(3 rows)

So with this test, the overhead is very significant. With the skipping logic, another kind of worst case is that you have a lot of similarity between the old and new tuple, but you miss it because you skip. For example, if you change the first few columns, but leave a large text column at the end of the tuple unchanged. - Heikki wal-update-testsuite.sh Description: Bourne shell script
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 01/30/2014 08:53 AM, Amit Kapila wrote: On Wed, Jan 29, 2014 at 8:13 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/29/2014 02:21 PM, Amit Kapila wrote: The main reason to process in chunks as much as possible is to save CPU cycles. For example, if we build the hash table byte by byte, then even for the best case, where most of the tuple has a match, it will have reasonable overhead due to the formation of the hash table. Hmm. One very simple optimization we could do is to just compare the two strings byte by byte, before doing anything else, to find any common prefix they might have. Then output a tag for the common prefix, and run the normal algorithm on the rest of the strings. In many real-world tables, the 1-2 first columns are a key that never changes, so that might work pretty well in practice. Maybe it would also be worthwhile to do the same for any common suffix the tuples might have. Is it possible to do this for both prefix and suffix together? Basically, the question I have in mind is: what will be the deciding factor for switching from the hash table mechanism to string comparison mode for the suffix? Do we switch when we find a long enough match? I think you got it backwards. You don't switch from the hash table mechanism to string comparison. You do the prefix/suffix comparison *first*, and run the hash table algorithm only on the middle part, between the common prefix and suffix. Can we do this optimization after the basic version is acceptable? I would actually suggest doing that first. Perhaps even ditch the whole history table approach and do *only* the scan for prefix and suffix. That's very cheap, and already covers a large fraction of UPDATEs that real applications do. In particular, it's optimal for the case that you update only a single column, something like UPDATE foo SET bar = bar + 1. I'm pretty sure the overhead of that would be negligible, so we could always enable it.
There are certainly a lot of scenarios where prefix/suffix detection alone wouldn't help, but so what. Attached is a quick patch for that, if you want to test it.

- Heikki

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e0b8a4e..c4ac2bd 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1014,6 +1014,22 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
     </listitem>
    </varlistentry>

+   <varlistentry>
+    <term><literal>wal_compress_update</> (<type>boolean</>)</term>
+    <listitem>
+     <para>
+      Enables or disables the WAL tuple compression for <command>UPDATE</>
+      on this table. Default value of this option is false to maintain
+      backward compatability for the command. If true, all the update
+      operations on this table which will place the new tuple on same page
+      as it's original tuple will compress the WAL for new tuple and
+      subsequently reduce the WAL volume. It is recommended to enable
+      this option for tables where <command>UPDATE</> changes less than
+      50 percent of tuple data.
+     </para>
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </refsect2>
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index aea9d40..3bf5728 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,6 +60,7 @@
 #include "access/sysattr.h"
 #include "access/tuptoaster.h"
 #include "executor/tuptable.h"
+#include "utils/pg_rbcompress.h"

 /* Does att's datatype allow packing into the 1-byte-header varlena format? */
@@ -617,6 +618,44 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
 	memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
 }

+/*
+ * heap_delta_encode
+ *
+ *		Calculate the delta between two tuples and generate
+ *		encoded wal tuple (EWT), using pgrb. The result is stored
+ *		in *encdata.
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+				  char *encdata, uint32 *enclen)
+{
+	return pgrb_delta_encode(
+		(char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+		(char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+		encdata, enclen, NULL
+		);
+}
+
+/*
+ * heap_delta_decode
+ *
+ *		Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+	pgrb_delta_decode(encdata, enclen,
+		(char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+		&newtup->t_len,
+		(char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
 /*
  * heap_form_tuple
  *		construct a tuple from the
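Heikki's suggestion of peeling off the shared prefix and suffix first, and running any heavier encoding only on the middle, can be sketched as below. This is illustrative only: `delta_encode_middle` is a stand-in that copies the middle verbatim, and the two-length header is invented for the example; it is not the patch's actual WAL record format.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for a history-table encoder; copying the middle verbatim
 * keeps the sketch self-contained. */
static size_t
delta_encode_middle(const char *mid, size_t midlen, char *out)
{
    memcpy(out, mid, midlen);
    return midlen;
}

/*
 * Encode newt against oldt: record the prefix and suffix lengths, then
 * emit only the middle of the new tuple.  The receiver can rebuild the
 * new tuple from the old one plus this record.
 */
static size_t
encode_update(const char *oldt, size_t oldlen,
              const char *newt, size_t newlen, char *out)
{
    size_t  minlen = oldlen < newlen ? oldlen : newlen;
    size_t  p = 0,
            s = 0;

    while (p < minlen && oldt[p] == newt[p])
        p++;
    while (s < minlen - p && oldt[oldlen - s - 1] == newt[newlen - s - 1])
        s++;

    memcpy(out, &p, sizeof p);              /* common prefix length */
    memcpy(out + sizeof p, &s, sizeof s);   /* common suffix length */
    return 2 * sizeof(size_t) +
        delta_encode_middle(newt + p, newlen - p - s,
                            out + 2 * sizeof(size_t));
}
```

An update such as "key=1|payload" to "key=2|payload" then carries only the single changed byte plus the two lengths, which is the cheap case the discussion keeps coming back to.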
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 01/30/2014 08:53 AM, Amit Kapila wrote: Is it possible to do both prefix and suffix together? Basically, the question I have in mind is: what will be the deciding factor for switching from the hash table mechanism to string comparison mode for the suffix? Do we switch when we find a long enough match?

I think you got it backwards. You don't switch from the hash table mechanism to string comparison. You do the prefix/suffix comparison *first*, and run the hash table algorithm only on the middle part, between the common prefix and suffix.

Can we do this optimization after the basic version is acceptable?

I would actually suggest doing that first. Perhaps even ditch the whole history table approach and do *only* the scan for prefix and suffix. That's very cheap, and already covers a large fraction of UPDATEs that real applications do. In particular, it's optimal for the case that you update only a single column, something like UPDATE foo SET bar = bar + 1. I'm pretty sure the overhead of that would be negligible, so we could always enable it. There are certainly a lot of scenarios where prefix/suffix detection alone wouldn't help, but so what. Attached is a quick patch for that, if you want to test it.

I have done one test where there is a large suffix match, but not large enough that it can compress more than 75% of the string; the CPU overhead with wal-update-prefix-suffix-encode-1.patch is not much, but there is no I/O reduction either. However, for the same case there is both a significant WAL reduction and a CPU gain with pgrb_delta_encoding_v8.patch. I have updated ten long fields, all changed such that there is a large suffix match. The updated script is attached.
Unpatched

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    1760986528 | 28.3700430393219
 ten long fields, all changed |    1760981320 |   28.53244805336
 ten long fields, all changed |    1764294992 | 28.6722140312195
(3 rows)

wal-update-prefix-suffix-encode-1.patch

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    1760986016 | 29.4183659553528
 ten long fields, all changed |    1760981904 | 29.7636449337006
 ten long fields, all changed |    1762436104 |  29.508908033371
(3 rows)

pgrb_delta_encoding_v8.patch

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |     733969304 |  23.916286945343
 ten long fields, all changed |     733977040 | 23.6019561290741
 ten long fields, all changed |     737384632 | 24.2645490169525

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

wal-update-testsuite.sh
Description: Bourne shell script

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 5:13 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/05/2014 07:54 AM, Amit Kapila wrote: On Tue, Feb 4, 2014 at 11:58 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila amit.kapil...@gmail.com wrote: Now there is approximately 1.4~5% CPU gain for hundred tiny fields, half nulled case Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch. I have tried to stress on worst case more, as you are thinking to remove table-level switch and found that even if we increase the data by approx. 8 times (ten long fields, all changed, each field contains 80 byte data), the CPU overhead is still 5% which clearly shows that the overhead doesn't increase much even if the length of unmatched data is increased by much larger factor. So the data for worst case adds more weight to your statement (remove table-level switch), however there is no harm in keeping table-level option with default as 'true' and if some users are really sure the updates in their system will have nothing in common, then they can make this new option as 'false'. Below is data for the new case ten long fields, all changed added in attached script file: That's not the worst case, by far. First, note that the skipping while scanning new tuple is only performed in the first loop. That means that as soon as you have a single match, you fall back to hashing every byte. So for the worst case, put one 4-byte field as the first column, and don't update it. Also, I suspect the runtimes in your test were dominated by I/O. When I scale down the number of rows involved so that the whole test fits in RAM, I get much bigger differences with and without the patch. You might also want to turn off full_page_writes, to make the effect clear with less data. So with this test, the overhead is very significant. 
With the skipping logic, another kind of worst case is that you have a lot of similarity between the old and new tuple, but you miss it because you skip.

This is exactly the reason why I have not kept the skipping logic in the second pass (loop), but I think it might have been better to keep it there in a less aggressive form than in the first pass. The basic idea I had in mind is that if we get a match, then there is a high chance that we get matches at consecutive positions.

I think we should see this patch as an I/O reduction feature rather than as a CPU gain/overhead, because the I/O reduction in WAL has other benefits: in transfer for replication, in archiving, in recovery, basically wherever there is a disk read operation, the I/O reduction will amount to less data read, which can be beneficial in many ways. Some time back, I was reading an article on the benefits of compression in databases where the benefits shown are something like what I said above (at least that is what I understood from it). Link to that article: http://db2guys.wordpress.com/2013/08/23/compression/

Another thing is that I think it might be difficult to get negligible overhead for data which is small or non-compressible; that's why it is preferable to have table-level compression enabled with a switch. Is it viable here to find the best way to get I/O reduction for most cases, and provide a switch so that for the worst cases the user can turn it off?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 02/05/2014 04:48 PM, Amit Kapila wrote: I have done one test where there is a large suffix match, but not large enough that it can compress more than 75% of the string; the CPU overhead with wal-update-prefix-suffix-encode-1.patch is not much, but there is no I/O reduction either.

Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently I got that backwards in the patch...

- Heikki
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 8:50 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 02/05/2014 04:48 PM, Amit Kapila wrote: I have done one test where there is a large suffix match, but not large enough that it can compress more than 75% of the string; the CPU overhead with wal-update-prefix-suffix-encode-1.patch is not much, but there is no I/O reduction either.

Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently I got that backwards in the patch...

Okay, I think that is right; maybe I can change that check to see the difference. But in general, isn't it going to lose compression in many more cases, for example when there is less than a 25% match in the prefix/suffix, but more than a 50% match in the middle of the string? While debugging, I noticed that it compresses less than the history table approach for general cases where the update is done internally, like for Truncate table.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 12:50 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

I think there's zero overlap. They're completely complementary features. It's not like normal WAL records have an irrelevant volume.

Correct. Compressing a full-page image happens on the first update after a checkpoint, and the diff between the old and new tuple is not used in that case.

Uh, I really just meant that one thing that might overlap is considerations around the choice of compression algorithm. I think that there was some useful discussion of that on the other thread as well.

--
Peter Geoghegan
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 6:59 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

Attached is a quick patch for that, if you want to test it.

But if we really just want to do prefix/suffix compression, this is a crappy and expensive way to do it. We needn't force everything through the pglz tag format just because we elide a common prefix or suffix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 8:50 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 02/05/2014 04:48 PM, Amit Kapila wrote: I have done one test where there is a large suffix match, but not large enough that it can compress more than 75% of the string; the CPU overhead with wal-update-prefix-suffix-encode-1.patch is not much, but there is no I/O reduction either.

Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently I got that backwards in the patch...

So if I understand the code correctly, the new check should be

    if (prefixlen + suffixlen < (slen * need_rate) / 100)
        return false;

rather than

    if (slen - prefixlen - suffixlen < (slen * need_rate) / 100)
        return false;

Please confirm; otherwise any validation of this might not be useful.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Feb 5, 2014 at 6:43 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

So, I came up with the attached worst case test, modified from your latest test suite.

unpatched:

               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     343385312 | 2.20806908607483
 ten long fields, all but one changed |     336263592 | 2.18997097015381
 ten long fields, all but one changed |     336264504 | 2.17843413352966
(3 rows)

pgrb_delta_encoding_v8.patch:

               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     338356944 | 3.33501315116882
 ten long fields, all but one changed |     344059272 | 3.37364101409912
 ten long fields, all but one changed |     336257840 | 3.36244201660156
(3 rows)

So with this test, the overhead is very significant.

Yuck. Well, that sucks.

With the skipping logic, another kind of worst case is that you have a lot of similarity between the old and new tuple, but you miss it because you skip. For example, if you change the first few columns, but leave a large text column at the end of the tuple unchanged.

I suspect there's no way to have our cake and eat it, too. Most of the work that Amit has done on this patch in the last few revs is to cut back CPU overhead in the cases where the patch can't help because the tuple has been radically modified. If we're trying to get maximum compression, we need to go the other way: for example, we could just feed both the old and new tuples through pglz (or snappy, or whatever). That would allow us to take advantage not only of similarity between the old and new tuples but also internal duplication within either the old or the new tuple, but it would also cost more CPU. The concern with minimizing overhead in cases where the compression doesn't help has thus far pushed us in the opposite direction, namely passing over compression opportunities that a more aggressive algorithm could find in order to keep the overhead low.
Off-hand, I'm wondering why we shouldn't apply the same skipping algorithm that Amit is using at the beginning of the string for the rest of it as well. It might be a little too aggressive (maybe the skip distance shouldn't increase by quite as much as doubling every time, or not beyond 16/32 bytes?) but I don't see why the general principle isn't sound wherever we are in the tuple. Unfortunately, despite changing things to make a history entry only every 4th character, building the history is still pretty expensive. By the time we even begin looking at the tuple we're gonna compress, we've already spent something like half the total effort, and of course we have to go further than that before we know whether our attempt to compress is actually going anywhere. I think that's the central problem here. pglz has several safeguards to ensure that it doesn't do too much work in vain: we abort if we find nothing compressible within first_success_by bytes, or if we emit enough total output to be certain that we won't meet the need_rate threshold. Those safeguards are a lot less effective here because they can't be applied until *after* we've already paid the cost of building the history. If we could figure out some way to apply those guards, or other guards, earlier in the algorithm, we could do a better job mitigating the worst-case scenarios, but I don't have a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 06/02/14 16:59, Robert Haas wrote:
[...]

Surely the weighting should be done according to the relative scarcity of processing power vs I/O bandwidth? I get the impression that different workloads and hardware configurations may favour conserving one of processor or I/O resources. Would it be feasible to have different logic, depending on the trade-offs identified?

Cheers,
Gavin
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila amit.kapil...@gmail.com wrote: Now there is approximately 1.4~5% CPU gain for hundred tiny fields, half nulled case I don't want to advocate too strongly for this patch because, number one, Amit is a colleague and more importantly, number two, I can't claim to be an expert on compression. But that having been said, I think these numbers are starting to look awfully good. The only remaining regressions are in the cases where a large fraction of the tuple turns over, and they're not that big even then. The two *worst* tests now seem to be hundred tiny fields, all changed and hundred tiny fields, half changed. For the all changed case, the median unpatched time is 16.3172590732574 and the median patched time is 16.9294109344482, a 4% loss; for the half changed case, the median unpatched time is 16.5795118808746 and the median patched time is 17.0454230308533, a 3% loss. Both cases show minimal change in WAL volume. Meanwhile, in friendlier cases, like one short and one long field, no change, we're seeing big improvements. That particular case shows a speedup of 21% and a WAL reduction of 36%. That's a pretty big deal, and I think not unrepresentative of many real-world workloads. Some might well do better, having either more or longer unchanged fields. Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 01:28:38PM -0500, Robert Haas wrote: Meanwhile, in friendlier cases, like one short and one long field, no change, we're seeing big improvements. That particular case shows a speedup of 21% and a WAL reduction of 36%. That's a pretty big deal, and I think not unrepresentative of many real-world workloads. Some might well do better, having either more or longer unchanged fields. Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch. Does this feature relate to compression of WAL page images at all? -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 2014-02-04 14:09:57 -0500, Bruce Momjian wrote:
[...]
Does this feature relate to compression of WAL page images at all?

No.

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 08:11:18PM +0100, Andres Freund wrote:
[...]
Does this feature relate to compression of WAL page images at all?

No.

I guess it bothers me that we are working on compressing row change sets while the majority(?) of WAL is page images. I know we had a page-image compression patch that got stalled.

--
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ Everyone has their own god. +
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 11:11 AM, Andres Freund and...@2ndquadrant.com wrote:

Does this feature relate to compression of WAL page images at all?

No.

So the obvious question is: where, if anywhere, do the two efforts (this patch, and Fujii's patch) overlap? Does Fujii have any concerns about this patch as it relates to his effort to compress FPIs?

--
Peter Geoghegan
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On February 4, 2014 10:50:10 PM CET, Peter Geoghegan p...@heroku.com wrote:

On Tue, Feb 4, 2014 at 11:11 AM, Andres Freund and...@2ndquadrant.com wrote: Does this feature relate to compression of WAL page images at all? No.

So the obvious question is: where, if anywhere, do the two efforts (this patch, and Fujii's patch) overlap? Does Fujii have any concerns about this patch as it relates to his effort to compress FPIs?

I think there's zero overlap. They're completely complementary features. It's not like normal WAL records have an irrelevant volume.

Andres

--
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 1:58 PM, Andres Freund and...@2ndquadrant.com wrote:

I think there's zero overlap. They're completely complementary features. It's not like normal WAL records have an irrelevant volume.

I'd have thought so too, but I would not like to assume. Like many people commenting on this thread, I don't know very much about compression.

--
Peter Geoghegan
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Feb 4, 2014 at 11:58 PM, Robert Haas robertmh...@gmail.com wrote:

On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila amit.kapil...@gmail.com wrote: Now there is approximately 1.4~5% CPU gain for the "hundred tiny fields, half nulled" case

Assuming that the logic isn't buggy, a point in need of further study, I'm starting to feel like we want to have this. And I might even be tempted to remove the table-level off switch.

I have tried to stress the worst case more, as you are thinking of removing the table-level switch, and found that even if we increase the data by approx. 8 times (ten long fields, all changed, each field containing 80 bytes of data), the CPU overhead is still 5%, which clearly shows that the overhead doesn't increase much even if the length of unmatched data is increased by a much larger factor. So the data for the worst case adds more weight to your statement (remove the table-level switch); however, there is no harm in keeping the table-level option with default 'true', and if some users are really sure the updates in their system will have nothing in common, they can set this new option to 'false'. Below is data for the new case "ten long fields, all changed" added in the attached script file:

Unpatched

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    3473999520 | 45.0375978946686
 ten long fields, all changed |    3473999864 | 45.2536928653717
 ten long fields, all changed |    3474006880 | 45.1887288093567

After pgrb_delta_encoding_v8.patch
----------------------------------

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    3474006456 | 47.5744359493256
 ten long fields, all changed |    3474000136 | 47.3830440044403
 ten long fields, all changed |    3474002688 | 46.9923310279846

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

wal-update-testsuite.sh
Description: Bourne shell script
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Fri, Jan 31, 2014 at 12:33 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Jan 30, 2014 at 12:23 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Wed, Jan 29, 2014 at 8:13 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

After basic verification of back-to-pglz-like-delta-encoding-1, I will take the data with both the patches and report the same.

I have corrected the problems reported in back-to-pglz-like-delta-encoding-1, removed hindex from pgrb_delta_encoding_v6, and attached are new versions of both patches.

I/O Reduction Data
------------------
Non-default settings:
autovacuum = off
checkpoint_segments = 256
checkpoint_timeout = 15min

Observations
1. With both patches, the WAL reduction is similar: ~37% for "one short and one long field, no change" and 12% for "hundred tiny fields, half nulled".
2. With pgrb_delta_encoding_v7, there is ~19% CPU reduction for the best case, "one short and one long field, no change".
3. With pgrb_delta_encoding_v7, there is approximately 8~9% overhead for cases where there is no match.
4. With pgrb_delta_encoding_v7, there is approximately 15~18% overhead for the "hundred tiny fields, half nulled" case.
5. With back-to-pglz-like-delta-encoding-2, the data is mostly similar, except for "hundred tiny fields, half nulled", where the CPU overhead is much higher.

The case ("hundred tiny fields, half nulled") where CPU overhead is visible is due to repetitive data; if we take some random or different data, it will not be there.
To verify this theory, I have added one new test which is similar to "hundred tiny fields, half nulled"; the difference is that it uses a non-repetitive string. The results are below:

Unpatched
---------
                       testname                       | wal_generated |     duration
------------------------------------------------------+---------------+------------------
 nine short and one long field, thirty percent change |     698912496 | 12.1819660663605
 nine short and one long field, thirty percent change |     698906048 | 11.9409539699554
 nine short and one long field, thirty percent change |     698910904 | 11.9367880821228

Patch pgrb_delta_encoding_v7
----------------------------
                       testname                       | wal_generated |     duration
------------------------------------------------------+---------------+------------------
 nine short and one long field, thirty percent change |     559840840 | 11.6027710437775
 nine short and one long field, thirty percent change |     559829440 | 11.8239741325378
 nine short and one long field, thirty percent change |     560141352 | 11.6789472103119

Patch back-to-pglz-like-delta-encoding-2
----------------------------------------
                       testname                       | wal_generated |     duration
------------------------------------------------------+---------------+------------------
 nine short and one long field, thirty percent change |     544391432 | 12.3666560649872
 nine short and one long field, thirty percent change |     544378616 | 11.8833730220795
 nine short and one long field, thirty percent change |     544376888 | 11.9487581253052
(3 rows)

The basic idea of the new test is that some part of the tuple is unchanged and the other part is changed; here the unchanged part contains a random string rather than a repetitive set of chars. The new test is added with the other tests in the attached file.

Observation
-----------
LZ-like delta encoding has more WAL reduction, and chunk-wise encoding has slightly better CPU usage, but overall both are almost similar. I think the main reason for the overhead is that we store the last offset of matching data at the front of the history, so during a match it has to traverse back many times to find the longest possible match; in the real world it won't be the case that most history entries contain the same hash index, so it should not have much effect.
If we want to improve CPU usage for cases like "hundred tiny fields, half nulled" (which I think is not important), forming the history table by traversing from the end rather than from the beginning can serve the purpose. I have not tried it, but I think it can certainly help.

Do you think the overall data is acceptable?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

wal-update-testsuite.sh
Description: Bourne shell script
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Jan 30, 2014 at 12:23 PM, Amit Kapila amit.kapil...@gmail.com wrote:
On Wed, Jan 29, 2014 at 8:13 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

A few observations on patch (back-to-pglz-like-delta-encoding-1):

1.
+#define pgrb_hash_unroll(_p, hindex) \
+	hindex = hindex ^ ((_p)[0] << 8)

Shouldn't it shift by 6 rather than by 8?

2.
+		if (bp - bstart >= result_max)
+			return false;

I think for a no-match or lesser-match case it will traverse the whole tuple. Can we optimize so that if there is no match till 75%, we bail out? Ideally, I think if we don't find any match in the first 50 to 60%, we should come out.

3. pg_rbcompress.h is missing.

I am still working on patch back-to-pglz-like-delta-encoding-1 to see if it works well for all cases, but thought of sharing what I have done till now. After basic verification of back-to-pglz-like-delta-encoding-1, I will take the data with both the patches and report the same.

Apart from the above, the only other thing which I found problematic is the below code in find match:

+		while (*ip == *hp && thislen < maxlen)
+			thislen++;

It should be:

		while (*ip++ == *hp++ && thislen < maxlen)

Please confirm.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
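The corrected inner loop suggested above can be sketched as a small standalone function. This is only an illustration of the fix (both pointers must advance on every compared byte); the function name and signature are hypothetical, not the patch's actual code:

```c
#include <assert.h>

/* Sketch of the find-match inner loop after the suggested fix: the
 * input pointer and the history pointer advance together, so the
 * comparison really walks both strings instead of re-reading the
 * same two bytes forever. */
static int
pgrb_match_length(const unsigned char *ip, const unsigned char *hp, int maxlen)
{
	int			thislen = 0;

	/* advance both input and history pointers on each matched byte */
	while (thislen < maxlen && *ip++ == *hp++)
		thislen++;

	return thislen;
}
```

With the original (unfixed) form, `ip` and `hp` never move, so the loop either exits immediately or spins until `thislen` reaches `maxlen` on a single byte pair.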
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 01/28/2014 07:01 PM, Heikki Linnakangas wrote:
On 01/27/2014 07:03 PM, Amit Kapila wrote:

I have tried to improve the algorithm in another way so that we can get the benefit of same chunks during find match (something similar to LZ). The main change is to consider chunks at a fixed boundary (4 bytes) and, after finding a match, try to find if there is a longer match than the current chunk. While finding a longer match, it still takes care that the next bigger match should be at a chunk boundary. I am not completely sure about the chunk boundary; maybe 8 or 16 can give better results.

Since you're only putting a value in the history every 4 bytes, you wouldn't need to calculate the hash in a rolling fashion. You could just take the next four bytes, calculate the hash, put it in the history table. Then the next four bytes, calculate the hash, and so on. Might save some cycles when building the history table...

On a closer look, you're putting a chunk in the history table only every four bytes, but you're *also* checking the history table for a match only every four bytes. That completely destroys the shift-resistance of the algorithm. For example, if the new tuple is an exact copy of the old tuple, except for one additional byte in the beginning, the algorithm would fail to recognize that. It would be good to add a test case like that in the test suite.

You can skip bytes when building the history table, or when finding matches, but not both. Or you could skip N bytes, and then check for matches for the next four bytes, then skip again and so forth, as long as you always check four consecutive bytes (because the hash function is calculated from four bytes).

I couldn't resist the challenge, and started hacking this. First, some observations from your latest patch (pgrb_delta_encoding_v5.patch):

1. There are a lot of comments and code that refer to chunks, which seem obsolete. For example, the ck_size field in PGRB_HistEntry is always set to a constant, 4, except maybe at the very end of the history string.
The algorithm has nothing to do with Rabin-Karp anymore.

2. The 'hindex' field in PGRB_HistEntry is unused. Also, ck_start_pos is redundant with the index of the entry in the array, because the array is filled in order. That leaves just the 'next' field, and that can be represented as an int16 rather than a pointer. So, we really only need a simple int16 array as the history entries array.

3. You're not gaining much by calculating the hash in a rolling fashion. A single rolling step requires two multiplications and two sums, plus shifting the variables around. Calculating the hash from scratch requires three multiplications and three sums.

4. Given that we're not doing the Rabin-Karp variable-length chunk thing, we could use a cheaper hash function to begin with. Like, the one used in pglz. The multiply-by-prime method probably gives fewer collisions than pglz's shift/xor method, but I don't think that matters as much as computation speed. No-one has mentioned or measured the effect of collisions in this thread; that either means it's a non-issue, or that no-one has realized yet how big a problem it is. I'm guessing that it's not a problem, and if it is, it's mitigated by only trying to find matches every N bytes; collisions would make finding matches slower, and that's exactly what skipping helps with.

After addressing the above, we're pretty much back to the pglz approach. I kept the change to only find matches every four bytes; that does make some difference. And I like having this new encoding code in a separate file, not mingled with the pglz stuff; it's sufficiently different that that's better.

I haven't done all that much testing with this, so take it with a grain of salt. I don't know if this is better or worse than the other patches that have been floated around, but I thought I might as well share it.
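The simplified history layout described in observation 2 above (a hash bucket array plus a plain int16 "next" array, where an entry's array index is its position in the source string) could be sketched roughly like this. All names, sizes, and the hash function are illustrative assumptions, not the patch's actual code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HIST_SIZE	4096		/* hash bucket count (power of two); illustrative */
#define MAX_HISTORY	2048		/* one entry per source position; illustrative */

/* Bucket heads and chain links.  An entry's index in hist_next IS its
 * start position in the source string, so no start-position field and
 * no pointers are needed: a plain int16 array suffices. */
static int16_t hash_heads[HIST_SIZE];
static int16_t hist_next[MAX_HISTORY];

/* Cheap pglz-style shift/xor hash of four bytes. */
static uint32_t
hash4(const unsigned char *p)
{
	return ((p[0] << 6) ^ (p[1] << 4) ^ (p[2] << 2) ^ p[3]) & (HIST_SIZE - 1);
}

static void
build_history(const unsigned char *src, int len)
{
	int			pos;

	memset(hash_heads, 0xFF, sizeof(hash_heads));	/* set all heads to -1 */

	for (pos = 0; pos + 4 <= len && pos < MAX_HISTORY; pos++)
	{
		uint32_t	h = hash4(src + pos);

		/* Prepend: the newest (largest) offset sits at the chain head. */
		hist_next[pos] = hash_heads[h];
		hash_heads[h] = (int16_t) pos;
	}
}
```

Walking a bucket's chain via `hist_next` then visits candidate match positions newest-first, which is the traversal-order behaviour Amit discusses elsewhere in the thread.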
- Heikki

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e0b8a4e..c4ac2bd 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1014,6 +1014,22 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>wal_compress_update</> (<type>boolean</>)</term>
+    <listitem>
+     <para>
+      Enables or disables the WAL tuple compression for <command>UPDATE</>
+      on this table. The default value of this option is false, to maintain
+      backward compatibility for the command. If true, all the update
+      operations on this table which place the new tuple on the same page
+      as its original tuple will compress the WAL for the new tuple and
+      subsequently reduce the WAL volume. It is recommended to enable
+      this option for tables where <command>UPDATE</> changes less than
+      50 percent of tuple data.
+     </para>
+    </listitem>
+   </varlistentry>
+
  </variablelist>
 </refsect2>
diff --git
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Jan 29, 2014 at 3:41 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
On 01/28/2014 07:01 PM, Heikki Linnakangas wrote:
On 01/27/2014 07:03 PM, Amit Kapila wrote:

I have tried to improve the algorithm in another way so that we can get the benefit of same chunks during find match (something similar to LZ). The main change is to consider chunks at a fixed boundary (4 bytes) and, after finding a match, try to find if there is a longer match than the current chunk. While finding a longer match, it still takes care that the next bigger match should be at a chunk boundary. I am not completely sure about the chunk boundary; maybe 8 or 16 can give better results.

Since you're only putting a value in the history every 4 bytes, you wouldn't need to calculate the hash in a rolling fashion. You could just take the next four bytes, calculate the hash, put it in the history table. Then the next four bytes, calculate the hash, and so on. Might save some cycles when building the history table...

First of all, thanks for looking into the patch. Yes, this is right; we can save cycles by not doing the rolling hash calculation, and I was working to improve the patch along those lines. Earlier it was there because of Rabin's delta encoding, where we need to check for a special match after each byte.

On a closer look, you're putting a chunk in the history table only every four bytes, but you're *also* checking the history table for a match only every four bytes. That completely destroys the shift-resistance of the algorithm.

You are right that it will lose the shift-resistance, and even Robert has pointed this out to me; that's why he wants to maintain the property of special bytes at chunk boundaries as mentioned in Rabin encoding. The only real reason to shift to a fixed size was to improve CPU usage, and I thought most UPDATE cases will change fixed-length columns, but that might not be true.
For example, if the new tuple is an exact copy of the old tuple, except for one additional byte in the beginning, the algorithm would fail to recognize that. It would be good to add a test case like that in the test suite. You can skip bytes when building the history table, or when finding matches, but not both. Or you could skip N bytes, and then check for matches for the next four bytes, then skip again and so forth, as long as you always check four consecutive bytes (because the hash function is calculated from four bytes).

Can we do something like:

Build Phase
a. Calculate the hash and add the entry in the history table at every 4 bytes.

Match Phase
a. Calculate the hash in a rolling fashion and try to find a match at every byte.
b. When a match is found, then skip only in chunks, something like I was doing in the find match function:

+
+			/* consider only complete chunk matches. */
+			if (history_chunk_size == 0)
+				thislen += PGRB_MIN_CHUNK_SIZE;
+		}

Will this address the concern? The main reason to process in chunks as much as possible is to save CPU cycles. For example, if we build the hash table byte-by-byte, then even for the best case where most of the tuple has a match, it will have reasonable overhead due to formation of the hash table.

I couldn't resist the challenge, and started hacking this. First, some observations from your latest patch (pgrb_delta_encoding_v5.patch):

1. There are a lot of comments and code that refer to chunks, which seem obsolete. For example, the ck_size field in PGRB_HistEntry is always set to a constant, 4, except maybe at the very end of the history string. The algorithm has nothing to do with Rabin-Karp anymore.

2. The 'hindex' field in PGRB_HistEntry is unused. Also, ck_start_pos is redundant with the index of the entry in the array, because the array is filled in order. That leaves just the 'next' field, and that can be represented as an int16 rather than a pointer. So, we really only need a simple int16 array as the history entries array.

3.
You're not gaining much by calculating the hash in a rolling fashion. A single rolling step requires two multiplications and two sums, plus shifting the variables around. Calculating the hash from scratch requires three multiplications and three sums.

4. Given that we're not doing the Rabin-Karp variable-length chunk thing, we could use a cheaper hash function to begin with. Like, the one used in pglz. The multiply-by-prime method probably gives fewer collisions than pglz's shift/xor method, but I don't think that matters as much as computation speed. No-one has mentioned or measured the effect of collisions in this thread; that either means it's a non-issue, or that no-one has realized yet how big a problem it is. I'm guessing that it's not a problem, and if it is, it's mitigated by only trying to find matches every N bytes; collisions would make finding matches slower, and that's exactly what skipping helps with.

After addressing the above, we're pretty much back to the pglz approach.

Here during the match phase, I think we can avoid copying literal bytes until a match is found; that can save cycles for cases when the old and new tuple are mostly different.
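The cost comparison in observation 3 above (a rolling step: two multiplications and two sums; from scratch: three multiplications and three sums) can be made concrete with a multiply-by-prime hash over a 4-byte window. The prime and the function names are illustrative assumptions, not the patch's actual constants:

```c
#include <assert.h>
#include <stdint.h>

#define ROLL_PRIME	31u			/* illustrative prime, not the patch's */

/* Hash of four consecutive bytes, computed from scratch:
 * three multiplications, three sums. */
static uint32_t
hash4_scratch(const unsigned char *p)
{
	return ((p[0] * ROLL_PRIME + p[1]) * ROLL_PRIME + p[2]) * ROLL_PRIME + p[3];
}

/* One rolling step: drop byte 'out', add byte 'in'.
 * Two multiplications, two sums (out_factor is a compile-time
 * constant, ROLL_PRIME^3, the weight of the byte leaving the window). */
static uint32_t
hash4_roll(uint32_t h, unsigned char out, unsigned char in)
{
	const uint32_t out_factor = ROLL_PRIME * ROLL_PRIME * ROLL_PRIME;

	return (h - out * out_factor) * ROLL_PRIME + in;
}
```

Since the two variants cost nearly the same per position, only a scheme that hashes far fewer positions (e.g. every Nth byte from scratch) actually saves cycles, which is the point Heikki is making.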
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 01/29/2014 02:21 PM, Amit Kapila wrote:
On Wed, Jan 29, 2014 at 3:41 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

For example, if the new tuple is an exact copy of the old tuple, except for one additional byte in the beginning, the algorithm would fail to recognize that. It would be good to add a test case like that in the test suite. You can skip bytes when building the history table, or when finding matches, but not both. Or you could skip N bytes, and then check for matches for the next four bytes, then skip again and so forth, as long as you always check four consecutive bytes (because the hash function is calculated from four bytes).

Can we do something like:

Build Phase
a. Calculate the hash and add the entry in the history table at every 4 bytes.

Match Phase
a. Calculate the hash in a rolling fashion and try to find a match at every byte.

Sure, that'll work. However, I believe it's cheaper to add entries to the history table at every byte, and check for a match every 4 bytes. I think you'll get more or less the same level of compression either way, but adding to the history table is cheaper than checking for matches, and we'd rather do the cheap thing more often than the expensive thing.

b. When a match is found, then skip only in chunks, something like I was doing in the find match function:

+
+			/* consider only complete chunk matches. */
+			if (history_chunk_size == 0)
+				thislen += PGRB_MIN_CHUNK_SIZE;
+		}

Will this address the concern?

Hmm, so when checking if a match is truly a match, you compare the strings four bytes at a time rather than byte-by-byte? That might work, but I don't think that's a hot spot currently. In the profiling I did, with a "nothing matches" test case, all the cycles were spent in building the history, and finding matches. Finding out how long a match is was negligible. Of course, it might be a different story with input where the encoding helps and you have matches, but I think we were doing pretty well in those cases already.
The main reason to process in chunks as much as possible is to save CPU cycles. For example, if we build the hash table byte-by-byte, then even for the best case where most of the tuple has a match, it will have reasonable overhead due to formation of the hash table.

Hmm. One very simple optimization we could do is to just compare the two strings byte by byte, before doing anything else, to find any common prefix they might have. Then output a tag for the common prefix, and run the normal algorithm on the rest of the strings. In many real-world tables, the first 1-2 columns are a key that never changes, so that might work pretty well in practice. Maybe it would also be worthwhile to do the same for any common suffix the tuples might have. That would fail to find matches where you e.g. update the last column to have the same value as the first column, and change nothing else, but that's OK. We're not aiming for the best possible compression, just trying to avoid WAL-logging data that wasn't changed.

Here during the match phase, I think we can avoid copying literal bytes until a match is found; that can save cycles for cases when the old and new tuple are mostly different.

I think the extra if's in the loop will actually cost you more cycles than you save. You could perhaps have two copies of the main match-finding loop though. First, loop without outputting anything, until you find the first match. Then, output anything up to that point as literals. Then fall into the second loop, which outputs any non-matches byte by byte.

- Heikki
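The common prefix/suffix idea Heikki describes above (which, per the later messages in this thread, is essentially what the rewritten heapam.c version ended up doing) can be sketched as a standalone helper. Names and signature are illustrative, not the actual PostgreSQL code:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of common prefix/suffix detection between the old and new
 * tuple data: only the middle part between the shared prefix and
 * shared suffix would then need delta encoding (or WAL-logging). */
static void
find_common_affixes(const char *old_t, size_t old_len,
					const char *new_t, size_t new_len,
					size_t *prefix_len, size_t *suffix_len)
{
	size_t		minlen = old_len < new_len ? old_len : new_len;
	size_t		p = 0;
	size_t		s = 0;

	/* Common prefix: walk forward while bytes match. */
	while (p < minlen && old_t[p] == new_t[p])
		p++;

	/* Common suffix: walk backward, but never overlap the prefix. */
	while (s < minlen - p &&
		   old_t[old_len - 1 - s] == new_t[new_len - 1 - s])
		s++;

	*prefix_len = p;
	*suffix_len = s;
}
```

For a typical UPDATE that changes one field in the middle of the tuple, only the changed bytes fall between the prefix and suffix, so almost nothing needs to be WAL-logged as new data.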
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 01/27/2014 07:03 PM, Amit Kapila wrote:

I have tried to improve the algorithm in another way so that we can get the benefit of same chunks during find match (something similar to LZ). The main change is to consider chunks at a fixed boundary (4 bytes) and, after finding a match, try to find if there is a longer match than the current chunk. While finding a longer match, it still takes care that the next bigger match should be at a chunk boundary. I am not completely sure about the chunk boundary; maybe 8 or 16 can give better results.

Since you're only putting a value in the history every 4 bytes, you wouldn't need to calculate the hash in a rolling fashion. You could just take the next four bytes, calculate the hash, put it in the history table. Then the next four bytes, calculate the hash, and so on. Might save some cycles when building the history table...

- Heikki
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
I think sorting by a string column is slower during merge join; maybe the comparison function used by the sort needs to be refined to save some cycles. It's the hot function when doing a sort.

Heikki Linnakangas hlinnakan...@vmware.com wrote:
On 01/27/2014 07:03 PM, Amit Kapila wrote:

I have tried to improve the algorithm in another way so that we can get the benefit of same chunks during find match (something similar to LZ). The main change is to consider chunks at a fixed boundary (4 bytes) and, after finding a match, try to find if there is a longer match than the current chunk. While finding a longer match, it still takes care that the next bigger match should be at a chunk boundary. I am not completely sure about the chunk boundary; maybe 8 or 16 can give better results.

Since you're only putting a value in the history every 4 bytes, you wouldn't need to calculate the hash in a rolling fashion. You could just take the next four bytes, calculate the hash, put it in the history table. Then the next four bytes, calculate the hash, and so on. Might save some cycles when building the history table...

- Heikki
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Mon, Jan 27, 2014 at 12:03 PM, Amit Kapila amit.kapil...@gmail.com wrote:

I think that's a good thing to try. Can you code it up?

I have tried to improve the algorithm in another way so that we can get the benefit of same chunks during find match (something similar to LZ). The main change is to consider chunks at a fixed boundary (4 bytes) and, after finding a match, try to find if there is a longer match than the current chunk. While finding a longer match, it still takes care that the next bigger match should be at a chunk boundary. I am not completely sure about the chunk boundary; maybe 8 or 16 can give better results. I think now we can run with this patch once on a high-end m/c.

Here are the results I got on the community PPC64 box.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgrb-v5 test.ods
Description: application/vnd.oasis.opendocument.spreadsheet
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Jan 21, 2014 at 2:00 AM, Amit Kapila amit.kapil...@gmail.com wrote:
On Mon, Jan 20, 2014 at 9:49 PM, Robert Haas robertmh...@gmail.com wrote:

I ran Heikki's test suite on latest master and latest master plus pgrb_delta_encoding_v4.patch on a PPC64 machine, but the results didn't look too good. The only tests where the WAL volume changed by more than half a percent were the "one short and one long field, no change" test, where it dropped by 17%, but at the expense of an increase in duration of 38%; and the "hundred tiny fields, half nulled" test, where it dropped by 2% without a change in runtime. Unfortunately, some of the tests where WAL didn't change significantly took a runtime hit - in particular, "hundred tiny fields, half changed" slowed down by 10% and "hundred tiny fields, all changed" by 8%.

I think this part of the result is positive, as with earlier approaches the dip here was 20%. Refer to the result posted at:
http://www.postgresql.org/message-id/51366323.8070...@vmware.com

I've attached the full results in OpenOffice format. Profiling the "one short and one long field, no change" test turns up the following:

51.38%  postgres  pgrb_delta_encode
23.58%  postgres  XLogInsert
 2.54%  postgres  heap_update
 1.09%  postgres  LWLockRelease
 0.90%  postgres  LWLockAcquire
 0.89%  postgres  palloc0
 0.88%  postgres  log_heap_update
 0.84%  postgres  HeapTupleSatisfiesMVCC
 0.75%  postgres  ExecModifyTable
 0.73%  postgres  hash_search_with_hash_value

Yipes. That's a lot more than I remember this costing before. And I don't understand why I'm seeing such a large time hit on this test where you actually saw a significant time *reduction*. One possibility is that you may have been running with a default checkpoint_segments value or one that's low enough to force checkpointing activity during the test. I ran with checkpoint_segments=300.
I ran with checkpoint_segments = 128, and when I ran with v4, I also see a similar WAL reduction as you are seeing, except that in my case the runtime for both is almost similar (I think in your case disk writes are fast, so the CPU overhead is more visible). I think the major difference in the above test is due to the below part of code:

pgrb_find_match()
{
..
+		/*	if (match_chunk)
+			{
+				while (*ip == *hp)
+				{
+					matchlen++;
+					ip++;
+					hp++;
+				}
+			} */
}

Basically, if we don't go for the longer match, then for a test where most data ("one short and one long field, no change") is similar, it has to do the below extra steps with no advantage:
a. copy extra tags
b. calculation for rolling hash
c. finding the match

I think here the major cost is due to 'a', but the others might also not be free. To confirm the theory, if we run the test with just the above code un-commented, there can be a significant change in both WAL reduction and runtime for this test. I have one idea to avoid the overhead of step a), which is to combine the tags: don't write the tag until we find any un-matching data. When any un-matched data is found, combine all the previously matched data and write it as one tag. This should eliminate the overhead due to step a.

I think that's a good thing to try. Can you code it up?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
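The tag-combining idea described above (delay writing a match tag until un-matching data appears, then emit one combined tag) can be sketched as a tiny encoder state machine. All types, names, and the tag-counting output are illustrative assumptions, not the patch's actual encoder:

```c
#include <assert.h>

/* Sketch of the "combine tags" idea: contiguous chunk matches are
 * accumulated into one pending match, and a single tag is emitted
 * only when a literal (un-matched) byte interrupts the run. */
typedef struct
{
	int			match_start;	/* history offset of pending match, or -1 */
	int			match_len;		/* accumulated length of pending match */
} EncoderState;

static int	ntags;				/* combined tags emitted, for illustration */

static void
flush_match(EncoderState *st)
{
	if (st->match_len > 0)
	{
		ntags++;				/* emit ONE tag for the whole accumulated run */
		st->match_len = 0;
	}
}

static void
on_match(EncoderState *st, int hist_off, int len)
{
	if (st->match_len > 0 && hist_off == st->match_start + st->match_len)
		st->match_len += len;	/* contiguous with pending match: extend it */
	else
	{
		flush_match(st);		/* discontinuous: close the previous run */
		st->match_start = hist_off;
		st->match_len = len;
	}
}

static void
on_literal(EncoderState *st, unsigned char byte)
{
	flush_match(st);			/* a literal ends the run: emit combined tag */
	/* ... copy the literal byte to the output here ... */
	(void) byte;
}
```

With per-chunk tags, three contiguous 4-byte matches would cost three tags; combined, they cost one, which is exactly the "copy extra tags" overhead (step a) being avoided.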
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Jan 16, 2014 at 12:07 AM, Amit Kapila amit.kapil...@gmail.com wrote:

Okay, got your point. Another minor thing is that in the latest patch, which I sent yesterday, I have modified it such that during formation of chunks, if there is data at the end of the string which doesn't have the special pattern and is less than the max chunk size, we also consider that as a chunk. The reason for doing this was that, let us say we have a 104-byte string which contains no special bit pattern; then it will just have one 64-byte chunk and will leave the remaining bytes, which might miss the chance of compressing that data.

I ran Heikki's test suite on latest master and latest master plus pgrb_delta_encoding_v4.patch on a PPC64 machine, but the results didn't look too good. The only tests where the WAL volume changed by more than half a percent were the "one short and one long field, no change" test, where it dropped by 17%, but at the expense of an increase in duration of 38%; and the "hundred tiny fields, half nulled" test, where it dropped by 2% without a change in runtime. Unfortunately, some of the tests where WAL didn't change significantly took a runtime hit - in particular, "hundred tiny fields, half changed" slowed down by 10% and "hundred tiny fields, all changed" by 8%. I've attached the full results in OpenOffice format.

Profiling the "one short and one long field, no change" test turns up the following:

51.38%  postgres  pgrb_delta_encode
23.58%  postgres  XLogInsert
 2.54%  postgres  heap_update
 1.09%  postgres  LWLockRelease
 0.90%  postgres  LWLockAcquire
 0.89%  postgres  palloc0
 0.88%  postgres  log_heap_update
 0.84%  postgres  HeapTupleSatisfiesMVCC
 0.75%  postgres  ExecModifyTable
 0.73%  postgres  hash_search_with_hash_value

Yipes. That's a lot more than I remember this costing before. And I don't understand why I'm seeing such a large time hit on this test where you actually saw a significant time *reduction*.
One possibility is that you may have been running with a default checkpoint_segments value or one that's low enough to force checkpointing activity during the test. I ran with checkpoint_segments=300.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgrb_delta_encoding_v4.ods
Description: application/vnd.oasis.opendocument.spreadsheet
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Mon, Jan 20, 2014 at 9:49 PM, Robert Haas robertmh...@gmail.com wrote:

I ran Heikki's test suite on latest master and latest master plus pgrb_delta_encoding_v4.patch on a PPC64 machine, but the results didn't look too good. The only tests where the WAL volume changed by more than half a percent were the "one short and one long field, no change" test, where it dropped by 17%, but at the expense of an increase in duration of 38%; and the "hundred tiny fields, half nulled" test, where it dropped by 2% without a change in runtime. Unfortunately, some of the tests where WAL didn't change significantly took a runtime hit - in particular, "hundred tiny fields, half changed" slowed down by 10% and "hundred tiny fields, all changed" by 8%.

I think this part of the result is positive, as with earlier approaches the dip here was 20%. Refer to the result posted at:
http://www.postgresql.org/message-id/51366323.8070...@vmware.com

I've attached the full results in OpenOffice format. Profiling the "one short and one long field, no change" test turns up the following:

51.38%  postgres  pgrb_delta_encode
23.58%  postgres  XLogInsert
 2.54%  postgres  heap_update
 1.09%  postgres  LWLockRelease
 0.90%  postgres  LWLockAcquire
 0.89%  postgres  palloc0
 0.88%  postgres  log_heap_update
 0.84%  postgres  HeapTupleSatisfiesMVCC
 0.75%  postgres  ExecModifyTable
 0.73%  postgres  hash_search_with_hash_value

Yipes. That's a lot more than I remember this costing before. And I don't understand why I'm seeing such a large time hit on this test where you actually saw a significant time *reduction*. One possibility is that you may have been running with a default checkpoint_segments value or one that's low enough to force checkpointing activity during the test. I ran with checkpoint_segments=300.
I ran with checkpoint_segments = 128, and when I ran with v4, I also see a similar WAL reduction as you are seeing, except that in my case the runtime for both is almost similar (I think in your case disk writes are fast, so the CPU overhead is more visible). I think the major difference in the above test is due to the below part of code:

pgrb_find_match()
{
..
+		/*	if (match_chunk)
+			{
+				while (*ip == *hp)
+				{
+					matchlen++;
+					ip++;
+					hp++;
+				}
+			} */
}

Basically, if we don't go for the longer match, then for a test where most data ("one short and one long field, no change") is similar, it has to do the below extra steps with no advantage:
a. copy extra tags
b. calculation for rolling hash
c. finding the match

I think here the major cost is due to 'a', but the others might also not be free. To confirm the theory, if we run the test with just the above code un-commented, there can be a significant change in both WAL reduction and runtime for this test. I have one idea to avoid the overhead of step a), which is to combine the tags: don't write the tag until we find any un-matching data. When any un-matched data is found, combine all the previously matched data and write it as one tag. This should eliminate the overhead due to step a.

Can we think of any way in which, in spite of doing longer matches, we can retain the sanctity of this approach? One way could be to check if the match after the chunk is long enough that it matches the rest of the string, but I think that can create problems in some other cases.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Mon, Nov 25, 2013 at 6:55 PM, Robert Haas robertmh...@gmail.com wrote:

But even if that doesn't pan out, I think the fallback position should not be "OK, well, if we can't get decreased I/O for free then forget it" but rather "OK, if we can't get decreased I/O for free then let's get decreased I/O in exchange for increased CPU usage."

While I haven't been following the development of this patch, I will note that on the face of it the latter seems like a trade-off I'd be quite willing to make.

--
Peter Geoghegan
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Jan 16, 2014 at 12:07 AM, Amit Kapila amit.kapil...@gmail.com wrote:
On Thu, Jan 16, 2014 at 12:49 AM, Robert Haas robertmh...@gmail.com wrote:
On Wed, Jan 15, 2014 at 7:28 AM, Amit Kapila amit.kapil...@gmail.com wrote:

Unpatched
---------
                 testname                 | wal_generated |     duration
------------------------------------------+---------------+------------------
 one short and one long field, no change  |    1054923224 | 33.101135969162

After pgrb_delta_encoding_v4
----------------------------
                 testname                 | wal_generated |     duration
------------------------------------------+---------------+------------------
 one short and one long field, no change  |     877859144 | 30.6749138832092

Temporary Changes (Revert Max Chunksize = 4 and logic of finding longer match)
------------------------------------------------------------------------------
                 testname                 | wal_generated |     duration
------------------------------------------+---------------+------------------
 one short and one long field, no change  |     677337304 | 25.4048750400543

Sure, but watch me not care. If we're interested in taking advantage of the internal compressibility of tuples, we can do a lot better than this patch. We can compress the old tuple and the new tuple. We can compress full-page images. We can compress inserted tuples. But that's not the point of this patch. The point of *this* patch is to exploit the fact that the old and new tuples are likely to be very similar, NOT to squeeze out every ounce of compression from other sources.

Okay, got your point. Another minor thing is that in the latest patch, which I sent yesterday, I have modified it such that during formation of chunks, if there is data at the end of the string which doesn't have the special pattern and is less than the max chunk size, we also consider that as a chunk. The reason for doing this was that, let us say we have a 104-byte string which contains no special bit pattern; then it will just have one 64-byte chunk and will leave the remaining bytes, which might miss the chance of compressing that data.

Yeah, that sounds right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Jan 15, 2014 at 5:58 PM, Amit Kapila amit.kapil...@gmail.com wrote:
On Fri, Jan 10, 2014 at 9:12 PM, Robert Haas robertmh...@gmail.com wrote:

Performance Data
----------------
Non-default settings:
autovacuum = off
checkpoint_segments = 128
checkpoint_timeout = 10min

Unpatched
---------
                 testname                 | wal_generated |     duration
------------------------------------------+---------------+------------------
 one short and one long field, no change  |    1054923224 | 33.101135969162

After pgrb_delta_encoding_v4
----------------------------
                 testname                 | wal_generated |     duration
------------------------------------------+---------------+------------------
 one short and one long field, no change  |     877859144 | 30.6749138832092

Temporary Changes (Revert Max Chunksize = 4 and logic of finding longer match)
------------------------------------------------------------------------------
                 testname                 | wal_generated |     duration
------------------------------------------+---------------+------------------
 one short and one long field, no change  |     677337304 | 25.4048750400543

Sorry, a minor correction to my last mail: the last data set ("Temporary Changes") is obtained just by reverting the logic of finding a longer match in pgrb_find_match().

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Jan 15, 2014 at 7:28 AM, Amit Kapila amit.kapil...@gmail.com wrote:

Unpatched
---------
                testname                 | wal_generated |    duration
-----------------------------------------+---------------+-----------------
 one short and one long field, no change |    1054923224 | 33.101135969162

After pgrb_delta_encoding_v4
----------------------------
                testname                 | wal_generated |    duration
-----------------------------------------+---------------+-----------------
 one short and one long field, no change |     877859144 | 30.6749138832092

Temporary Changes (Revert Max Chunksize = 4 and logic of finding longer match)
------------------------------------------------------------------------------
                testname                 | wal_generated |    duration
-----------------------------------------+---------------+-----------------
 one short and one long field, no change |     677337304 | 25.4048750400543

Sure, but watch me not care. If we're interested in taking advantage of the internal compressibility of tuples, we can do a lot better than this patch. We can compress the old tuple and the new tuple. We can compress full-page images. We can compress inserted tuples. But that's not the point of this patch. The point of *this* patch is to exploit the fact that the old and new tuples are likely to be very similar, NOT to squeeze out every ounce of compression from other sources. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Jan 16, 2014 at 12:49 AM, Robert Haas robertmh...@gmail.com wrote: On Wed, Jan 15, 2014 at 7:28 AM, Amit Kapila amit.kapil...@gmail.com wrote:

Unpatched
---------
                testname                 | wal_generated |    duration
-----------------------------------------+---------------+-----------------
 one short and one long field, no change |    1054923224 | 33.101135969162

After pgrb_delta_encoding_v4
----------------------------
                testname                 | wal_generated |    duration
-----------------------------------------+---------------+-----------------
 one short and one long field, no change |     877859144 | 30.6749138832092

Temporary Changes (Revert Max Chunksize = 4 and logic of finding longer match)
------------------------------------------------------------------------------
                testname                 | wal_generated |    duration
-----------------------------------------+---------------+-----------------
 one short and one long field, no change |     677337304 | 25.4048750400543

Sure, but watch me not care. If we're interested in taking advantage of the internal compressibility of tuples, we can do a lot better than this patch. We can compress the old tuple and the new tuple. We can compress full-page images. We can compress inserted tuples. But that's not the point of this patch. The point of *this* patch is to exploit the fact that the old and new tuples are likely to be very similar, NOT to squeeze out every ounce of compression from other sources. Okay, got your point. Another minor thing: in the latest patch I sent yesterday, I modified chunk formation so that if there is data at the end of the string that doesn't contain the special pattern and is shorter than the max chunk size, we still consider it a chunk. The reason for doing this is that, for example, a 104-byte string containing no special bit pattern would otherwise produce just one 64-byte chunk and leave the remaining bytes out, which might miss the chance of compressing that data. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Jan 14, 2014 at 1:16 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jan 14, 2014 at 2:16 AM, Robert Haas robertmh...@gmail.com wrote: On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila amit.kapil...@gmail.com wrote: Yes, currently this applies to update; what I have in mind is that if in the future someone wants to use WAL compression for any other operation, like 'full_page_writes', it can be easily extended. To be honest, I have not evaluated whether such a flag or compression would make sense for full-page writes, but I think it should be possible while doing a full-page write (BkpBlock has RelFileNode) to check such a flag if it's present. Makes sense. So shall I change it to a string instead of a bool and keep the name as compress_wal or compress_wal_for_opr? No. If we add full-page-write compression in the future, that can be a separate option. But I doubt we'd want to set that at the table level anyway; there's no particular reason that would be good for some tables and bad for others (whereas in this case there is such a reason). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila amit.kapil...@gmail.com wrote: Yes, currently this applies to update, what I have in mind is that in future if some one wants to use WAL compression for any other operation like 'full_page_writes', then it can be easily extendible. To be honest, I have not evaluated whether such a flag or compression would make sense for full page writes, but I think it should be possible while doing full page write (BkpBlock has RelFileNode) to check such a flag if it's present. Makes sense. The reason of adding the same chunk in head of list is that it uses same technique as pglz_hist_add. Now in pglz, it will not have repeat steps from c~f, as it has concept of good_match which leads to get this done in one go. Being said above, I am really not sure, how much real world data falls in above category and should we try to optimize based on above example, but yes it will save some CPU cycles in current test we are using. In the Rabin algorithm, we shouldn't try to find a longer match. The match should end at the chunk end, period. Otherwise, you lose the shift-resistant property of the algorithm. But I do think there might be a bug here, which is that, unless I'm misinterpreting something, hp is NOT the end of the chunk. After calling pgrb_hash_init(), we've looked at the first FOUR bytes of the input. If we find that we have a zero hash value at that point, shouldn't the chunk size be 4, not 1? And similarly if we find it after sucking in one more byte, shouldn't the chunk size be 5, not 2? Right now, we're deciding where the chunks should end based on the data in the chunk plus the following 3 bytes, and that seems wonky. I would expect us to include all of those bytes in the chunk. It depends on how we define chunk, basically chunk size will be based on the byte for which we consider hindex. 
The hindex for any byte is calculated from that byte and the following 3 bytes, so after calling pgrb_hash_init(), even though we have looked at 4 bytes, the hindex is still for the first byte, and that's why it considers the chunk size to be 1, not 4. Isn't this similar to how the current pglz works? It also uses the next 4 bytes to calculate the index (pglz_hist_idx) but still does byte-by-byte comparison; if we map that onto Rabin's delta encoding, the chunk size is always 1. I don't quite understand this. The point of the Rabin algorithm is to split the old tuple up into chunks and then look for those chunks in the new tuple. For example, suppose the old tuple is abcdefghijklmnopqrstuvwxyz. It might get split like this: abcdefg hijklmnopqrstuvw xyz. If any of those three chunks appear in the new tuple, then we'll use them for compression. If not, we'll just copy the literal bytes. If the chunks appear in the new tuple reordered or shifted or with stuff inserted between one chunk and the next, we'll still find them. Unless I'm confused, which is possible, what you're doing is essentially looking at the string and splitting it in those three places, but then recording the chunks as being three bytes shorter than they really are. I don't see how that can be right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Jan 14, 2014 at 2:16 AM, Robert Haas robertmh...@gmail.com wrote: On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila amit.kapil...@gmail.com wrote: Yes, currently this applies to update; what I have in mind is that if in the future someone wants to use WAL compression for any other operation, like 'full_page_writes', it can be easily extended. To be honest, I have not evaluated whether such a flag or compression would make sense for full-page writes, but I think it should be possible while doing a full-page write (BkpBlock has RelFileNode) to check such a flag if it's present. Makes sense. So shall I change it to a string instead of a bool and keep the name as compress_wal or compress_wal_for_opr? The reason for adding the same chunk at the head of the list is that it uses the same technique as pglz_hist_add. Now pglz will not repeat steps c~f, as it has the concept of good_match, which lets this be done in one go. That being said, I am really not sure how much real-world data falls into the above category and whether we should try to optimize based on the above example, but yes, it will save some CPU cycles in the current test we are using. In the Rabin algorithm, we shouldn't try to find a longer match. The match should end at the chunk end, period. Otherwise, you lose the shift-resistant property of the algorithm. Okay, it will work well for cases when most chunks in the tuple are due to a special pattern in it, but it will lose out on CPU cycles in cases where most of the chunks are due to the maximum chunk boundary and most of the new tuple matches the old tuple. The reason is that if the algorithm has such a property of finding matches longer than chunk boundaries, it can save us from calculating the hash again and again when we try to find a match in the old tuple. However, I think it is better to go with Rabin's algorithm instead of adding optimizations based on our own assumptions, because it is difficult to predict real-world tuple data.
Isn't this similar to how the current pglz works? It also uses the next 4 bytes to calculate the index (pglz_hist_idx) but still does byte-by-byte comparison; if we map that onto Rabin's delta encoding, the chunk size is always 1. I don't quite understand this. The point of the Rabin algorithm is to split the old tuple up into chunks and then look for those chunks in the new tuple. For example, suppose the old tuple is abcdefghijklmnopqrstuvwxyz. It might get split like this: abcdefg hijklmnopqrstuvw xyz. If any of those three chunks appear in the new tuple, then we'll use them for compression. If not, we'll just copy the literal bytes. If the chunks appear in the new tuple reordered or shifted or with stuff inserted between one chunk and the next, we'll still find them. Unless I'm confused, which is possible, what you're doing is essentially looking at the string and splitting it in those three places, but then recording the chunks as being three bytes shorter than they really are. I don't see how that can be right. Spending some more time on the algorithm today, I found the bug you are pointing to, and you are right that the chunk is recorded shorter than it really is. I think it should not be difficult to address this issue without affecting most of the algorithm; let me try to handle it. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
2. Provide a new reloption to specify Wal compression for update operation on table Create table tbl(c1 char(100)) With (compress_wal = true); Alternative options: a. compress_wal can take input as operation, e.g. 'insert', 'update', b. use alternate syntax: Create table tbl(c1 char(100)) Compress Wal For Update; c. anything better? I think WITH (compress_wal = true) is pretty good. I don't understand your point about taking the operation as input, because this only applies to updates. But we could try to work update into the name of the setting somehow, so as to be less likely to conflict with future additions, like maybe wal_compress_update. I think the alternate syntax you propose is clearly worse, because it would involve adding new keywords, something we try to avoid. The only possible enhancement I can think of here is to make the setting an integer rather than a Boolean, defined as the minimum acceptable compression ratio. A setting of 0 means always compress; a setting of 100 means never compress; intermediate values define the least acceptable ratio. But to be honest, I think that's overkill; I'd be inclined to hard-code the default value of 25 in the patch and make it a #define. The only real advantage of requiring a minimum 25% compression percentage is that we can bail out on compression three-quarters of the way through the tuple if we're getting nowhere. That's fine for what it is, but the idea that users are going to see much benefit from twaddling that number seems very dubious to me. Points to consider - 1. As the current algorithm store the entry for same chunks at head of list, it will always find last but one chunk (we don't store last 4 bytes) for long matching string during match phase in encoding (pgrb_delta_encode). We can improve it either by storing same chunks at end of list instead of at head or by trying to find a good_match technique used in lz algorithm. 
Finding good_match technique can have overhead in some of the cases when there is actually no match. I don't see what the good_match thing has to do with anything in the Rabin algorithm. But I do think there might be a bug here, which is that, unless I'm misinterpreting something, hp is NOT the end of the chunk. After calling pgrb_hash_init(), we've looked at the first FOUR bytes of the input. If we find that we have a zero hash value at that point, shouldn't the chunk size be 4, not 1? And similarly if we find it after sucking in one more byte, shouldn't the chunk size be 5, not 2? Right now, we're deciding where the chunks should end based on the data in the chunk plus the following 3 bytes, and that seems wonky. I would expect us to include all of those bytes in the chunk. 2. Another optimization that we can do in pgrb_find_match(), is that currently if it doesn't find the first chunk (chunk got by hash index) matching, it continues to find the match in other chunks. I am not sure if there is any benefit to search for other chunks if first one is not matching. Well, if you took that out, I suspect it would hurt the compression ratio. Unless the CPU savings are substantial, I'd leave it alone. 3. We can move code from pg_lzcompress.c to some new file pg_rbcompress.c, if we want to move, then we need to either duplicate some common macros like pglz_out_tag or keep it common, but might be change the name. +1 for a new file. 4. Decide on min and max chunksize. (currently kept as 2 and 4 respectively). The point to consider is that if we keep bigger chunk sizes, then it can save us on CPU cycles, but less reduction in Wal, on the other side if we keep it small it can have better reduction in Wal but consume more CPU cycles. Whoa. That seems way too small. Since PGRB_PATTERN_AFTER_BITS is 4, the average length of a chunk is about 16 bytes. It makes little sense to have the maximum chunk size be 25% of the expected chunk length. 
I'd recommend making the maximum chunk length something like 4 * PGRB_CONST_NUM, and the minimum chunk length maybe something like 4. 5. Kept a GUC variable 'wal_update_compression_ratio' for test purposes; we can remove it before commit. Let's remove it now. 7. Docs need to be updated; tab completion needs some work. Tab completion can be skipped for now, but documentation is important. 8. We can extend Alter Table to set the compress option for a table. I don't understand what you have in mind here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Fri, Jan 10, 2014 at 9:12 PM, Robert Haas robertmh...@gmail.com wrote: 2. Provide a new reloption to specify Wal compression for update operation on table Create table tbl(c1 char(100)) With (compress_wal = true); Alternative options: a. compress_wal can take input as operation, e.g. 'insert', 'update', b. use alternate syntax: Create table tbl(c1 char(100)) Compress Wal For Update; c. anything better? I think WITH (compress_wal = true) is pretty good. I don't understand your point about taking the operation as input, because this only applies to updates. Yes, currently this applies to update, what I have in mind is that in future if some one wants to use WAL compression for any other operation like 'full_page_writes', then it can be easily extendible. To be honest, I have not evaluated whether such a flag or compression would make sense for full page writes, but I think it should be possible while doing full page write (BkpBlock has RelFileNode) to check such a flag if it's present. But we could try to work update into the name of the setting somehow, so as to be less likely to conflict with future additions, like maybe wal_compress_update. I think the alternate syntax you propose is clearly worse, because it would involve adding new keywords, something we try to avoid. Yes, this would be better than current, I will include it in next version of patch, unless there is any other better idea than this one. Points to consider - 1. As the current algorithm store the entry for same chunks at head of list, it will always find last but one chunk (we don't store last 4 bytes) for long matching string during match phase in encoding (pgrb_delta_encode). We can improve it either by storing same chunks at end of list instead of at head or by trying to find a good_match technique used in lz algorithm. Finding good_match technique can have overhead in some of the cases when there is actually no match. 
I don't see what the good_match thing has to do with anything in the Rabin algorithm. The case for which I mentioned it is where most of the data is repetitive and the modified tuple is almost the same. For example, original tuple: aaaa followed by a long run of b's; modified tuple: ccaa followed by the same run of b's. Now let us see what will happen as per the current algorithm.
Step 1: Form the hash table (let's consider 4-byte chunks just for the purpose of explanation):
a. First chunk after the first 4 a's (aaaa), with hash value 1024.
b. After that all the chunks will be 4 b's (bbbb) and will have the same hash value (let's call it H2), so they will be mapped to the same bucket (let us call it bucket 'B2') and hence form a list. Every new chunk with the same hash value is added to the front of the list, so B2's front element has hash value H2 and location 58 (8 bytes from the end, as we don't include the last 4 bytes in our algorithm).
Step 2: Perform encoding of the modified tuple using the hash table formed in Step 1:
a. First chunk after the first 4 bytes (ccaa), with hash value 1056.
b. Try to find a match in the hash table; no match, so proceed.
c. Next chunk is 4 b's with hash value H2; trying to find a match, it will find one in bucket B2 (at the first element).
d. The hash value matched, so it will try to match each byte; if all bytes match, it considers the chunk a matched_chunk.
e. Now that the chunk matches, we try to match consecutive bytes after the chunk (in the hope of finding a longer match); in this case it will find a match of 8 bytes and set match_len to 8.
f. It will advance the modified-tuple pointer (dp) to the byte after match_len.
g. Again, it will consider the next 4 b's as a chunk and repeat steps c~f.
So what is happening here is that steps c~f get repeated, whereas if we had added identical chunks at the end of the list in step 1b, we could have found the matching string in one go (c~f). The reason for adding the same chunk at the head of the list is that it uses the same technique as pglz_hist_add.
Now in pglz, it will not have repeat steps from c~f, as it has concept of good_match which leads to get this done in one go. Being said above, I am really not sure, how much real world data falls in above category and should we try to optimize based on above example, but yes it will save some CPU cycles in current test we are using. But I do think there might be a bug here, which is that, unless I'm misinterpreting something, hp is NOT the end of the chunk. After calling pgrb_hash_init(), we've looked at the first FOUR bytes of the input. If we find that we have a zero hash value at that point, shouldn't the
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Fri, Dec 6, 2013 at 6:41 PM, Amit Kapila amit.kapil...@gmail.com wrote: Agreed, summarization of data for LZ/Chunkwise encoding especially for non-compressible (hundred tiny fields, all changed/half changed) or less compressible data (hundred tiny fields, half nulled) w.r.t CPU usage is as below: a. For hard disk, there is an overhead of 7~16% with LZ delta encoding and there is an overhead of 5~8% with Chunk wise encoding. b. For Tempfs (which means operate on RAM as disk), there is an overhead of 19~26% with LZ delta encoding and there is an overhead of 9~18% with Chunk wise encoding. There might be some variation of data (in your last mail the overhead for chunkwise method for Tempfs was 12%), but in general the data suggests that chunk wise encoding has less overhead than LZ encoding for non-compressible data and for others it is better or equal. Now, I think we have below options for this patch: a. If the performance overhead for worst case is acceptable (we can try to reduce some more, but don't think it will be something drastic), then this can be done without any flag/option. b. Have it with table level option to enable/disable WAL compression c. Drop this patch, as for worst cases there is some performance overhead. d. Go back and work more on it, if there is any further suggestions for improvement. Based on data posted previously for both approaches (lz_delta, chunk_wise_encoding) and above options, I have improved the last version of patch by keeping chunk wise approach and provided a table level option to user. Changes in this version of patch: -- 1. Implement decoding, it is almost similar to pglz_decompress as the format to store encoded data is not changed much. 2. Provide a new reloption to specify Wal compression for update operation on table Create table tbl(c1 char(100)) With (compress_wal = true); Alternative options: a. compress_wal can take input as operation, e.g. 'insert', 'update', b. 
use alternate syntax: Create table tbl(c1 char(100)) Compress Wal For Update; c. anything better? 3. Fixed below 2 defects in encoding: a. In function pgrb_find_match(), if last byte of chunk matches, it consider whole chunk as match. b. If there is no match, it copies chunk as it is to encoded data, while copying, it is ignoring last byte. Due to defect fixes, data can vary, but I don't think there can be any major change. Points to consider - 1. As the current algorithm store the entry for same chunks at head of list, it will always find last but one chunk (we don't store last 4 bytes) for long matching string during match phase in encoding (pgrb_delta_encode). We can improve it either by storing same chunks at end of list instead of at head or by trying to find a good_match technique used in lz algorithm. Finding good_match technique can have overhead in some of the cases when there is actually no match. 2. Another optimization that we can do in pgrb_find_match(), is that currently if it doesn't find the first chunk (chunk got by hash index) matching, it continues to find the match in other chunks. I am not sure if there is any benefit to search for other chunks if first one is not matching. 3. We can move code from pg_lzcompress.c to some new file pg_rbcompress.c, if we want to move, then we need to either duplicate some common macros like pglz_out_tag or keep it common, but might be change the name. 4. Decide on min and max chunksize. (currently kept as 2 and 4 respectively). The point to consider is that if we keep bigger chunk sizes, then it can save us on CPU cycles, but less reduction in Wal, on the other side if we keep it small it can have better reduction in Wal but consume more CPU cycles. 5. kept an guc variable 'wal_update_compression_ratio', for test purpose, we can remove it before commit. 7. docs needs to be updated, tab completion needs some work. 8. We can extend Alter Table to set compress option for table. Thoughts/Suggestions? 
With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com [Attachment: pgrb_delta_encoding_v3.patch]
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Dec 12, 2013 at 12:27 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Dec 12, 2013 at 3:43 AM, Peter Eisentraut pete...@gmx.net wrote: This patch fails the regression tests; see attachment. Thanks for reporting the diffs. The reason for the failures is that decoding for the tuple is still not done, as mentioned in the Notes section of the mail (http://www.postgresql.org/message-id/caa4ek1jeuby16uwrda2tabkk+rlrl3giyyqy1mvh_6cthmd...@mail.gmail.com) However, to keep the patch sane, I will do that and post an updated patch. The main point of the new approach at this stage is to get feedback on whether such an optimization is acceptable for worst-case scenarios and, if not, whether we can gate it behind a table-level or GUC option. I don't understand why lack of decoding support should cause regression tests to fail. I thought decoding was only being done during WAL replay, a case not exercised by the regression tests. A few other comments:

+#define PGRB_HKEY_PRIME 11 /* prime number used for rolling hash */
+#define PGRB_HKEY_SQUARE_PRIME 11 * 11 /* prime number used for rolling hash */
+#define PGRB_HKEY_CUBE_PRIME 11 * 11 * 11 /* prime number used for rolling hash */

11 * 11 can't accurately be described as a prime number. Nor can 11 * 11 * 11. Please adjust the comment. Also, why 11? It doesn't appear that pglz_hist_idx is changed except for whitespace; please revert that hunk. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
This patch fails the regression tests; see attachment. *** /var/lib/jenkins/jobs/postgresql_commitfest_world/workspace/src/test/regress/expected/arrays.out 2013-08-24 01:24:42.0 + --- /var/lib/jenkins/jobs/postgresql_commitfest_world/workspace/src/test/regress/results/arrays.out 2013-12-07 16:21:48.0 + *** *** 495,508 (11 rows) SELECT * FROM array_op_test WHERE i @ '{38,34,32,89}' ORDER BY seqno; ! seqno | i | t ! ---+---+ ! 40 | {34} | {AA10611,AAA1205,AAA50956,31334,A70466,81587,AAA74623} ! 74 | {32} | {1729,A22860,AA99807,A17383,AAA67062,AAA15165,AAA50956} ! 98 | {38,34,32,89} | {AA71621,8857,AAA65037,31334,AA48845} !101 | {}| {} ! (4 rows) ! SELECT * FROM array_op_test WHERE i = '{}' ORDER BY seqno; seqno | i | t ---++ --- 495,501 (11 rows) SELECT * FROM array_op_test WHERE i @ '{38,34,32,89}' ORDER BY seqno; ! ERROR: compressed data is corrupt SELECT * FROM array_op_test WHERE i = '{}' ORDER BY seqno; seqno | i | t ---++ *** *** 622,632 (0 rows) SELECT * FROM array_op_test WHERE i @ '{}' ORDER BY seqno; ! seqno | i | t ! ---++ !101 | {} | {} ! (1 row) ! SELECT * FROM array_op_test WHERE i = '{NULL}' ORDER BY seqno; seqno | i| t ---++ --- 615,621 (0 rows) SELECT * FROM array_op_test WHERE i @ '{}' ORDER BY seqno; ! ERROR: compressed data is corrupt SELECT * FROM array_op_test WHERE i = '{NULL}' ORDER BY seqno; seqno | i| t ---++ *** *** 644,654 (0 rows) SELECT * FROM array_op_test WHERE i @ '{NULL}' ORDER BY seqno; ! seqno | i | t ! ---++ !101 | {} | {} ! (1 row) ! SELECT * FROM array_op_test WHERE t @ '{72908}' ORDER BY seqno; seqno | i | t ---+---+ --- 633,639 (0 rows) SELECT * FROM array_op_test WHERE i @ '{NULL}' ORDER BY seqno; ! 
ERROR: compressed data is corrupt SELECT * FROM array_op_test WHERE t @ '{72908}' ORDER BY seqno; seqno | i | t ---+---+ == *** /var/lib/jenkins/jobs/postgresql_commitfest_world/workspace/src/test/regress/expected/polymorphism.out 2013-11-13 20:20:40.0 + --- /var/lib/jenkins/jobs/postgresql_commitfest_world/workspace/src/test/regress/results/polymorphism.out 2013-12-07 16:21:55.0 + *** *** 638,648 -- check that we can apply functions taking ANYARRAY to pg_stats select distinct array_ndims(histogram_bounds) from pg_stats where histogram_bounds is not null; ! array_ndims ! - !1 ! (1 row) ! -- such functions must protect themselves if varying element type isn't OK -- (WHERE clause here is to avoid possibly getting a collation error instead) select max(histogram_bounds) from pg_stats where tablename = 'pg_am'; --- 638,644 -- check that we can apply functions taking ANYARRAY to pg_stats select distinct array_ndims(histogram_bounds) from pg_stats where histogram_bounds is not null; ! ERROR: compressed data is corrupt -- such functions must protect themselves if varying element type isn't OK -- (WHERE clause here is to avoid possibly getting a collation error instead) select max(histogram_bounds) from pg_stats where tablename = 'pg_am'; == -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Dec 12, 2013 at 3:43 AM, Peter Eisentraut pete...@gmx.net wrote: This patch fails the regression tests; see attachment. Thanks for reporting the diffs. The reason for the failures is that decoding for the tuple is still not done, as mentioned in the Notes section of the mail (http://www.postgresql.org/message-id/caa4ek1jeuby16uwrda2tabkk+rlrl3giyyqy1mvh_6cthmd...@mail.gmail.com) However, to keep the patch sane, I will do that and post an updated patch. The main point of the new approach at this stage is to get feedback on whether such an optimization is acceptable for worst-case scenarios and, if not, whether we can gate it behind a table-level or GUC option. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Fri, Dec 6, 2013 at 3:39 PM, Haribabu kommi haribabu.ko...@huawei.com wrote: On 06 December 2013 12:29 Amit Kapila wrote: Note - a. Performance data is taken on my laptop; it needs to be tested on some better m/c. b. The attached patch is just a prototype of the chunkwise concept; the code needs to be improved, and decode handling/testing is pending. I ran the performance test on a linux machine and attached the results in the mail. I ran the performance test on the above patches, including another patch which does snappy hash instead of the normal hash in the LZ algorithm. The performance readings, and the patch with snappy hash not including new data in compression, are attached in the mail.

Thanks for taking the data. The chunkwise approach is giving good performance in most of the scenarios. Agreed; a summary of the data for LZ/chunkwise encoding, especially for non-compressible (hundred tiny fields, all changed/half changed) or less compressible (hundred tiny fields, half nulled) data w.r.t. CPU usage, is as below:

a. For hard disk, there is an overhead of 7~16% with LZ delta encoding, and an overhead of 5~8% with chunkwise encoding.
b. For Tempfs (which means operating on RAM as disk), there is an overhead of 19~26% with LZ delta encoding, and an overhead of 9~18% with chunkwise encoding.

There might be some variation in the data (in your last mail the overhead for the chunkwise method on Tempfs was 12%), but in general the data suggests that chunkwise encoding has less overhead than LZ encoding for non-compressible data, and for the others it is better or equal.

Now, I think we have the below options for this patch:
a. If the performance overhead for the worst case is acceptable (we can try to reduce it some more, but I don't think it will be anything drastic), then this can be done without any flag/option.
b. Have it with a table-level option to enable/disable WAL compression.
c. Drop this patch, as for worst cases there is some performance overhead.
d. Go back and work more on it, if there are any further suggestions for improvement.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 05 December 2013 21:16 Amit Kapila wrote: Note - a. Performance data is taken on my laptop; it needs to be tested on some better m/c. b. The attached patch is just a prototype of the chunkwise concept; the code needs to be improved, and decode handling/testing is pending. I ran the performance test on a linux machine and attached the results in the mail. Regards, Hari babu.

Performance Data
Machine: CPU - 4 core, RAM - 24GB, OS - SUSE 11 SP2
Configuration: encoding - SQL_ASCII, lc_collate - C, lc_ctype - C, shared_buffers - 4GB, wal_buffers - 16MB, synchronous_commit - off, checkpoint_segments - 255, checkpoint_timeout - 25min, autovacuum - off

NORMAL DISK:

Head code:
                testname                | wal_generated |     duration
----------------------------------------+---------------+------------------
 ten tiny fields, all changed           |    1408441624 | 42.9247660636902
 hundred tiny fields, first 10 changed  |     623904568 | 19.5896420478821
 hundred tiny fields, all changed       |     626929520 | 20.4015860557556
 hundred tiny fields, half changed      |     632949424 | 20.4108989238739
 hundred tiny fields, half nulled       |     568672600 | 16.9662220478058
(5 rows)

lz snappy hash delta encoding:
                testname                | wal_generated |     duration
----------------------------------------+---------------+------------------
 ten tiny fields, all changed           |    1314962968 | 39.8615770339966
 hundred tiny fields, first 10 changed  |     452513816 | 16.8626310825348
 hundred tiny fields, all changed       |     458503728 | 18.5805869102478
 hundred tiny fields, half changed      |     463203696 | 17.7005319595337
 hundred tiny fields, half nulled       |     472732888 | 17.7074511051178
(5 rows)

rabin delta encoding:
                testname                | wal_generated |     duration
----------------------------------------+---------------+------------------
 ten tiny fields, all changed           |    1321761080 | 40.9638168811798
 hundred tiny fields, first 10 changed  |     501064768 | 18.8991298675537
 hundred tiny fields, all changed       |     631711040 | 22.2287940979004
 hundred tiny fields, half changed      |     628356072 | 21.4988770484924
 hundred tiny fields, half nulled       |     506637272 | 18.4065499305725
(5 rows)

TEMPFS:

Head code:
                testname                | wal_generated |     duration
----------------------------------------+---------------+------------------
 ten tiny fields, all changed           |    1395199584 | 23.9126200675964
 hundred tiny fields, first 10 changed  |     636436752 | 13.5280549526215
 hundred tiny fields, all changed       |     635591368 | 13.5622470378876
 hundred tiny fields, half changed      |     628001440 | 13.5341370105743
 hundred tiny fields, half nulled       |     564033176 | 12.3501629829407
(5 rows)

lz snappy hash delta encoding:
                testname                | wal_generated |     duration
----------------------------------------+---------------+------------------
 ten tiny fields, all changed           |    1314964400 | 23.2445130348206
 hundred tiny fields, first 10 changed  |     456703104 | 14.2320129871368
 hundred tiny fields, all changed       |     463224736 | 14.0170950889587
 hundred tiny fields, half changed      |     462589768 | 14.032851934433
 hundred tiny fields, half nulled       |     473361760 | 13.3050961494446
(5 rows)

rabin delta encoding:
                testname                | wal_generated |     duration
----------------------------------------+---------------+------------------
 ten tiny fields, all changed           |    1319550032 | 23.7516498565674
 hundred tiny fields, first 10 changed  |     512198256 | 15.0410940647125
 hundred tiny fields, all changed       |     638134296 | 15.143513917923
 hundred tiny fields, half changed      |     635378360 | 15.1855199337006
 hundred tiny fields, half nulled       |     508630832 | 13.4496691226959
(5 rows)
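For readers eyeballing the tables above, a tiny helper turns any pair of rows into percentage deltas; here it is applied to the "hundred tiny fields, first 10 changed" row (normal disk, head vs. lz snappy hash delta encoding):

```python
# Quick comparison helper for the rows above: percentage change of the
# patched run relative to head (negative = reduction).

def pct_change(base, new):
    return (new - base) / base * 100.0

# "hundred tiny fields, first 10 changed", normal disk:
head_wal, lz_wal = 623904568, 452513816
head_dur, lz_dur = 19.5896420478821, 16.8626310825348

wal_delta = pct_change(head_wal, lz_wal)  # about -27.5% WAL generated
dur_delta = pct_change(head_dur, lz_dur)  # about -13.9% duration
```

On this row the patch both shrinks WAL and reduces duration; the contentious cases discussed in the thread are the rows where WAL shrinks but duration grows.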
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Fri, Dec 6, 2013 at 12:10 PM, Haribabu kommi haribabu.ko...@huawei.com wrote: On 05 December 2013 21:16 Amit Kapila wrote: Note - a. Performance data is taken on my laptop; it needs to be tested on some better m/c. b. The attached patch is just a prototype of the chunkwise concept; the code needs to be improved, and decode handling/testing is pending. I ran the performance test on a linux machine and attached the results in the mail. This test doesn't make much sense for comparison, as in chunkwise delta encoding I am not doing compression using the new tuple, and the reason is that I want to check how good/bad it is as compared to the LZ approach for cases when data is non-compressible. So could you please try to take readings using the patch pgrb_delta_encoding_v1 attached in my previous mail. For LZ delta encoding (pgrb_delta_encoding_v1): in heaptuple.c there is a parameter rabin_fingerprint_comp; set it to false, compile the code, and take readings. This will do LZ compression. For chunkwise delta encoding: in heaptuple.c, set the same rabin_fingerprint_comp parameter to true, compile the code, and take readings. This will operate chunkwise. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Mon, Dec 2, 2013 at 7:40 PM, Haribabu kommi haribabu.ko...@huawei.com wrote: On 29 November 2013 03:05 Robert Haas wrote: On Wed, Nov 27, 2013 at 9:31 AM, Amit Kapila amit.kapil...@gmail.com wrote: I tried modifying the existing patch to support dynamic rollup as follows: it rolls up for every 32 bytes of mismatch between the old and new tuple, and it resets back whenever it finds a match. 1. pglz-with-micro-optimization-compress-using-newdata-5: Adds all old tuple data to the history and then checks for a match from the new tuple. After every 32 bytes of mismatch, it checks for a match only once every 2 bytes. It repeats like this until it finds a match or reaches the end of the data. 2. pglz-with-micro-optimization-compress-using-newdata_snappy_hash-1: Adds only the first byte of old tuple data to the history and then checks for a match from the new tuple. If a match is found, then the next unmatched byte from the old tuple is added to the history and the process repeats. If no match is found, then it adds the next byte of the old tuple to the history, followed by the unmatched byte from the new tuple data. In this case the performance is good, but if there are any forward references in the new data with respect to the old data then it will not compress the data. The performance data still has the same problem, that is, on fast disks (tempfs data) it is low. I am already doing a chunk-wise implementation to see if it can improve the situation; please wait and then we can decide what is the best way to proceed. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Nov 27, 2013 at 9:31 AM, Amit Kapila amit.kapil...@gmail.com wrote: Sure, but to explore (a), the scope is a bit bigger. We have the below options to explore (a): 1. try to optimize the existing algorithm used in the patch, which we have tried, but of course we can spend some more time to see if anything more can be tried out. 2. try the fingerprint technique as suggested by you above. 3. try some other standard methods like vcdiff, lz4 etc. Well, obviously, I'm hot on idea #2 and think that would be worth spending some time on. If we can optimize the algorithm used in the patch some more (option #1), that would be fine, too, but the code looks pretty tight to me, so I'm not sure how successful that's likely to be. But if you have an idea, sure. As to #3, I took a look at lz4 and snappy but neither seems to have an API for delta compression. vcdiff is a commonly-used output format for delta compression but doesn't directly address the question of what algorithm to use to find matches between the old and new input; and I suspect that when you go searching for algorithms to do that efficiently, it's going to bring you right back to #2. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Nov 27, 2013 at 12:56 AM, Amit Kapila amit.kapil...@gmail.com wrote: - There is a comment TODO: It would be nice to behave like the history and the source strings were concatenated, so that you could compress using the new data, too. If we're not already doing that, then how are we managing to compress WAL by more than one-quarter in the hundred tiny fields, all changed case? Algorithm is not doing concatenation of history and source strings; the hash table is formed just with history data, and then later if a match is not found it is added to the history, so this can be the reason for the above result. From the compressor's point of view, that's pretty much equivalent to behaving as if those strings were concatenated. The point is that there's a difference between using the old tuple's history entries to compress the new tuple and using the new tuple's own history to compress it. The former is delta-compression, which is what we're supposedly doing here. The latter is just plain compression. That doesn't *necessarily* make it a bad idea, but they are clearly two different things. The basic idea is that you use a rolling hash function to divide up the history data into chunks of a given average size. So we scan the history data, compute a rolling hash value at each point, and each time the bottom N bits are zero, we consider that to be the end of a chunk. We enter all the chunks into a hash table. The chunk size will vary, but on the average, given a reasonably well-behaved rolling hash function (the pglz one probably doesn't qualify) it'll happen every 2^N bytes, so perhaps for this purpose we'd choose N to be between 3 and 5. Then, we scan the input we want to compress and divide it into chunks in the same way. Chunks that don't exist in the history data get copied to the output, while those that do get replaced with a reference to their position in the history data.
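The chunking scheme described above can be sketched in a few lines. To be clear, this is an illustrative toy and not the pglz or PostgreSQL implementation: the shift-and-xor rolling hash, the boundary mask, and the (offset, length) reference format are all simplified assumptions, and a real Rabin fingerprint uses polynomial arithmetic rather than this hash.

```python
# Toy sketch of chunk-based delta encoding via a rolling hash.
# NOT the pglz/PostgreSQL implementation; hash and formats are simplified.

def chunk_split(data, mask=0x7):
    """Split data into variable-size chunks; a chunk ends wherever the
    bottom bits of a rolling hash are zero (average size ~ mask+1 bytes)."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFF        # toy rolling hash, not Rabin
        if (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])        # trailing partial chunk
    return chunks

def delta_encode(history, new):
    """Emit ('ref', offset, length) for chunks of `new` found in `history`,
    and ('lit', bytes) for chunks that are not."""
    table, pos = {}, 0
    for c in chunk_split(history):
        table.setdefault(bytes(c), pos)    # remember first occurrence
        pos += len(c)
    ops = []
    for c in chunk_split(new):
        key = bytes(c)
        if key in table:
            ops.append(("ref", table[key], len(c)))
        else:
            ops.append(("lit", key))
    return ops

def delta_decode(history, ops):
    """Reassemble the new tuple from history references and literals."""
    out = bytearray()
    for op in ops:
        if op[0] == "ref":
            out += history[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Because identical inputs chunk identically, an unchanged tuple encodes entirely as references, and a one-byte prefix insertion only perturbs the chunks up to the next hash boundary, which is exactly the resistance to shifted data discussed in the thread.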
I think this idea looks better than the current one and it will definitely improve some of the cases, but I am not sure we can win in all cases. We have tried one similar idea (reduce the size of the hash and eventually the comparison) by adding every 10th byte to the history lookup table rather than every byte. It improved most cases but not all (the hundred tiny fields, all changed and hundred tiny fields, half changed tests were still slow). Patch and results are at link (refer approach-1): http://www.postgresql.org/message-id/001f01ce1c14$d3af0770$7b0d1650$@kap...@huawei.com What you did there will, I think, tend to miss a lot of compression opportunities. Suppose for example that the old tuple is ABCDEFGHIJKLMNOP and the new tuple is xABCDEFGHIJKLMNOP. After copying one literal byte we'll proceed to copy 9 more, missing the fact that there was a long match available after the first byte. The advantage of the fingerprinting technique is that it's supposed to be resistant to that sort of thing. Now the tough question is what are the possible options for this patch and which one to pick: a. optimize the encoding technique, so that it can improve results in most cases even if not all. b. have a table-level option or GUC to enable/disable WAL compression. c. use some heuristics to check if the chances of compression are good, and only then perform compression. 1. apply this optimization for tuple size > 128 and < 2000. 2. apply this optimization if the number of modified columns is less than 25% (some threshold number) of the total columns. I think we can get the modified columns from the target entry and use it if triggers haven't changed the tuple. I remember earlier there were concerns that this value can't be trusted completely, but I think using it as a heuristic is not a problem, even if this number is not right in some cases. d. forget about this optimization and reject the patch. I think doing options 'b' and 'c' together can make this optimization usable in the cases where it is actually useful.
I agree that we probably want to do (b), and I suspect we want both a GUC and a reloption, assuming that can be done relatively cleanly. However, I think we should explore (a) more before we explore (c). I think there's a good chance that we can reduce the CPU overhead of this enough to feel comfortable having it enabled by default. If we proceed with heuristics as in approach (c), I don't think that's the end of the world, but I think there will be more corner cases where we lose and have to fiddle things manually. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
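For what it's worth, the pre-checks in option (c) are cheap enough to sketch. The function name and parameters below are hypothetical, and the thresholds (tuple length bounds of 128/2000 bytes, 25% modified columns) are simply the numbers floated in this thread, not values from the patch:

```python
# Hypothetical gate for option (c): decide cheaply whether delta-encoding
# an update is likely to pay off, before spending CPU on it.
# Thresholds are the ones floated in the thread, used here for illustration.

def should_attempt_delta(old_len, new_len, modified_cols, total_cols,
                         min_len=128, max_len=2000, max_col_ratio=0.25):
    # Heuristic 1: only bother for mid-sized tuples; tiny tuples have
    # little WAL to save, huge ones cost too much CPU to scan.
    if not (min_len <= old_len <= max_len and min_len <= new_len <= max_len):
        return False
    # Heuristic 2: if most columns changed, the delta would be mostly
    # literals anyway, so skip compression.
    if total_cols > 0 and modified_cols / total_cols > max_col_ratio:
        return False
    return True
```

As noted above, the modified-column count may not be fully trustworthy when triggers rewrite the tuple, but for a heuristic an occasional wrong answer only costs a missed (or wasted) compression attempt, not correctness.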
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Nov 27, 2013 at 7:35 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Nov 27, 2013 at 12:56 AM, Amit Kapila amit.kapil...@gmail.com wrote: The basic idea is that you use a rolling hash function to divide up the history data into chunks of a given average size. So we scan the history data, compute a rolling hash value at each point, and each time the bottom N bits are zero, we consider that to be the end of a chunk. We enter all the chunks into a hash table. The chunk size will vary, but on the average, given a reasonably well-behaved rolling hash function (the pglz one probably doesn't qualify) it'll happen every 2^N bytes, so perhaps for this purpose we'd choose N to be between 3 and 5. Then, we scan the input we want to compress and divide it into chunks in the same way. Chunks that don't exist in the history data get copied to the output, while those that do get replaced with a reference to their position in the history data. I think this idea looks better than the current one and it will definitely improve some of the cases, but I am not sure we can win in all cases. We have tried one similar idea (reduce the size of the hash and eventually the comparison) by adding every 10th byte to the history lookup table rather than every byte. It improved most cases but not all (the hundred tiny fields, all changed and hundred tiny fields, half changed tests were still slow). Patch and results are at link (refer approach-1): http://www.postgresql.org/message-id/001f01ce1c14$d3af0770$7b0d1650$@kap...@huawei.com What you did there will, I think, tend to miss a lot of compression opportunities. Suppose for example that the old tuple is ABCDEFGHIJKLMNOP and the new tuple is xABCDEFGHIJKLMNOP. After copying one literal byte we'll proceed to copy 9 more, missing the fact that there was a long match available after the first byte.
That is right, but one idea to try that out was to see if we can reduce CPU usage at the cost of compression, but we found that it didn't completely eliminate that problem. The advantage of the fingerprinting technique is that it's supposed to be resistant to that sort of thing. Okay, one question that arises here is whether it can be better in terms of CPU usage compared to when we used a hash function for every 10th byte; if you have a feeling that it can improve the situation, I can try a prototype implementation of the same to check the results. Now the tough question is what are the possible options for this patch and which one to pick: a. optimize the encoding technique, so that it can improve results in most cases even if not all. b. have a table-level option or GUC to enable/disable WAL compression. c. use some heuristics to check if the chances of compression are good, and only then perform compression. 1. apply this optimization for tuple size > 128 and < 2000. 2. apply this optimization if the number of modified columns is less than 25% (some threshold number) of the total columns. I think we can get the modified columns from the target entry and use it if triggers haven't changed the tuple. I remember earlier there were concerns that this value can't be trusted completely, but I think using it as a heuristic is not a problem, even if this number is not right in some cases. d. forget about this optimization and reject the patch. I think doing options 'b' and 'c' together can make this optimization usable in the cases where it is actually useful. I agree that we probably want to do (b), and I suspect we want both a GUC and a reloption, assuming that can be done relatively cleanly. However, I think we should explore (a) more before we explore (c). Sure, but to explore (a), the scope is a bit bigger. We have the below options to explore (a): 1. try to optimize the existing algorithm used in the patch, which we have tried, but of course we can spend some more time to see if anything more can be tried out. 2.
try fingerprint technique as suggested by you above. 3. try some other standard methods like vcdiff, lz4 etc. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tue, Nov 26, 2013 at 8:25 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Jul 22, 2013 at 2:31 PM, Greg Smith g...@2ndquadrant.com wrote: I spent a little time running the tests from Heikki's script under perf. On all three two short fields tests and also on the ten tiny fields, all changed test, we spend about 1% of the CPU time in pglz_delta_encode. I don't see any evidence that it's actually compressing anything at all; it appears to be falling out where we test the input length against the strategy, presumably because the default strategy (which we largely copy here) doesn't try to compress input data of less than 32 bytes. Given that this code isn't actually compressing anything in these cases, I'm a bit confused by Greg's report of substantial gains on the ten tiny fields, all changed test; how can we win if we're not compressing? I think it is mainly due to variation of test data on the Mac; if we see the results on Linux, there is not much difference. Also, the results posted by Heikki or by me at the below links don't show such inconsistency for the ten tiny fields, all changed case. http://www.postgresql.org/message-id/51366323.8070...@vmware.com http://www.postgresql.org/message-id/001f01ce1c14$d3af0770$7b0d1650$@kap...@huawei.com (Refer test_readings.txt) I studied the hundred tiny fields, half nulled test case in some detail. Some thoughts: - There is a comment TODO: It would be nice to behave like the history and the source strings were concatenated, so that you could compress using the new data, too. If we're not already doing that, then how are we managing to compress WAL by more than one-quarter in the hundred tiny fields, all changed case? Algorithm is not doing concatenation of history and source strings; the hash table is formed just with history data, and then later if a match is not found it is added to the history, so this can be the reason for the above result.
It looks to me like the patch IS doing that, and I'm not sure it's a good idea, especially because it's using pglz_hist_add_no_recycle rather than pglz_hist_add: we verify that hlen + slen < 2 * PGLZ_HISTORY_SIZE but that doesn't seem good enough. On the hundred tiny fields, half nulled test case, removing that line reduces compression somewhat but also saves on CPU cycles. - pglz_find_match() is happy to walk all the way down even a really, really long bucket chain. It has some code that reduces good_match each time through, but it fails to produce a non-zero decrement once good_match * good_drop < 100. So if we're searching an enormously deep bucket many times in a row, and there are actually no matches, we'll go flying down the whole linked list every time. I tried mandating a minimum decrease of 1 and didn't observe that it made any difference in the run time of this test case, but it still seems odd. For the record, it's not new behavior with this patch; pglz_compress() has the same issue as things stand today. I wonder if we ought to decrease the good match length by a constant rather than a percentage at each step. - pglz_delta_encode() seems to spend about 50% of its CPU time loading up the history data. It would be nice to find a way to reduce that effort. I thought about maybe only making a history entry for, say, every fourth input position rather than every one, but a quick test seems to suggest that's a big fail, probably because it's too easy to skip over the position where we would have made the right match via some short match. But maybe there's some more sophisticated strategy here that would work better. For example, see: http://en.wikipedia.org/wiki/Rabin_fingerprint The basic idea is that you use a rolling hash function to divide up the history data into chunks of a given average size. So we scan the history data, compute a rolling hash value at each point, and each time the bottom N bits are zero, we consider that to be the end of a chunk.
We enter all the chunks into a hash table. The chunk size will vary, but on the average, given a reasonably well-behaved rolling hash function (the pglz one probably doesn't qualify) it'll happen every 2^N bytes, so perhaps for this purpose we'd choose N to be between 3 and 5. Then, we scan the input we want to compress and divide it into chunks in the same way. Chunks that don't exist in the history data get copied to the output, while those that do get replaced with a reference to their position in the history data. I'm not 100% certain that something like this would be better than trying to leverage the pglz machinery, but I think it might be worth trying. One possible advantage is that you make many fewer hash-table entries, which reduces both the cost of setting up the hash table and the cost of probing it; another is that if you find a hit in the hash table, you needn't search any further: you are done; this is related to the point that, for the most part, the
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Mon, Jul 22, 2013 at 2:31 PM, Greg Smith g...@2ndquadrant.com wrote: On the Mac, the only case that seems to have a slowdown now is hundred tiny fields, half nulled. It would be nice to understand just what is going on with that one. I got some ugly results in two short fields, no change too, along with a couple of other weird results, but I think those were testing procedure issues that can be ignored. The pgbench throttle work I did recently highlights that I can't really make a Mac quiet/consistent for benchmarking very well. Note that I ran all of the Mac tests with assertions on, to try and catch platform specific bugs. The Linux ones used the default build parameters. Amit has been asking me to look at this patch for a while, so I finally did. While I agree that it would be nice to get the CPU overhead down low enough that we can turn this on by default and forget about it, I'm not convinced that it's without value even if we can't. Fundamentally, this patch trades away some CPU in exchange for decreased I/O. The testing thus far is all about whether the CPU overhead can be made trivial, which is a good question to ask, because if the answer is yes, then rather than trading something for something else, we just get something for free. Win! But even if that doesn't pan out, I think the fallback position should not be OK, well, if we can't get decreased I/O for free then forget it but rather OK, if we can't get decreased I/O for free then let's get decreased I/O in exchange for increased CPU usage. I spent a little time running the tests from Heikki's script under perf. On all three two short fields tests and also on the ten tiny fields, all changed test, we spend about 1% of the CPU time in pglz_delta_encode.
I don't see any evidence that it's actually compressing anything at all; it appears to be falling out where we test the input length against the strategy, presumably because the default strategy (which we largely copy here) doesn't try to compress input data of less than 32 bytes. Given that this code isn't actually compressing anything in these cases, I'm a bit confused by Greg's report of substantial gains on the ten tiny fields, all changed test; how can we win if we're not compressing? I studied the hundred tiny fields, half nulled test case in some detail. Some thoughts: - There is a comment TODO: It would be nice to behave like the history and the source strings were concatenated, so that you could compress using the new data, too. If we're not already doing that, then how are we managing to compress WAL by more than one-quarter in the hundred tiny fields, all changed case? It looks to me like the patch IS doing that, and I'm not sure it's a good idea, especially because it's using pglz_hist_add_no_recycle rather than pglz_hist_add: we verify that hlen + slen < 2 * PGLZ_HISTORY_SIZE but that doesn't seem good enough. On the hundred tiny fields, half nulled test case, removing that line reduces compression somewhat but also saves on CPU cycles. - pglz_find_match() is happy to walk all the way down even a really, really long bucket chain. It has some code that reduces good_match each time through, but it fails to produce a non-zero decrement once good_match * good_drop < 100. So if we're searching an enormously deep bucket many times in a row, and there are actually no matches, we'll go flying down the whole linked list every time. I tried mandating a minimum decrease of 1 and didn't observe that it made any difference in the run time of this test case, but it still seems odd. For the record, it's not new behavior with this patch; pglz_compress() has the same issue as things stand today.
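The zero-decrement behavior is easy to reproduce in isolation. A minimal sketch of the loosening loop, using made-up starting values rather than pglz's actual strategy parameters: with integer arithmetic, `good_match -= good_match * good_drop / 100` stops decrementing as soon as good_match * good_drop < 100, while a minimum decrement of 1 (or a constant step) guarantees the value keeps falling.

```python
# Sketch of pglz-style good_match loosening (values are illustrative,
# not pglz's actual strategy parameters).

def loosen(good_match, good_drop, min_dec=0):
    """Repeatedly apply the percentage decrement; return (steps taken,
    value where the decrement hit zero, or 0 if it bottomed out)."""
    steps = 0
    while True:
        dec = max(good_match * good_drop // 100, min_dec)  # integer division
        if dec == 0:
            return steps, good_match   # stalls here forever
        good_match -= dec
        steps += 1
        if good_match <= 0:
            return steps, 0
```

Starting at good_match = 32 with good_drop = 10, the percentage rule stalls at 9 (since 9 * 10 // 100 == 0), so repeated deep-bucket searches keep the same threshold; with min_dec = 1 the value reaches 0.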
I wonder if we ought to decrease the good match length by a constant rather than a percentage at each step. - pglz_delta_encode() seems to spend about 50% of its CPU time loading up the history data. It would be nice to find a way to reduce that effort. I thought about maybe only making a history entry for, say, every fourth input position rather than every one, but a quick test seems to suggest that's a big fail, probably because it's too easy to skip over the position where we would have made the right match via some short match. But maybe there's some more sophisticated strategy here that would work better. For example, see: http://en.wikipedia.org/wiki/Rabin_fingerprint The basic idea is that you use a rolling hash function to divide up the history data into chunks of a given average size. So we scan the history data, compute a rolling hash value at each point, and each time the bottom N bits are zero, we consider that to be the end of a chunk. We enter all the chunks into a hash table. The chunk size will vary, but on the average, given a reasonably well-behaved rolling hash function (the pglz one probably doesn't qualify) it'll happen every 2^N bytes, so perhaps for this purpose we'd choose N to be between 3 and 5. Then, we scan the input we want to compress
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tuesday, July 23, 2013 12:02 AM Greg Smith wrote: The v3 patch applies perfectly here now. Attached is a spreadsheet with test results from two platforms, a Mac laptop and a Linux server. I used systems with high disk speed because that seemed like a worst case for this improvement. The actual improvement for shrinking WAL should be even better on a system with slower disks. You are absolutely right. To mimic it on our system, by configuring RAMFS for the database, it shows similar results. There are enough problems with the hundred tiny fields results that I think this is not quite ready for commit yet. More comments on that below. This seems close though, close enough that I am planning to follow up to see if the slow disk results are better. Thanks for going the extra mile to try for slower disks. Reviewing the wal-update-testsuite.sh test program, I think the only case missing that would be useful to add is ten tiny fields, one changed. I think that one is interesting to highlight because it's what benchmark programs like pgbench do very often. The GUC added by the program looks like this:

postgres=# show wal_update_compression_ratio;
 wal_update_compression_ratio
------------------------------
 25

Is it possible to add a setting here that disables the feature altogether? Yes, it can be done in the below 2 ways: 1. Provide a new configuration parameter (wal_update_compression) to turn this feature on/off. 2. At the table level, the user can be given an option to configure it. That always makes it easier to consider a commit, knowing people can roll back the change if it makes performance worse. That would make performance testing easier too. wal-update-testsuite.sh takes as long as 13 minutes; it's long enough that I'd like the easier-to-automate comparison that GUC disabling would add. If that's not practical to do given the intrusiveness of the code, it's not really necessary. I haven't looked at the change enough to be sure how hard this is.
On the Mac, the only case that seems to have a slowdown now is hundred tiny fields, half nulled. It would be nice to understand just what is going on with that one. I got some ugly results in two short fields, no change too, along with a couple of other weird results, but I think those were testing procedure issues that can be ignored. The pgbench throttle work I did recently highlights that I can't really make a Mac quiet/consistent for benchmarking very well. Note that I ran all of the Mac tests with assertions on, to try and catch platform specific bugs. The Linux ones used the default build parameters. On Linux, hundred tiny fields, half nulled was also by far the worst performing one, with a 30% increase in duration despite the 14% drop in WAL. Exactly what's going on there really needs to be investigated before this seems safe to commit. All of the hundred tiny fields cases seem pretty bad on Linux, with the rest of them running about an 11% duration increase. The main benefit of this patch is reducing WAL to improve I/O time, but I think for faster I/O systems the calculation to reduce WAL has overhead. I will check how to optimize that calculation, but I think this feature should be considered with a configuration knob, as it can improve many cases. With Regards, Amit Kapila.
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tuesday, July 23, 2013 12:27 AM Andres Freund wrote: On 2013-07-19 10:40:01 +0530, Hari Babu wrote: On Friday, July 19, 2013 4:11 AM Greg Smith wrote: On 7/9/13 12:09 AM, Amit Kapila wrote: I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person. The testcase used is posted at the below link: http://www.postgresql.org/message-id/51366323.8070...@vmware.com That seems easy enough to do here; Heikki's test script is excellent. The latest patch Hari posted on July 2 has one hunk that doesn't apply anymore now. The head code change from Heikki is correct. During the patch rebase to the latest PG LZ optimization code, the above code change was missed. Apart from the above change, some more changes were done in the patch; those are. FWIW I don't like this approach very much: * I'd be very surprised if this doesn't make WAL replay of update heavy workloads slower by at least a factor of 2. Yes, if you just consider the cost of replay, but it involves other operations as well, like, for the standby case, transfer of WAL, write of WAL, read from WAL and then apply. So among them most operations will benefit from the reduced WAL size, except apply, where you need to decode. * It makes data recovery from WAL *noticeably* harder since data corruption now is carried forwards and you need the old data to decode new data. This is one of the reasons why this optimization is done only when the new row goes on the same page. * It makes changeset extraction either more expensive or it would have to be disabled there. I think, if there is any such implication, we can probably have the option of disabling it. I think my primary issue is that philosophically/architecturally I am of the opinion that a wal record should make sense on its own without depending on heap data. And this patch loses that. Is the main worry about corruption getting propagated? With Regards, Amit Kapila.
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 2013-07-23 18:59:11 +0530, Amit Kapila wrote: * I'd be very surprised if this doesn't make WAL replay of update-heavy workloads slower by at least a factor of 2. Yes, if you just consider the cost of replay, but it involves other operations as well, like (for the standby case) transfer of WAL, write of WAL, read from WAL, and then apply. So among them, most operations will benefit from reduced WAL size, except apply, where you need to decode. I still think it's rather unlikely that they offset those. I've seen WAL replay be a major bottleneck more than once... * It makes data recovery from WAL *noticeably* harder since data corruption now is carried forwards and you need the old data to decode new data. This is one of the reasons why this optimization is done only when the new row goes in the same page. That doesn't help all that much. It somewhat eases recovering data if full_page_writes is on, but it's really hard to stitch together all changes if the corruption occurred within a 1h-long checkpoint... * It makes changeset extraction either more expensive or it would have to be disabled there. I think, if there is any such implication, we can probably have the option to disable it. That can just be done on wal_level = logical, that's not the problem. It's certainly not without precedent that we have wal_level-dependent optimizations. I think my primary issue is that philosophically/architecturally I am of the opinion that a WAL record should make sense on its own without depending on heap data. And this patch loses that. Is the main worry about corruption getting propagated? Not really. It feels wrong to me architecturally. That's subjective, I know. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tuesday, July 23, 2013 7:06 PM Andres Freund wrote: On 2013-07-23 18:59:11 +0530, Amit Kapila wrote: * I'd be very surprised if this doesn't make WAL replay of update-heavy workloads slower by at least a factor of 2. Yes, if you just consider the cost of replay, but it involves other operations as well, like (for the standby case) transfer of WAL, write of WAL, read from WAL, and then apply. So among them, most operations will benefit from reduced WAL size, except apply, where you need to decode. I still think it's rather unlikely that they offset those. I've seen WAL replay be a major bottleneck more than once... * It makes data recovery from WAL *noticeably* harder since data corruption now is carried forwards and you need the old data to decode new data. This is one of the reasons why this optimization is done only when the new row goes in the same page. That doesn't help all that much. It somewhat eases recovering data if full_page_writes is on, but it's really hard to stitch together all changes if the corruption occurred within a 1h-long checkpoint... I think once a record is corrupted on a page, the user has to reconstruct that page; it might be difficult to reconstruct just a record. And this optimization will not carry forward any corruption from one page to another. * It makes changeset extraction either more expensive or it would have to be disabled there. I think, if there is any such implication, we can probably have the option to disable it. That can just be done on wal_level = logical, that's not the problem. It's certainly not without precedent that we have wal_level-dependent optimizations. With Regards, Amit Kapila.
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
The v3 patch applies perfectly here now. Attached is a spreadsheet with test results from two platforms, a Mac laptop and a Linux server. I used systems with high disk speed because that seemed like a worst case for this improvement. The actual improvement for shrinking WAL should be even better on a system with slower disks. There are enough problems with the hundred tiny fields results that I think this is not quite ready for commit yet. More comments on that below. This seems close though, close enough that I am planning to follow up to see if the slow disk results are better. Reviewing the wal-update-testsuite.sh test program, I think the only case missing that would be useful to add is ten tiny fields, one changed. I think that one is interesting to highlight because it's what benchmark programs like pgbench do very often. The GUC added by the program looks like this:

postgres=# show wal_update_compression_ratio;
 wal_update_compression_ratio
------------------------------
 25

Is it possible to add a setting here that disables the feature altogether? That always makes it easier to consider a commit, knowing people can roll back the change if it makes performance worse. That would make performance testing easier too. wal-update-testsuite.sh takes as long as 13 minutes; it's long enough that I'd like the easier-to-automate comparison that GUC disabling adds. If that's not practical to do given the intrusiveness of the code, it's not really necessary. I haven't looked at the change enough to be sure how hard this is. On the Mac, the only case that seems to have a slowdown now is hundred tiny fields, half nulled. It would be nice to understand just what is going on with that one. I got some ugly results in two short fields, no change too, along with a couple of other weird results, but I think those were testing procedure issues that can be ignored. The pgbench throttle work I did recently highlights that I can't really make a Mac quiet/consistent for benchmarking very well.
Note that I ran all of the Mac tests with assertions on, to try and catch platform-specific bugs. The Linux ones used the default build parameters. On Linux, hundred tiny fields, half nulled was also by far the worst performing one, with a 30% increase in duration despite the 14% drop in WAL. Exactly what's going on there really needs to be investigated before this seems safe to commit. All of the hundred tiny fields cases seem pretty bad on Linux, with the rest of them running about an 11% duration increase. This doesn't seem ready to commit for this CF, but the number of problem cases is getting pretty small now. Now that I've gotten more familiar with the test programs and the feature, I can run more performance tests on this at any time really. If updates addressing the trouble cases are ready from Amit or Hari before the next CF, send them out and I can look at them without waiting until that one starts. This is a very promising looking performance feature. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com WAL-lz-v3.xls Description: MS-Excel spreadsheet
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 2013-07-19 10:40:01 +0530, Hari Babu wrote: On Friday, July 19, 2013 4:11 AM Greg Smith wrote: On 7/9/13 12:09 AM, Amit Kapila wrote: I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person. The testcase used is posted at the below link: http://www.postgresql.org/message-id/51366323.8070...@vmware.com That seems easy enough to do here; Heikki's test script is excellent. The latest patch Hari posted on July 2 has one hunk that doesn't apply anymore now. The HEAD code change from Heikki is correct. During the patch rebase to the latest PG LZ optimization code, the above code change was missed. Apart from the above change, some more changes were made in the patch; those are: FWIW I don't like this approach very much: * I'd be very surprised if this doesn't make WAL replay of update-heavy workloads slower by at least a factor of 2. * It makes data recovery from WAL *noticeably* harder since data corruption now is carried forwards and you need the old data to decode new data. * It makes changeset extraction either more expensive or it would have to be disabled there. I think my primary issue is that philosophically/architecturally I am of the opinion that a WAL record should make sense on its own without depending on heap data. And this patch loses that. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 7/22/13 2:57 PM, Andres Freund wrote: * I'd be very surprised if this doesn't make WAL replay of update-heavy workloads slower by at least a factor of 2. I was thinking about what a benchmark of WAL replay would look like last year. I don't think that data is captured very well yet, and it should be. My idea was to break the benchmark into two pieces. One would take a base backup, then run a series of tests and archive the resulting WAL. I doubt you can make a useful benchmark here without a usefully populated database; that's why the base backup step is needed. The first useful result then is to measure how long commit/archiving took and the WAL volume, which is what's done by the test harness for this program. Then the resulting backup would be set up for replay. Tarring up the backup and WAL archive could even give you a repeatable test set for cases where only the replay side changes. Then the main number that's useful, total replay time, would be measured. The main thing I wanted this for wasn't for code changes; it was to benchmark configuration changes. I'd like to be able to answer questions like which I/O scheduler is best for a standby in a way that has real test data behind it. The same approach should be useful for answering your concerns about the replay performance impact of this change too, though. * It makes changeset extraction either more expensive or it would have to be disabled there. That argues that, if this is committed at all, the ability to turn it off that I was asking about would be necessary. It sounds like this *could* work like how minimal WAL archiving levels allow optimizations that are disabled at higher ones--like the COPY into a truncated/new table cheat. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 7/9/13 12:09 AM, Amit Kapila wrote: I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person. The testcase used is posted at the below link: http://www.postgresql.org/message-id/51366323.8070...@vmware.com That seems easy enough to do here; Heikki's test script is excellent. The latest patch Hari posted on July 2 has one hunk that doesn't apply anymore now. Inside src/backend/utils/adt/pg_lzcompress.c the patch tries to change this code:

- if (hent)
+ if (hentno != INVALID_ENTRY)

But that line looks like this now:

if (hent != INVALID_ENTRY_PTR)

Definitions of those:

#define INVALID_ENTRY 0
#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])

I'm not sure if different error handling may be needed here now due to the commit that changed this, or if the patch wasn't referring to the right type of error originally. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
Greg, * Greg Smith (g...@2ndquadrant.com) wrote: That seems easy enough to do here, Heikki's test script is excellent. The latest patch Hari posted on July 2 has one hunk that doesn't apply anymore now. Inside src/backend/utils/adt/pg_lzcompress.c the patch tries to change this code:

- if (hent)
+ if (hentno != INVALID_ENTRY)

hentno certainly doesn't make much sense here- it's only used at the top of the function to keep things a bit cleaner when extracting the address into hent from hist_entries:

hentno = hstart[pglz_hist_idx(input, end, mask)];
hent = &hist_entries[hentno];

Indeed, as the referenced conditional is inside the following loop:

while (hent != INVALID_ENTRY_PTR)

and, since hentno == 0 implies hent == INVALID_ENTRY_PTR, the conditional would never fail (which is what was happening prior to Heikki committing the fix for this, changing the conditional to what is below). But that line looks like this now:

if (hent != INVALID_ENTRY_PTR)

Right, this is correct- it's useful to check the new value for hent after it's been updated by:

hent = hent->next;

and see if it's possible to drop out early. I'm not sure if different error handling may be needed here now due to the commit that changed this, or if the patch wasn't referring to the right type of error originally. I've not looked at anything regarding this beyond this email, but I'm pretty confident that the change Heikki committed was the correct one. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Friday, July 19, 2013 4:11 AM Greg Smith wrote: On 7/9/13 12:09 AM, Amit Kapila wrote: I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person. The testcase used is posted at the below link: http://www.postgresql.org/message-id/51366323.8070...@vmware.com That seems easy enough to do here; Heikki's test script is excellent. The latest patch Hari posted on July 2 has one hunk that doesn't apply anymore now. The HEAD code change from Heikki is correct. During the patch rebase to the latest PG LZ optimization code, the above code change was missed. Apart from the above change, some more changes were made in the patch; those are: 1. Corrected some comments in the code. 2. Added a validity check, as source and history length combined cannot be more than or equal to 8192. Thanks for the review; please find the latest patch attached in the mail. Regards, Hari babu. pglz-with-micro-optimization-compress-using-newdata-3.patch Description: Binary data
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
The only environment I have available at the moment is a virtual box. That's probably not going to be very helpful for performance testing. __ *Mike Blackwell | Technical Analyst, Distribution Services/Rollout Management | RR Donnelley* 1750 Wallace Ave | St Charles, IL 60174-3401 Office: 630.313.7818 mike.blackw...@rrd.com http://www.rrdonnelley.com http://www.rrdonnelley.com/ * mike.blackw...@rrd.com* On Mon, Jul 8, 2013 at 11:09 PM, Amit Kapila amit.kap...@huawei.com wrote: On Tuesday, July 09, 2013 2:52 AM Mike Blackwell wrote: I can't comment on further direction for the patch, but since it was marked as Needs Review in the CF app I took a quick look at it. Thanks for looking into it. Last time, Heikki found test scenarios where the original patch was not performing well. He also proposed a different approach for WAL encoding and sent the modified patch, which has a comparatively smaller negative performance impact, and asked to check whether the patch can reduce the performance impact for the scenarios mentioned by him. After that I found that with some modifications (use new tuple data for encoding) in his approach, it eliminates the negative performance impact and gives WAL reduction in more cases. I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person. The testcase used is posted at the below link: http://www.postgresql.org/message-id/51366323.8070...@vmware.com It patches and compiles clean against the current Git HEAD, and 'make check' runs successfully. Does it need documentation for the GUC variable 'wal_update_compression_ratio'? This variable has been added to test the patch for different compression ratios during development testing. It was not decided to have this variable permanently as part of this patch, so currently there is no documentation for it. However, if the decision comes out to be that it needs to be part of the patch, then documentation for the same can be updated.
With Regards, Amit Kapila.
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wednesday, July 10, 2013 6:32 AM Mike Blackwell wrote: The only environment I have available at the moment is a virtual box. That's probably not going to be very helpful for performance testing. It's okay. Anyway, thanks for doing the basic test for the patch. With Regards, Amit Kapila.
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
I can't comment on further direction for the patch, but since it was marked as Needs Review in the CF app I took a quick look at it. It patches and compiles clean against the current Git HEAD, and 'make check' runs successfully. Does it need documentation for the GUC variable 'wal_update_compression_ratio'? __ *Mike Blackwell | Technical Analyst, Distribution Services/Rollout Management | RR Donnelley* 1750 Wallace Ave | St Charles, IL 60174-3401 Office: 630.313.7818 mike.blackw...@rrd.com http://www.rrdonnelley.com http://www.rrdonnelley.com/ * mike.blackw...@rrd.com* On Tue, Jul 2, 2013 at 2:26 AM, Hari Babu haribabu.ko...@huawei.com wrote: On Friday, June 07, 2013 5:07 PM Amit Kapila wrote: On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote: On 04.03.2013 06:39, Amit Kapila wrote: On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote: On 02/05/2013 11:53 PM, Amit Kapila wrote: Performance data for the patch is attached with this mail. Conclusions from the readings (these are the same as my previous patch): The attached patch also just adds overhead in most cases, but the overhead is much smaller in the worst case. I think that's the right tradeoff here - we want to avoid scenarios where performance falls off the cliff. That said, if you usually just get a slowdown, we certainly can't make this the default, and if we can't turn it on by default, this probably just isn't worth it. The attached patch contains the variable-hash-size changes I posted in the "Optimizing pglz compressor" thread. But in the delta encoding function, it goes further than that, and contains some further micro-optimizations: the hash is calculated in a rolling fashion, and it uses a specialized version of the pglz_hist_add macro that knows that the input can't exceed 4096 bytes. Those changes shaved off some cycles, but you could probably do more. One idea is to only add every 10 bytes or so to the history lookup table; that would sacrifice some compressibility for speed.
If you could squeeze the pglz_delta_encode function to be cheap enough that we could enable this by default, this would be a pretty cool patch. Or at least, the overhead in the cases where you get no compression needs to be brought down, to about 2-5% at most, I think. If it can't be done easily, I feel that this probably needs to be dropped. After trying some more on optimizing pglz_delta_encode(), I found that if we use new data also in the history, then the results of compression and CPU utilization are much better. In addition to the PG LZ micro-optimization changes, the following changes were made in the modified patch: 1. The unmatched new data is also added to the history, which can be referenced later. 2. To incorporate this change in the LZ algorithm, one extra control bit is needed to indicate whether data is from the old or new tuple. The patch is rebased to use the new PG LZ algorithm optimization changes which got committed recently.

Performance Data - Head code:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1232911016 | 35.1784930229187
 two short fields, one changed           |    1240322016 | 35.0436308383942
 two short fields, both changed          |    1235318352 | 35.4989421367645
 one short and one long field, no change |    1042332336 | 23.4457180500031
 ten tiny fields, all changed            |    1395194136 | 41.9023628234863
 hundred tiny fields, first 10 changed   |     626725984 | 21.2999589443207
 hundred tiny fields, all changed        |     621899224 | 21.6676609516144
 hundred tiny fields, half changed       |     623998272 | 21.2745981216431
 hundred tiny fields, half nulled        |     557714088 | 19.5902800559998

pglz-with-micro-optimization-compress-using-newdata-2:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1232903384 | 35.0115969181061
 two short fields, one changed           |    1232906960 | 34.759307861
 two short fields, both changed          |    1232903520 | 35.7665238380432
 one short and one long field, no change |     649647992 | 19.4671010971069
 ten tiny fields, all changed            |    1314957136 | 39.9727990627289
 hundred tiny fields, first 10 changed   |     458684024 | 17.8197758197784
 hundred tiny fields, all changed        |     461028464 | 17.3083391189575
 hundred tiny fields, half changed       |     456528696 | 17.1769199371338
 hundred tiny fields, half nulled        |     480548936 | 18.81720495224

Observation
1. It yielded compression in more cases (refer to all cases of hundred tiny fields).
2.
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 07/08/2013 02:21 PM, Mike Blackwell wrote: I can't comment on further direction for the patch, but since it was marked as Needs Review in the CF app I took a quick look at it. It patches and compiles clean against the current Git HEAD, and 'make check' runs successfully. Does it need documentation for the GUC variable 'wal_update_compression_ratio'? Yes. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Tuesday, July 09, 2013 2:52 AM Mike Blackwell wrote: I can't comment on further direction for the patch, but since it was marked as Needs Review in the CF app I took a quick look at it. Thanks for looking into it. Last time, Heikki found test scenarios where the original patch was not performing well. He also proposed a different approach for WAL encoding and sent the modified patch, which has a comparatively smaller negative performance impact, and asked to check whether the patch can reduce the performance impact for the scenarios mentioned by him. After that I found that with some modifications (use new tuple data for encoding) in his approach, it eliminates the negative performance impact and gives WAL reduction in more cases. I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person. The testcase used is posted at the below link: http://www.postgresql.org/message-id/51366323.8070...@vmware.com It patches and compiles clean against the current Git HEAD, and 'make check' runs successfully. Does it need documentation for the GUC variable 'wal_update_compression_ratio'? This variable has been added to test the patch for different compression ratios during development testing. It was not decided to have this variable permanently as part of this patch, so currently there is no documentation for it. However, if the decision comes out to be that it needs to be part of the patch, then documentation for the same can be updated. With Regards, Amit Kapila.
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 2013-03-05 23:26:59 +0200, Heikki Linnakangas wrote: On 04.03.2013 06:39, Amit Kapila wrote: The stats look fairly sane. I'm a little concerned about the apparent trend of falling TPS in the patched vs original tests for the 1-client test as record size increases, but it's only 0.0%-0.2%-0.4%, and the 0.4% case made other config changes too. Nonetheless, it might be wise to check with really big records and see if the trend continues. For bigger size (~2000) records, it goes into toast, for which we don't do this optimization. This optimization is mainly for medium size records. I've been investigating the pglz option further, and doing performance comparisons of the pglz approach and this patch. I'll begin with some numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1245525360 | 9.94613695144653
 two short fields, one changed           |    1245536528 | 10.146910905838
 two short fields, both changed          |    1245523160 | 11.2332470417023
 one short and one long field, no change |    1054926504 | 5.90477800369263
 ten tiny fields, all changed            |    1411774608 | 13.4536008834839
 hundred tiny fields, all changed        |     635739680 | 7.57448387145996
 hundred tiny fields, half changed       |     636930560 | 7.56888699531555
 hundred tiny fields, half nulled        |     573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1249722112 | 13.0558869838715
 two short fields, one changed           |    1246145408 | 12.9947438240051
 two short fields, both changed          |    1245951056 | 13.0262880325317
 one short and one long field, no change |     678480664 | 5.70031690597534
 ten tiny fields, all changed            |    1328873920 | 20.0167419910431
 hundred tiny fields, all changed        |     638149416 | 14.4236788749695
 hundred tiny fields, half changed       |     635560504 | 14.8770561218262
 hundred tiny fields, half nulled        |     558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1245519008 | 11.6702048778534
 two short fields, one changed           |    1245756904 | 11.3233819007874
 two short fields, both changed          |    1249711088 | 11.6836447715759
 one short and one long field, no change |     664741392 | 6.44810795783997
 ten tiny fields, all changed            |    1328085568 | 13.9679481983185
 hundred tiny fields, all changed        |     635974088 | 9.15514206886292
 hundred tiny fields, half changed       |     636309040 | 9.13769292831421
 hundred tiny fields, half nulled        |     496396448 | 8.77351498603821

In each test, a table is created with a large number of identical rows, and fillfactor=50. Then a full-table UPDATE is performed, and the UPDATE is timed. Duration is the time spent in the UPDATE (lower is better), and wal_generated is the amount of WAL generated by the updates (lower is better). The summary is that Amit's patch is a small win in terms of CPU usage, in the best case where the table has few columns, with one large column that is not updated. In all other cases it just adds overhead. In terms of WAL size, you get a big gain in the same best-case scenario. Attached is a different version of this patch, which uses the pglz algorithm to spot the similarities between the old and new tuple, instead of having explicit knowledge of where the column boundaries are. This has the advantage that it will spot similarities, and be able to compress, in more cases. For example, you can see a reduction in WAL size in the hundred tiny fields, half nulled test case above. The attached patch also just adds overhead in most cases, but the overhead is much smaller in the worst case. I think that's the right tradeoff here - we want to avoid scenarios where performance falls off the cliff. That said, if you usually just get a slowdown, we certainly can't make this the default, and if we can't turn it on by default, this probably just isn't worth it. The attached patch contains the variable-hash-size changes I posted in the "Optimizing pglz compressor" thread. But in the delta encoding function, it goes further than that, and contains some further micro-optimizations: the hash is calculated in a rolling fashion, and it uses a specialized version of the pglz_hist_add macro that knows that the input can't exceed 4096 bytes. Those changes
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote: On 04.03.2013 06:39, Amit Kapila wrote: On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote: On 02/05/2013 11:53 PM, Amit Kapila wrote: Performance data for the patch is attached with this mail. Conclusions from the readings (these are the same as my previous patch): I've been investigating the pglz option further, and doing performance comparisons of the pglz approach and this patch. I'll begin with some numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1245525360 | 9.94613695144653
 two short fields, one changed           |    1245536528 | 10.146910905838
 two short fields, both changed          |    1245523160 | 11.2332470417023
 one short and one long field, no change |    1054926504 | 5.90477800369263
 ten tiny fields, all changed            |    1411774608 | 13.4536008834839
 hundred tiny fields, all changed        |     635739680 | 7.57448387145996
 hundred tiny fields, half changed       |     636930560 | 7.56888699531555
 hundred tiny fields, half nulled        |     573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1249722112 | 13.0558869838715
 two short fields, one changed           |    1246145408 | 12.9947438240051
 two short fields, both changed          |    1245951056 | 13.0262880325317
 one short and one long field, no change |     678480664 | 5.70031690597534
 ten tiny fields, all changed            |    1328873920 | 20.0167419910431
 hundred tiny fields, all changed        |     638149416 | 14.4236788749695
 hundred tiny fields, half changed       |     635560504 | 14.8770561218262
 hundred tiny fields, half nulled        |     558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1245519008 | 11.6702048778534
 two short fields, one changed           |    1245756904 | 11.3233819007874
 two short fields, both changed          |    1249711088 | 11.6836447715759
 one short and one long field, no change |     664741392 | 6.44810795783997
 ten tiny fields, all changed            |    1328085568 | 13.9679481983185
 hundred tiny fields, all changed        |     635974088 | 9.15514206886292
 hundred tiny fields, half changed       |     636309040 | 9.13769292831421
 hundred tiny fields, half nulled        |     496396448 | 8.77351498603821

For some of the tests, it doesn't even execute the main part of compression/encoding. The reason is that the length of the tuple is less than the strategy min length, so it returns from the below check in function pglz_delta_encode():

if (strategy->match_size_good <= 0 ||
    slen < strategy->min_input_size ||
    slen > strategy->max_input_size)
    return false;

The tests for which it doesn't execute encoding are below: two short fields, no change; two short fields, one changed; two short fields, both changed; ten tiny fields, all changed. For the above cases, the reason for the difference in timings between both approaches and the original could be that this check is done after some processing. So I think if we check the length in log_heap_update, then there should not be a timing difference for the above test scenarios. I can check that once. This optimization helps only when the tuple is of length 128~200 bytes and up to 1800 bytes (till it turns to toast); otherwise it could result in overhead without any major WAL reduction. In fact, I think in one of my initial patches there is a check: if the length of the tuple is greater than 128 bytes, then perform the optimization. I shall try to run both patches for cases when the length of the tuple is 128~200 bytes, as this optimization has benefits in those cases. In each test, a table is created with a large number of identical rows, and fillfactor=50. Then a full-table UPDATE is performed, and the UPDATE is timed. Duration is the time spent in the UPDATE (lower is better), and wal_generated is the amount of WAL generated by the updates (lower is better). The summary is that Amit's patch is a small win in terms of CPU usage, in the best case where the table has few columns, with one
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail. Conclusions from the readings (these are the same as for my previous patch):

1. With original pgbench there is a max 7% WAL reduction with not much performance difference.
2. With 250 record pgbench there is a max WAL reduction of 35% with not much performance difference.
3. With 500 and above record size in pgbench there is an improvement in both performance and WAL reduction. As the record size increases, there is a gain in performance and the WAL size is reduced as well.

Performance data for synchronous_commit = on is in progress; I shall post it once it is done. I am expecting it to be the same as the previous data.

Please find the performance readings for synchronous_commit = on. Each run is taken for 20 min. Conclusions from the readings with synchronous commit on mode:

1. With original pgbench there is a max 2% WAL reduction with not much performance difference.
2. With 500 record pgbench there is a max WAL reduction of 3% with not much performance difference.
3. With 1800 record size in pgbench there is both an improvement in performance (approx 3%) and a WAL reduction (44%). As the record size increases, there is a very good reduction in WAL size.

The stats look fairly sane. I'm a little concerned about the apparent trend of falling TPS in the patched vs. original tests for the 1-client test as record size increases, but it's only 0.0%-0.2%-0.4%, and the 0.4% case made other config changes too. Nonetheless, it might be wise to check with really big records and see if the trend continues.

-- 
Craig Ringer   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail. Conclusions from the readings (these are the same as for my previous patch):

1. With original pgbench there is a max 7% WAL reduction with not much performance difference.
2. With 250 record pgbench there is a max WAL reduction of 35% with not much performance difference.
3. With 500 and above record size in pgbench there is an improvement in both performance and WAL reduction. As the record size increases, there is a gain in performance and the WAL size is reduced as well.

Performance data for synchronous_commit = on is in progress; I shall post it once it is done. I am expecting it to be the same as the previous data.

Please find the performance readings for synchronous_commit = on. Each run is taken for 20 min. Conclusions from the readings with synchronous commit on mode:

1. With original pgbench there is a max 2% WAL reduction with not much performance difference.
2. With 500 record pgbench there is a max WAL reduction of 3% with not much performance difference.
3. With 1800 record size in pgbench there is both an improvement in performance (approx 3%) and a WAL reduction (44%). As the record size increases, there is a very good reduction in WAL size.

The stats look fairly sane. I'm a little concerned about the apparent trend of falling TPS in the patched vs. original tests for the 1-client test as record size increases, but it's only 0.0%-0.2%-0.4%, and the 0.4% case made other config changes too. Nonetheless, it might be wise to check with really big records and see if the trend continues.

For bigger (~2000 byte) records, the data goes into TOAST, for which we don't do this optimization. This optimization is mainly for medium-size records.

With Regards,
Amit Kapila.
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
On 29.01.2013 11:58, Amit Kapila wrote:
Can there be another way in which the current patch code can be made better, so that we don't need to change the encoding approach? I have a feeling that the new approach might not be equally good performance-wise.

The point is that I don't want heap_delta_encode() to know the internals of pglz compression. You could probably make my patch behave more like yours by also passing an array of offsets in the new tuple to check, and only checking for matches at those offsets.

I think it makes sense, because if we have offsets of both the new and old tuple, we can internally use memcmp to compare columns and use the same algorithm for encoding. I will change the patch according to this suggestion.

I have modified the patch as per the above suggestion. Apart from passing new and old tuple offsets, I have passed the bitmap length also, as we need to copy the bitmap of the new tuple as-is into the encoded WAL tuple. Please see if such an API design is okay. I shall update the README and send the performance/WAL reduction data for the modified patch tomorrow.

The updated patch including comments and README is attached with this mail. This patch contains exactly the same design behavior as the previous one. It takes care of Heikki's API design suggestion. The performance data is similar; as it is not complete, I shall send it tomorrow.

With Regards,
Amit Kapila.

wal_update_changes_v10.patch
Description: Binary data