Re: [HACKERS] pgbench stuck with 100% cpu usage
On Fri, Sep 29, 2017 at 1:03 AM, Fabien COELHO <coe...@cri.ensmp.fr> wrote:

>>> The commit that introduced this code is 12788ae49e1933f463bc. So I am
>>> copying Heikki.
>>
>> AFAICR the commit was mostly a heavy restructuring of previous
>> unmaintainable spaghetti code. I'm not sure the problem was not there
>> before under one form or another.
>>
>> I agree that it should error out & stop the client in this case at least.
>
> Here is a probable "fix", which does what the comment said should be done.

Looks good to me.

> I could not trigger an infinite loop with various kill -9 and other quick
> stops. Could you try it on your side?

Ok, I will try. But TBH I did not try to reproduce that either and I am not
sure if I can. I discovered the problem when my laptop's battery started
draining out much more quickly. Having seen the problem, it seems very
obvious though.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pgbench stuck with 100% cpu usage
On Fri, Sep 29, 2017 at 12:22 AM, Fabien COELHO <coe...@cri.ensmp.fr> wrote:

>> While running some tests, I encountered a situation where pgbench gets
>> stuck in an infinite loop, consuming 100% cpu. The setup was:
>>
>> - Start postgres server from the master branch
>> - Initialise pgbench
>> - Run pgbench -c 10 -T 100
>> - Stop postgres with -m immediate
>
> That is a strange test to run, but it would be better if the behavior was
> not that one.

Well, I think it's a very legitimate test, not for testing performance, but
for testing crash recovery, and I use it very often. This particular test
was run to catch another bug which will be reported separately.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] pgbench stuck with 100% cpu usage
Hello,

While running some tests, I encountered a situation where pgbench gets
stuck in an infinite loop, consuming 100% cpu. The setup was:

- Start postgres server from the master branch
- Initialise pgbench
- Run pgbench -c 10 -T 100
- Stop postgres with -m immediate

Now it seems that pgbench gets stuck and its state machine does not
advance. Attaching it to a debugger, I saw that one of the clients remains
stuck in this loop forever.

    if (command->type == SQL_COMMAND)
    {
        if (!sendCommand(st, command))
        {
            /*
             * Failed. Stay in CSTATE_START_COMMAND state, to
             * retry. ??? What the point or retrying? Should
             * rather abort?
             */
            return;
        }
        else
            st->state = CSTATE_WAIT_RESULT;
    }

sendCommand() returns false because the underlying connection is bad and
PQsendQuery() returns 0. Reading the comment, it seems that the author
thought about this situation but decided to retry instead of aborting. I am
not sure what the rationale for that decision was; maybe to deal with
transient failures?

The commit that introduced this code is 12788ae49e1933f463bc, so I am
copying Heikki.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
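[Editor's note] To make the discussed fix concrete, here is a minimal,
self-contained model of the state transition in question. The state names
mirror pgbench's client state machine, but advance_start_command, send_ok
and conn_bad are invented for this sketch (in pgbench itself the
dead-connection test would be along the lines of PQstatus(st->con) ==
CONNECTION_BAD); it illustrates the idea of aborting the client instead of
retrying forever, and is not the actual patch.

```c
#include <stdbool.h>

/* Simplified subset of pgbench's client states */
typedef enum
{
    CSTATE_START_COMMAND,
    CSTATE_WAIT_RESULT,
    CSTATE_ABORTED
} ClientState;

/*
 * Model of the CSTATE_START_COMMAND step.  send_ok stands for the return
 * value of sendCommand(); conn_bad stands for a dead libpq connection.
 * Instead of silently staying in CSTATE_START_COMMAND forever (the 100%
 * cpu loop), a dead connection aborts the client; only a transient send
 * failure on a live connection is retried.
 */
ClientState
advance_start_command(bool send_ok, bool conn_bad)
{
    if (send_ok)
        return CSTATE_WAIT_RESULT;
    if (conn_bad)
        return CSTATE_ABORTED;      /* error out and stop the client */
    return CSTATE_START_COMMAND;    /* transient failure: retry */
}
```

With this shape, a permanently bad connection can no longer spin in
CSTATE_START_COMMAND, which is the loop reported above.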
Re: [HACKERS] Parallel worker error
On Wed, Aug 30, 2017 at 8:49 PM, Robert Haas <robertmh...@gmail.com> wrote:

> But since that's an established design fl^H^Hprinciple, maybe that
> means we should go with the approach of teaching SerializeGUCState()
> to ignore role altogether and instead have ParallelWorkerMain call
> SetCurrentRoleId using information passed via the FixedParallelState
> (not sure of the precise details here).

Seems like a reasonable approach to me.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Parallel worker error
On Wed, Aug 30, 2017 at 5:19 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:

> I am able to reproduce this without involving session authorization
> guc as well. One needs to drop the newly created role from another
> session, then also we can see the same error.

Yeah, that's how I first created the case. But concurrent dropping of a
role (which is surprisingly allowed even when there are active connections
with the role active) opens another set of errors. Hence I tried to
reproduce this in a single session. While it might be interesting to fix
the concurrent role drop problem someday, the single-session problem looks
more pressing.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Parallel worker error
Hello,

While investigating an issue in Postgres-XL 10, I came across this rather
surprising behaviour in PG master. See this test case:

    create role testuser1;
    set role to testuser1;
    show role; -- shows testuser1

    -- reset back to default role
    reset session authorization;
    show role; -- shows none

    set force_parallel_mode TO 1;
    select count(*) from pg_class; -- ok

    -- now drop the role and try parallel query again
    drop role testuser1;
    select count(*) from pg_class; -- fails

The last statement in this test fails with an error:

    ERROR:  role "testuser1" does not exist
    CONTEXT:  parallel worker

Looks like the leader process, when it serialises the GUC state, saves the
"role" value, which is still set to testuser1 (and which is now dropped).
Later, when the parallel worker process tries to restore the GUC state, it
fails on the catalog lookup.

Comments in show_role() mention a kludge because apparently SET SESSION
AUTHORIZATION cannot call set_config_option and change the current value of
"role". So that probably explains why show_role() displays the correct
information, but the parallel worker fails to handle the case.

It's quite possible that I don't understand the differences between "role"
and "session authorization", but it still looks like a bug to me. Maybe
SerializeGUCState() should check if SetRoleIsActive is true and only then
save the role information?

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
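[Editor's note] A simplified sketch of the suggestion in the last
paragraph: serialize the "role" GUC only when SetRoleIsActive is true. The
function below is an invented stand-in for the relevant test inside
SerializeGUCState(), not the actual guc.c code.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Model of the proposed SerializeGUCState() tweak: skip the "role" GUC
 * unless an explicit SET ROLE is actually in effect.  In the failing test
 * case, RESET SESSION AUTHORIZATION leaves SET ROLE inactive while "role"
 * still names the (now droppable) user, so serializing it is both
 * unnecessary and dangerous for the parallel worker.
 */
bool
should_serialize_guc(const char *name, bool set_role_is_active)
{
    if (strcmp(name, "role") == 0)
        return set_role_is_active;
    return true;                /* all other GUCs serialized as before */
}
```

If the analysis in the message is right, the worker would then never see
the stale role value after a RESET SESSION AUTHORIZATION.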
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Jul 28, 2017 at 5:57 AM, Peter Geoghegan <p...@bowt.ie> wrote:

> Pavan Deolasee <pavan.deola...@gmail.com> wrote:
>> I'll be happy if someone wants to continue hacking the patch further and
>> get it in a committable shape. I can stay actively involved. But TBH the
>> amount of time I can invest is far less as compared to what I could
>> during the last cycle.
>
> That's disappointing.

Yes, it is even more so for me. But I was hard pressed to choose between
Postgres-XL 10 and WARM. Given ever increasing interest in XL and my
ability to control the outcome, I thought it made sense to focus on XL for
now.

> I personally find it very difficult to assess something like this.

One good thing is that the patch is ready and fully functional. So that
allows those who are keen to run real performance tests and see the actual
impact of the patch.

> The problem is that even if you can demonstrate that the patch is
> strictly better than what we have today, the risk of reaching a local
> maxima exists. Do we really want to double-down on HOT?

Well, HOT has served us well for over a decade now. So I won't hesitate to
place my bets on WARM.

> If I'm not mistaken, the goal of WARM is, roughly speaking, to make
> updates that would not be HOT-safe today do a "partial HOT update". My
> concern with that idea is that it doesn't do much for the worst case.

I see your point. But I would like to think of it this way: does the
technology significantly help many common use cases that are currently not
addressed by HOT? It probably won't help all workloads, that's a given.
Also, we don't have any credible alternative, while this patch has
progressed quite a lot. Maybe Robert will soon present the pluggable
storage/UNDO patch and that will cover everything and more that is
currently covered by HOT/WARM. That would probably make many other things
redundant.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Jul 26, 2017 at 6:26 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Tue, Apr 18, 2017 at 4:25 AM, Pavan Deolasee
> <pavan.deola...@gmail.com> wrote:
>> I'll include the fix in the next set of patches.
>
> I haven't seen a new set of patches. Are you intending to continue
> working on this?

Looks like I'll be short on bandwidth to pursue this further, given other
work commitments including the upcoming Postgres-XL 10 release. While I
haven't worked on the patch since April, I think it was in a pretty good
shape where I left it. But it's going to be incredibly difficult to
estimate the amount of further effort required, especially with testing
and validating all the use cases and finding optimisations to fix
regressions in all those cases. Also, many fundamental concerns around the
patch touching the core of the database engine can only be addressed if
some senior hackers, like you, take serious interest in the patch.

I'll be happy if someone wants to continue hacking the patch further and
get it in a committable shape. I can stay actively involved. But TBH the
amount of time I can invest is far less as compared to what I could during
the last cycle.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Fri, May 5, 2017 at 10:56 AM, Michael Paquier
<michael.paqu...@gmail.com> wrote:

> On Wed, May 3, 2017 at 12:25 AM, Peter Eisentraut wrote:
>>>> Can we prevent HOT pruning during logical decoding?
>>>
>>> It does not sound much difficult to do, couldn't you just make it a
>>> no-op with am_walsender?
>>
>> That's my hope.
>
> The only code path doing HOT-pruning and generating WAL is
> heap_page_prune(). Do you think that we need to worry about FPWs as
> well?

IMO the check should go inside heap_page_prune_opt(). Do we need to worry
about wal_log_hints or checksums producing WAL because of hint bit updates?
While I haven't read the thread, I am assuming that if HOT pruning can
happen, surely hint bits can get set too.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Assertion failure in REL9_5_STABLE
On Wed, Apr 19, 2017 at 11:19 AM, Andrew Gierth
<and...@tao11.riddles.org.uk> wrote:

>>>>>> "Pavan" == Pavan Deolasee <pavan.deola...@gmail.com> writes:
>
> Pavan> I am attaching a patch that throws a similar ERROR during
> Pavan> planning even for 9.5. AFAICS in presence of grouping sets, we
> Pavan> always decide to use sort-based implementation for grouping, but
> Pavan> do not check if the columns support ordering or not.
>
> Hmmm. This is obviously my fault somewhere... the patch looks good, I'll
> deal with it shortly.

JFR this got fixed via 7be3678a8cfb55dcfca90fa586485f835ab912a5.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Assertion failure in REL9_5_STABLE
On Wed, Apr 19, 2017 at 11:19 AM, Andrew Gierth
<and...@tao11.riddles.org.uk> wrote:

>>>>>> "Pavan" == Pavan Deolasee <pavan.deola...@gmail.com> writes:
>
> Pavan> I am attaching a patch that throws a similar ERROR during
> Pavan> planning even for 9.5. AFAICS in presence of grouping sets, we
> Pavan> always decide to use sort-based implementation for grouping, but
> Pavan> do not check if the columns support ordering or not.
>
> Hmmm. This is obviously my fault somewhere... the patch looks good, I'll
> deal with it shortly.

Thanks. It might be a good idea to include that test case from the master
branch. Please let me know if you would like me to send a new patch which
includes it.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Assertion failure in REL9_5_STABLE
Hi All,

I stumbled upon this test case from the master branch that hits an
assertion failure (see stack trace at the end) on REL9_5_STABLE.

    create temp table gstest4(id integer, v integer,
                              unhashable_col bit(4), unsortable_col xid);
    insert into gstest4
    values (1,1,b'0000','1'), (2,2,b'0001','1'),
           (3,4,b'0010','2'), (4,8,b'0011','2'),
           (5,16,b'0000','2'), (6,32,b'0001','2'),
           (7,64,b'0010','1'), (8,128,b'0011','1');

    -- mixed hashable/sortable cases
    select unhashable_col, unsortable_col,
           grouping(unhashable_col, unsortable_col),
           count(*), sum(v)
      from gstest4 group by grouping sets ((unhashable_col),(unsortable_col))
     order by 3, 5;

I tested this with REL9_6_STABLE and it throws a more useful error message,
presumably because the code was refactored quite heavily for the upper
planner changes:

    ERROR:  could not implement GROUP BY
    DETAIL:  Some of the datatypes only support hashing, while others only
    support sorting.

I am attaching a patch that throws a similar ERROR during planning even for
9.5. AFAICS, in the presence of grouping sets, we always decide to use the
sort-based implementation for grouping, but do not check if the columns
support ordering or not. I did not test this on other older branches
because grouping sets were introduced in 9.5.
Thanks,
Pavan

(lldb) bt
* thread #1: tid = 0x, 0x7fff9c1dfdda libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
  * frame #0: 0x7fff9c1dfdda libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x7fff9c2cb787 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x7fff9c145420 libsystem_c.dylib`abort + 129
    frame #3: 0x00010f61a3f0 postgres`ExceptionalCondition(conditionName="!(sortOperators[i] != 0)", errorType="BadArgument", fileName="tuplesort.c", lineNumber=657) + 128 at assert.c:54
    frame #4: 0x00010f65a07d postgres`tuplesort_begin_heap(tupDesc=0x7fb1528194d8, nkeys=1, attNums=0x7fb15280e9e0, sortOperators=0x7fb15280ea00, sortCollations=0x7fb15280ea20, nullsFirstFlags="", workMem=4096, randomAccess='\0') + 509 at tuplesort.c:657
    frame #5: 0x00010f2e871d postgres`initialize_phase(aggstate=0x7fb152817828, newphase=0) + 477 at nodeAgg.c:456
    frame #6: 0x00010f2e73f0 postgres`ExecInitAgg(node=0x7fb1528062e8, estate=0x7fb152817440, eflags=16) + 2656 at nodeAgg.c:2223
    frame #7: 0x00010f2d18e1 postgres`ExecInitNode(node=0x7fb1528062e8, estate=0x7fb152817440, eflags=16) + 865 at execProcnode.c:296
    frame #8: 0x00010f301231 postgres`ExecInitSort(node=0x7fb152806400, estate=0x7fb152817440, eflags=16) + 225 at nodeSort.c:202
    frame #9: 0x00010f2d18a9 postgres`ExecInitNode(node=0x7fb152806400, estate=0x7fb152817440, eflags=16) + 809 at execProcnode.c:286
    frame #10: 0x00010f2ccf20 postgres`InitPlan(queryDesc=0x7fb152803c40, eflags=16) + 1520 at execMain.c:957
    frame #11: 0x00010f2cc81f postgres`standard_ExecutorStart(queryDesc=0x7fb152803c40, eflags=16) + 591 at execMain.c:237
    frame #12: 0x00010f2cc5be postgres`ExecutorStart(queryDesc=0x7fb152803c40, eflags=0) + 62 at execMain.c:139
    frame #13: 0x00010f48b112 postgres`PortalStart(portal=0x7fb151836c40, params=0x, eflags=0, snapshot=0x) + 722 at pquery.c:533
    frame #14: 0x00010f4871c1 postgres`exec_simple_query(query_string="select unhashable_col, unsortable_col,\n grouping(unhashable_col, unsortable_col),\n count(*), sum(v)\n from gstest4 group by grouping sets ((unhashable_col),(unsortable_col))\n order by 3, 5;") + 945 at postgres.c:1065
    frame #15: 0x00010f486525 postgres`PostgresMain(argc=1, argv=0x7fb151801f00, dbname="postgres", username="pavan") + 2245 at postgres.c:4051
    frame #16: 0x00010f3ef392 postgres`BackendRun(port=0x7fb151700540) + 674 at postmaster.c:4254
    frame #17: 0x00010f3ee64d postgres`BackendStartup(port=0x7fb151700540) + 365 at postmaster.c:3928
    frame #18: 0x00010f3ed705 postgres`ServerLoop + 597 at postmaster.c:1698
    frame #19: 0x00010f3eb066 postgres`PostmasterMain(argc=3, argv=0x7fb151403740) + 5414 at postmaster.c:1306
    frame #20: 0x00010f32bddf postgres`main(argc=3, argv=0x7fb151403740) + 751 at main.c:228
    frame #21: 0x7fff9c0b1255 libdyld.dylib`start + 1
    frame #22: 0x7fff9c0b1255 libdyld.dylib`start + 1

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment: pg95_assertion_fix.patch (binary data)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
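[Editor's note] The shape of the check the patch proposes for 9.5 can be
modelled as follows. This is a self-contained sketch, not the planner code:
sortable[i] stands in for an OidIsValid() test on each grouping column's
ordering operator, and a false result corresponds to raising the same
"could not implement GROUP BY" error that 9.6 produces, instead of reaching
tuplesort_begin_heap() with an invalid sort operator.

```c
#include <stdbool.h>

/*
 * With grouping sets, the 9.5 planner always chooses sort-based grouping,
 * so every grouping column must have an ordering operator.  Returning
 * false here models where the planner would ereport(ERROR, ...) rather
 * than let execution trip the Assert(sortOperators[i] != 0) seen in the
 * stack trace above.
 */
bool
grouping_sets_sortable(const bool *sortable, int ncols)
{
    for (int i = 0; i < ncols; i++)
    {
        if (!sortable[i])
            return false;   /* would error out at plan time */
    }
    return true;
}
```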
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Apr 14, 2017 at 9:21 PM, Jaime Casanova
<jaime.casanova@2ndquadrant.com> wrote:

> Hi Pavan,
>
> I run a test on current warm patchset, i used pgbench with a scale of
> 20 and a fillfactor of 90 and then start the pgbench run with 6
> clients in parallel i also run sqlsmith on it.
>
> And i got a core dump after sometime of those things running.
>
> The assertion that fails is:
>
> """
> LOG: statement: UPDATE pgbench_tellers SET tbalance = tbalance + 3519
> WHERE tid = 34;
> TRAP: FailedAssertion("!(((bool) (((const void*)(>t_ctid) != ((void *)0))
> && (((>t_ctid)->ip_posid & uint16) 1) << 13) - 1)) != 0",
> File: "../../../../src/include/access/htup_details.h", Line: 659)
> """

Hi Jaime,

Thanks for doing the tests and reporting the problem. Per our chat, the
assertion failure occurs only after a crash recovery. I traced it down to
the point where we were failing to set the root line pointer correctly
during crash recovery. In fact, we were setting it, but after the local
changes were copied to the on-disk image, thus failing to make it to the
storage.

Can you please test with the attached patch and confirm it works? I was
able to reproduce the exact same assertion on my end and the patch seems to
fix it. But an additional check won't harm. I'll include the fix in the
next set of patches.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment: warm_crash_recovery_fix.patch (binary data)
Re: [HACKERS] PANIC in pg_commit_ts slru after crashes
On Sat, Apr 15, 2017 at 12:53 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:

> In the first statement executed after crash recovery, I sometimes get
> this error:
>
> PANIC: XX000: could not access status of transaction 207580505
> DETAIL: Could not read from file "pg_commit_ts/1EF0" at offset 131072:
> Success.
> LOCATION: SlruReportIOError, slru.c:918
> STATEMENT: create temporary table aldjf (x serial)
>
> Other examples:
>
> PANIC: XX000: could not access status of transaction 3483853232
> DETAIL: Could not read from file "pg_commit_ts/20742" at offset 237568:
> Success.
> LOCATION: SlruReportIOError, slru.c:918
> STATEMENT: create temporary table aldjf (x serial)
>
> PANIC: XX000: could not access status of transaction 802552883
> DETAIL: Could not read from file "pg_commit_ts/779E" at offset 114688:
> Success.
> LOCATION: SlruReportIOError, slru.c:918
> STATEMENT: create temporary table aldjf (x serial)
>
> Based on the errno, I'm assuming the read was successful but returned the
> wrong number of bytes (which was zero in the case I saw after changing
> the code to log short reads).
>
> It then goes through recovery again and the problem does not immediately
> re-occur if you attempt to connect again. I don't know why the file size
> would have changed between attempts.
>
> The problem bisects to the commit:
>
> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
> Author: Simon Riggs <si...@2ndquadrant.com>
> Date: Tue Apr 4 15:56:56 2017 -0400
>
>     Speedup 2PC recovery by skipping two phase state files in normal path
>
> It is not obvious to me how that is relevant. My test doesn't use
> prepared transactions (and leaves the guc at zero), and this commit
> doesn't touch slru.c.
>
> I'm attaching the test harness. There is a patch which injects the
> crash-faults and also allows xid fast-forward, a perl script that runs
> until crash and assesses the consistency of the post-crash results, and a
> shell script which sets up the database and then calls the perl script in
> a loop. On an 8 CPU machine, it takes about an hour for the PANIC to
> occur.
>
> The attached script bails out once it sees the PANIC (count.pl line 158);
> if it didn't do that then it would proceed to connect again and retry,
> and works fine. No apparent permanent data corruption.
>
> Any clues on how to investigate this further?

Since all those offsets fall on a page boundary, my guess is that we're
somehow failing to handle a new page correctly. Looking at the patch
itself, my feeling is that the following code in
src/backend/access/transam/twophase.c might be causing the problem.

    /* update nextXid if needed */
    if (TransactionIdFollowsOrEquals(maxsubxid, ShmemVariableCache->nextXid))
    {
        LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
        ShmemVariableCache->nextXid = maxsubxid;
        TransactionIdAdvance(ShmemVariableCache->nextXid);
        LWLockRelease(XidGenLock);
    }

The function PrescanPreparedTransactions() gets called at the start of redo
recovery, and this specific block will get exercised irrespective of
whether there are any prepared transactions or not. What I find
particularly wrong here is that we are initialising maxsubxid to the
current value of ShmemVariableCache->nextXid when the function enters, but
this block would then again increment ShmemVariableCache->nextXid, even
when there are no prepared transactions in the system. I wonder if we
should do as in the attached patch.

It's not entirely clear to me why only CommitTS fails and not the other
SLRUs. One possible reason could be that CommitTS is started before this
function gets called, whereas CLOG and SUBTRANS get started after the
function returns.
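[Editor's note] A simplified model of the suspected bug and one way to fix
it. Because maxsubxid starts out equal to ShmemVariableCache->nextXid, the
TransactionIdFollowsOrEquals() test succeeds even when no prepared
transaction was scanned, and nextXid is advanced spuriously on every crash
recovery. The sketch below (with an invented seen_subxid flag, and ignoring
xid wraparound and locking) only advances nextXid when a prepared
transaction actually contributed a subxid; the attached patch may express
the fix differently.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Model of the nextXid adjustment in PrescanPreparedTransactions().
 * seen_subxid is true only if some prepared transaction's state file
 * actually reported a subtransaction xid.  Without that guard, the
 * maxsubxid == next_xid case (no prepared xacts at all) would still
 * bump nextXid, which is the spurious advance suspected above.
 */
TransactionId
adjust_next_xid(TransactionId next_xid, TransactionId maxsubxid,
                bool seen_subxid)
{
    if (seen_subxid && maxsubxid >= next_xid)
        return maxsubxid + 1;   /* TransactionIdAdvance, simplified */
    return next_xid;            /* no prepared xacts: leave nextXid alone */
}
```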
Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment: fix_commit_ts_crash.patch (binary data)
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Apr 12, 2017 at 10:42 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Tue, Apr 11, 2017 at 1:20 PM, Pavan Deolasee wrote:
>> 5. Added code to set a CLEAR pointer to a WARM pointer when we know that
>> the entire chain is WARM. This should address the workload Dilip ran and
>> found regression (I don't think he got a chance to confirm that)
>
> Which is clearly a thing that should happen before commit, and really,
> you ought to be leading the effort to confirm that, not him. It's
> good for him to verify that your fix worked, but you should test it
> first.

Not sure why you think I did not do the tests. I did, and reported that it
helps reduce the regression. See the last para here:
https://www.postgresql.org/message-id/CABOikdOTstHK2y0rDk%2BY3Wx9HRe%2BbZtj3zuYGU%3DVngneiHo5KQ%40mail.gmail.com

I understand it might have got lost in the conversation and I possibly did
a poor job of explaining it. From my perspective, I did not want to say
that everything is hunky-dory based on my own tests because 1. I probably
do not have access to the same kind of machine Dilip has, and 2. it's
better to get it confirmed by someone who initially reported it. Again, I
fully respect that he would be busy with other things and I do not expect
him or anyone else to test/review my patch on priority. The only point I am
trying to make is that I did my own tests and made sure that it helps.

(Having said that, I am not sure if changing the pointer state from CLEAR
to WARM is indeed a good change. Having thought more about it and after
looking at the page-split code, I now think that this might just confuse
the WARM cleanup code and make the algorithm that much harder to prove.)

>> 6. Enhanced stats collector to collect information about candidate WARM
>> chains and added mechanism to control WARM cleanup at the heap as well
>> as index level, based on configurable parameters. This gives user better
>> control over the additional work that is required for WARM cleanup.
>
> I haven't seen previous discussion of this; therefore I doubt whether
> we have agreement on these parameters.

Sure. I will bring these up in a more structured manner for everyone to
comment.

>> 7. Added table level option to disable WARM if nothing else works.
>
> -1 from me.

Ok. It's kind of a last resort for me too. But at some point, we might want
to make that call if we find an important use case that regresses because
of WARM and we see no way to fix it, or at least not without a whole lot of
complexity.

>> I may have missed something, but there is no intention to ignore known
>> regressions/reviews. Of course, I don't think that every regression will
>> be solvable, like if you run a CPU-bound workload, setting it up in a
>> way such that you repeatedly exercise the area where WARM is doing
>> additional work, without providing any benefit, may be you can still
>> find regression. I am willing to fix them as long as they are fixable
>> and we are comfortable with the additional code complexity. IMHO certain
>> trade-offs are good, but I understand that not everybody will agree with
>> my views and that's ok.
>
> The point here is that we can't make intelligent decisions about
> whether to commit this feature unless we know which situations get
> better and which get worse and by how much.

Sure.

> I don't accept as a general principle the idea that CPU-bound workloads
> don't matter. Obviously, I/O-bound workloads matter too, but we can't
> throw CPU-bound workloads under the bus.

Yeah, definitely not suggesting that.

> Now, avoiding index bloat does also save CPU, so it is easy to imagine
> that WARM could come out ahead even if each update consumes slightly more
> CPU when actually updating, so we might not actually regress. If we do, I
> guess I'd want to know why.

Well, the kind of tests we did to look for regression were worst case
scenarios. For example, in the test where we found a 10-15% regression, we
used a wide index (so recheck cost is high), WARM updated all rows,
disabled auto-vacuum (so no chain conversion) and then repeatedly selected
the rows from the index, thus incurring recheck overhead and, in fact,
measuring only that.

When I measured WARM on tables with a small scale factor so that everything
fits in memory, I found a modest 20% improvement in tps. So, you're right,
WARM might also help in-memory workloads. But that will show up only if we
measure UPDATEs and SELECTs both. If we measure only SELECTs, and that too
in a state where we are paying all the price for having done a WARM update,
obviously we will only see regression, if any. Not saying we should ignore
that. We should in fact measure all possible loads, and try to fix as many
as we can, especially if they resemble a r
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Apr 13, 2017 at 2:04 AM, Peter Geoghegan <p...@bowt.ie> wrote:

> On Wed, Apr 12, 2017 at 10:12 AM, Robert Haas <robertmh...@gmail.com>
> wrote:
>>> I may have missed something, but there is no intention to ignore known
>>> regressions/reviews. Of course, I don't think that every regression
>>> will be solvable, like if you run a CPU-bound workload, setting it up
>>> in a way such that you repeatedly exercise the area where WARM is doing
>>> additional work, without providing any benefit, may be you can still
>>> find regression. I am willing to fix them as long as they are fixable
>>> and we are comfortable with the additional code complexity. IMHO
>>> certain trade-offs are good, but I understand that not everybody will
>>> agree with my views and that's ok.
>>
>> The point here is that we can't make intelligent decisions about
>> whether to commit this feature unless we know which situations get
>> better and which get worse and by how much. I don't accept as a
>> general principle the idea that CPU-bound workloads don't matter.
>> Obviously, I/O-bound workloads matter too, but we can't throw
>> CPU-bound workloads under the bus. Now, avoiding index bloat does
>> also save CPU, so it is easy to imagine that WARM could come out ahead
>> even if each update consumes slightly more CPU when actually updating,
>> so we might not actually regress. If we do, I guess I'd want to know
>> why.
>
> I myself wonder if this CPU overhead is at all related to LP_DEAD
> recycling during page splits.

With respect to the tests that myself, Dilip and others did for WARM, I
think we were kind of exercising the worst case scenario. Like in one case,
we created a table with 40% fill factor, created an index with a large text
column, WARM updated all rows in the table, turned off autovacuum so that
chain conversion does not take place, and then repeatedly ran a select
query on those rows using the index which did not receive the WARM insert.
IOW we were only measuring the overhead of doing the recheck by
constructing an index tuple from the heap tuple and then comparing it
against the existing index tuple. And we did find regression, which is not
entirely surprising because obviously that code path does extra work when
it needs to do a recheck. And we're only measuring that overhead without
taking into account the benefits of WARM to the system in general. I think
the counter-argument to that is that such a workload may exist somewhere
which might be regressed.

> I have my suspicions that the recycling has some relationship to
> locality, which leads me to want to investigate how Claudio Freire's
> patch to consistently treat heap TID as part of the B-Tree sort order
> could help, both in general, and for WARM.

It could be, especially if we redesign recheck solely based on the index
pointer state and the heap tuple state. That could be more performant for
selects and could also be more robust, but will require index inserts to
get hold of the old index pointer (based on root TID), compare it against
the new index tuple and either skip the insert (if everything matches) or
set a PREWARM flag on the old pointer, and insert the new tuple with a
POSTWARM flag. Searching for the old index pointer will be a non-starter
for non-unique indexes, unless they are also sorted by TID, something that
Claudio's patch does. What I am not sure about is whether the patch on its
own will stand the performance implications, because it increases the index
tuple width (and probably index maintenance cost too).

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Apr 12, 2017 at 9:23 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:

> On Tue, Apr 11, 2017 at 10:50 PM, Pavan Deolasee wrote:
>> And I fixed them as quickly as humanly possible.
>
> Yes, you have responded to them quickly, but I didn't get a chance to
> re-verify all of those. However, I think the main point Robert wants
> to say is that somebody needs to dig the complete patch to see if
> there is any kind of problems with it.

There are no two views about that. I don't even claim that more problems
won't be found during an in-depth review. I was only responding to his view
that I did not do much to address the regressions reported during the
review/tests.

>> 5. Added code to set a CLEAR pointer to a WARM pointer when we know that
>> the entire chain is WARM. This should address the workload Dilip ran and
>> found regression (I don't think he got a chance to confirm that)
>
> Have you by any chance tried to reproduce it at your end?

I did reproduce it and verified that the new technique helps the case [1]
(see last para). I did not go the extra length to check if there are more
cases which can still cause regression, like when recheck is applied only
once to each tuple (so the new technique does not yield any benefit), and
whether that still causes regression and by how much. However, I ran a pure
pgbench workload (only HOT updates) with a smallish scale factor so that
everything fits in memory and did not find any regression.

Having said that, it's my view, which others need not agree with, that we
need to distinguish between CPU and IO load, since WARM is designed to
address IO problems and not so much CPU problems. We also need to see
things in totality and probably measure updates and selects both if we are
going to WARM update all tuples once and read them once. That doesn't mean
we shouldn't perform more tests, and I am more than willing to fix things
if we find a regression in even a remotely real-world use case.
Thanks, Pavan [1] https://www.postgresql.org/message-id/CABOikdOTstHK2y0rDk%2BY3Wx9HRe%2BbZtj3zuYGU%3DVngneiHo5KQ%40mail.gmail.com -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Apr 11, 2017 at 7:10 PM, Robert Haas <robertmh...@gmail.com> wrote: > > > Yes, and as Andres says, you don't help with those, and then you're > upset when your own patch doesn't get attention.

I am not upset; I was obviously a bit disappointed, which I think is a very natural emotion after spending weeks on it. I am not blaming any one individual (excluding myself) for the outcome, and neither the community at large. And I've moved on. I know everyone is busy getting the release ready and I see no point discussing this endlessly. We have enough on our plates for the next few weeks.

> > Amit and others who have started to > dig into this patch a little bit found real problems pretty quickly > when they started digging.

And I fixed them as quickly as humanly possible.

> Those problems should be addressed, and > review should continue (from whatever source) until no more problems > can be found.

Absolutely.

> The patch may > or may not have any data-corrupting bugs, but regressions have been > found and not addressed.

I don't know why you say that regressions are not addressed. Here are a few things I did to address the regressions/reviews/concerns, apart from fixing all the bugs discovered, but please let me know if there are things I've not addressed.

1. Improved the interesting attrs patch that Alvaro wrote to address the regression discovered in fetching more heap attributes. The patch that got committed in fact improved certain synthetic workloads over the then master.

2. Based on Petr and your feedback, disabled WARM on toasted attributes to reduce the overhead of fetching/decompressing the attributes.

3. Added code to avoid doing a second index scan when the index does not contain any WARM pointers. This should address the situation Amit brought up where only one of the indexes receives WARM inserts.

4. Added code to kill wrong index pointers to do online cleanup.

5. Added code to set a CLEAR pointer to a WARM pointer when we know that the entire chain is WARM.
This should address the workload Dilip ran and found regression (I don't think he got a chance to confirm that).

6. Enhanced the stats collector to collect information about candidate WARM chains and added a mechanism to control WARM cleanup at the heap as well as index level, based on configurable parameters. This gives users better control over the additional work that is required for WARM cleanup.

7. Added a table-level option to disable WARM if nothing else works.

8. Added a mechanism to disable WARM when more than 50% of the indexes are being updated. I ran some benchmarks with different percentages of indexes getting updated and thought this was a good threshold.

I may have missed something, but there is no intention to ignore known regressions/reviews. Of course, I don't think that every regression will be solvable: if you run a CPU-bound workload, set up in a way that repeatedly exercises the area where WARM does additional work without providing any benefit, maybe you can still find regression. I am willing to fix them as long as they are fixable and we are comfortable with the additional code complexity. IMHO certain trade-offs are good, but I understand that not everybody will agree with my views and that's ok.

Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Apr 5, 2017 at 11:57 PM, Andres Freund <and...@anarazel.de> wrote: > On 2017-04-05 09:36:47 -0400, Robert Haas wrote: > > By the way, the "Converting WARM chains back to HOT chains" section of > > README.WARM seems to be out of date. Any chance you could update that > > to reflect the current state and thinking of the patch? > > I propose we move this patch to the next CF. That shouldn't prevent you > working on it, although focusing on review of patches that still might > make it wouldn't hurt either. >

Thank you all for the reviews, feedback, tests, criticism. And apologies for pushing it till the last minute even though it was clear to me quite some time back that the patch was not going to make it. But if I'd given up, it would never have received whatever little attention it got.

The only thing that disappoints me is that the patch was held back on no strong technical grounds - at least none were clear to me. There were concerns about on-disk changes etc, but most on-disk changes had been known for 7 months now. Reminds me of HOT development, when it would not receive adequate feedback for many months, probably for very similar reasons - a complex patch that changes the on-disk format and is risky - even though the performance gains were quite substantial. I was much more hopeful this time because we have many more experts now as compared to then, but we probably also have many more complex patches to review/commit.

I understand that we would like this patch to go in very early in the development cycle. So as Alvaro mentioned elsewhere, we will continue to work on it so that we can get it in as soon as the v11 tree opens. We shall soon submit a revised version, with the list of critical things, so that we can discuss them here and get some useful feedback. I hope everyone understands that a feature of this kind won't happen without on-disk format changes. So to be able to address any concerns, we will need specific feedback and workable suggestions, if any.
Finally, my apologies for not spending enough time reviewing other patches. I know it's critical, and I'll try to improve on that. Congratulations to all whose work got accepted and many thanks to all reviewers/committers/CF managers. I know how difficult and thankless that work is. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Apr 6, 2017 at 12:20 AM, Peter Geoghegan <p...@bowt.ie> wrote: > On Wed, Apr 5, 2017 at 11:27 AM, Andres Freund <and...@anarazel.de> wrote: > > I propose we move this patch to the next CF. > > I agree. I think it's too late to be working out fine details around > TOAST like this. This is a patch that touches the storage format in a > fairly fundamental way. > > The idea of turning WARM on or off reminds me a little bit of the way > it was at one time suggested that HOT not be used against catalog > tables, a position that Tom pushed against.

I agree. I am grateful that Tom put his foot down and helped me find answers to all the hard problems, including catalog tables and create index concurrently. So I was very clear in my mind from the very beginning that WARM must support all these things too. Obviously it still doesn't support everything, like other index methods and expression indexes, but IMHO that's a much smaller problem. Also, making sure that WARM works on system tables helped me find corner bugs which would otherwise have slipped past regular regression testing.

> I'm not saying that it's > necessarily a bad idea, but we should exhaust alternatives, and have a > clear rationale for it. >

One reason why it's probably a good idea is that we know WARM will not be effective for all use cases and it might actually cause performance regression for some of them. Even worse, and as Robert fears, it might cause data loss issues. Though TBH I haven't yet seen any concrete example where it breaks so badly that it causes data loss, but that may be because the patch still hasn't received enough eyeballs or outside tests. Having a table-level option would allow us to incrementally improve things instead of making the initial patch so large that reviewing it is a complete nightmare. Maybe it's already a nightmare. It's not as if HOT would not have caused regression for some specific use cases.
But I think the general benefit was so strong that we never invested time in finding and tuning for those specific cases, thus avoiding some additional complexity in the code. WARM's benefits are probably not the same as HOT's, or our standards may have changed, or we now have resources to do much more elaborate tests, which were missing 10 years back. But now that we are aware of some regressions, the choice is between spending a considerable amount of time trying to handle every case vs doing it incrementally and starting to deliver to the majority of users, yet keeping the patch at a manageable size. Even if we were to provide a table-level option, my preference would be to keep it ON by default. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Apr 6, 2017 at 1:06 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Wed, Apr 5, 2017 at 2:32 PM, Pavan Deolasee <pavan.deola...@gmail.com> > wrote: > > > The other way is to pass old tuple values along with the new tuple > values to > > amwarminsert, build index tuples and then do a comparison. For duplicate > > index tuples, skip WARM inserts. > > This is more what I was thinking. But maybe one of the other ideas > you wrote here is better; not sure. > > Ok. I think I suggested this as one of the ideas upthread, to support hash indexes for example. This might be a good safety-net, but AFAIC what we have today should work since we pretty much construct index tuples in a consistent way before doing a comparison. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Apr 5, 2017 at 8:42 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Tue, Apr 4, 2017 at 10:21 PM, Pavan Deolasee > <pavan.deola...@gmail.com> wrote: > > On Thu, Mar 30, 2017 at 7:55 PM, Robert Haas <robertmh...@gmail.com> > wrote: > >> but > >> try to access the TOAST table would be fatal; that probably would have > >> deadlock hazards among other problems. > > > > Hmm. I think you're right. We could make a copy of the heap tuple, drop > the > > lock and then access TOAST to handle that. Would that work? > > Yeah, but it might suck. :-)

Well, better than causing a deadlock ;-)

Let's see if we want to go down the path of blocking WARM when tuples have toasted attributes. I submitted a patch yesterday, but having slept on it, I think I made mistakes there. It might not be enough to look at the caller-supplied new tuple, because that may not have any toasted values, but the final tuple that gets written to the heap may be toasted. We could look at the new tuple's attributes to find if any indexed attributes are toasted, but that might suck as well.

Or we can simply block WARM if the old or the new tuple has external attributes, i.e. HeapTupleHasExternal() returns true. That could be overly restrictive, because irrespective of whether the indexed attributes are toasted or just some other attribute is toasted, we will block WARM on such updates. Maybe that's not a problem. We will also need to handle the case where some older tuple in the chain has a toasted value and that tuple is presented to recheck (I think we can handle that case fairly easily, but it's not done in the code yet) because of a subsequent WARM update, where the tuples updated by WARM did not have any toasted values (and were hence allowed).

Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
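[The conservative gating described in the mail above can be sketched as a standalone model. `HeapTupleHasExternal()` is the real server-side check; the gating function and its name here are invented for illustration, with plain booleans standing in for the tuple inspection.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the conservative rule discussed above: block WARM whenever
 * either the old or the new tuple carries any external (toasted)
 * attribute, regardless of whether that attribute is indexed.  In the
 * server the inputs would come from HeapTupleHasExternal(); here two
 * booleans stand in for that check.
 */
static bool
warm_update_allowed(bool old_has_external, bool new_has_external)
{
    /*
     * Deliberately over-restrictive: any external attribute, indexed
     * or not, disables WARM for this update.
     */
    return !old_has_external && !new_has_external;
}
```

The trade-off is simplicity: the check never has to examine which attributes are indexed, at the cost of skipping WARM for some updates that would have been safe.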
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Mar 30, 2017 at 7:55 PM, Robert Haas <robertmh...@gmail.com> wrote: > but > try to access the TOAST table would be fatal; that probably would have > deadlock hazards among other problems. Hmm. I think you're right. We could make a copy of the heap tuple, drop the lock and then access TOAST to handle that. Would that work? Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Mar 30, 2017 at 11:17 PM, Bruce Momjian <br...@momjian.us> wrote: > On Tue, Mar 21, 2017 at 04:04:58PM -0400, Bruce Momjian wrote: > > On Tue, Mar 21, 2017 at 04:56:16PM -0300, Alvaro Herrera wrote: > > > Bruce Momjian wrote: > > > > On Tue, Mar 21, 2017 at 04:43:58PM -0300, Alvaro Herrera wrote: > > > > > Bruce Momjian wrote: > > > > > > > > > > > I don't think it makes sense to try and save bits and add > complexity > > > > > > when we have no idea if we will ever use them, > > > > > > > > > > If we find ourselves in dire need of additional bits, there is a > known > > > > > mechanism to get back 2 bits from old-style VACUUM FULL. I assume > that > > > > > the reason nobody has bothered to write the code for that is that > > > > > there's no *that* much interest. > > > > > > > > We have no way of tracking if users still have pages that used the > bits > > > > via pg_upgrade before they were removed. > > > > > > Yes, that's exactly the code that needs to be written. > > > > Yes, but once it is written it will take years before those bits can be > > used on most installations. > > Actually, the 2 bits from old-style VACUUM FULL bits could be reused if > one of the WARM bits would be set when it is checked. The WARM bits > will all be zero on pre-9.0. The check would have to be checking the > old-style VACUUM FULL bit and checking that a WARM bit is set. > > We're already doing that in the submitted patch. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
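[The bit-reuse rule Bruce describes can be sketched as follows. `HEAP_MOVED_OFF`/`HEAP_MOVED_IN` are the real old-style VACUUM FULL infomask bits; the WARM bit names and values below are hypothetical, since the patch's actual encoding is not shown here.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Real infomask bits left over from old-style VACUUM FULL. */
#define HEAP_MOVED_OFF 0x4000
#define HEAP_MOVED_IN  0x8000

/* Hypothetical WARM bits -- placeholders, not the patch's values. */
#define HEAP_WARM_TUPLE   0x0001
#define HEAP_WARM_UPDATED 0x0002

/*
 * Pre-9.0 tuples can have a MOVED bit set but never a WARM bit, so the
 * combination (MOVED bit AND a WARM bit) is unambiguous: it can only
 * have been written by WARM-aware code, making the MOVED bits reusable.
 */
static bool
infomask_bits_reused_by_warm(uint16_t infomask)
{
    return (infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN)) != 0 &&
           (infomask & (HEAP_WARM_TUPLE | HEAP_WARM_UPDATED)) != 0;
}
```

A tuple with only a MOVED bit set must be treated as a genuine pre-9.0 leftover; only the combination signals the new meaning.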
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Mar 31, 2017 at 12:31 PM, Dilip Kumar <dilipbal...@gmail.com> wrote: > On Thu, Mar 30, 2017 at 5:27 PM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > > I am not sure if we can consider it as completely synthetic because we > > might see some similar cases for json datatypes. Can we once try to > > see the impact when the same test runs from multiple clients? For > > your information, I am also trying to setup some tests along with one > > of my colleague and we will report the results once the tests are > > complete.
>
> We have done some testing and below are the test details and results.
>
> Test:
> I have derived this test from the test given by Pavan [1], with the
> following differences.
>
> - I have reduced the fill factor to 40 to ensure that there is scope
>   in the page to store multiple WARM chains.
> - WARM updated all the tuples.
> - Executed a large select to force a lot of tuple rechecks within a
>   single query.
> - Smaller tuple size (aid field is around ~100 bytes) just to ensure
>   tuples have sufficient space on a page to get WARM updated.
>
> Results:
> ---
> * I can see more than 15% regression in this case. This regression
>   is repeatable.
> * If I increase the fill factor to 90 then the regression reduces to
>   7%; maybe only fewer tuples are getting WARM updated and others are
>   not because of no space left on the page after a few WARM updates.

Thanks for doing the tests. The tests show us that if the table gets filled up with WARM chains, and they are not cleaned up and the table is subjected to a read-only workload, we will see regression. Obviously, the test is completely CPU bound, something WARM is not meant to address. I am not yet certain if recheck is causing the problem. Yesterday I ran the test where I was seeing regression with recheck completely turned off and still saw regression. So there is something else that's going on with this kind of workload. Will check.
Having said that, I think there are some other ways to fix some of the common problems with repeated rechecks. One thing we can do is rely on the index pointer flags to decide whether recheck is necessary or not. For example, a WARM pointer to a WARM tuple does not require recheck. Similarly, a CLEAR pointer to a CLEAR tuple does not require recheck. A WARM pointer to a CLEAR tuple can be discarded immediately, because the only situation where it can occur is the case of aborted WARM updates. The only troublesome situation is a CLEAR pointer to a WARM tuple; that entirely depends on whether the index had received a WARM insert or not.

What we can do though is, if recheck succeeds for the first time and the chain has only WARM tuples, set the WARM bit on the index pointer. We can use the same hint mechanism as used for marking index pointers dead to minimise overhead. Obviously this will only handle the case when the same tuple is rechecked often. But if a tuple is rechecked only once, then maybe other overheads will kick in, thus reducing the regression significantly.

Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
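[The pointer/tuple rules described in the mail above can be sketched as a small decision function. This is a standalone model; all names here are invented for illustration and do not appear in the patch.]

```c
#include <assert.h>

/* Hypothetical states -- the patch encodes these as flag bits. */
typedef enum { PTR_CLEAR, PTR_WARM } PointerState;
typedef enum { CHAIN_CLEAR, CHAIN_WARM } ChainState;

typedef enum {
    MATCH_NO_RECHECK,   /* pointer trivially matches the chain */
    MATCH_RECHECK,      /* must re-evaluate index keys against the heap */
    MATCH_DISCARD       /* dead pointer left by an aborted WARM update */
} RecheckDecision;

/*
 * Decision table from the discussion above:
 *   WARM  pointer -> WARM  chain : no recheck
 *   CLEAR pointer -> CLEAR chain : no recheck
 *   WARM  pointer -> CLEAR chain : discard (aborted WARM update)
 *   CLEAR pointer -> WARM  chain : recheck, since we can't know whether
 *                                  this index received a WARM insert
 */
static RecheckDecision
warm_recheck_needed(PointerState ptr, ChainState chain)
{
    if (ptr == PTR_WARM && chain == CHAIN_CLEAR)
        return MATCH_DISCARD;
    if (ptr == PTR_CLEAR && chain == CHAIN_WARM)
        return MATCH_RECHECK;
    return MATCH_NO_RECHECK;
}
```

With the proposed hint-bit optimisation, a successful recheck on an all-WARM chain would flip the pointer from CLEAR to WARM, moving later lookups from the MATCH_RECHECK row into the cheap MATCH_NO_RECHECK row.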
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Mar 31, 2017 at 11:16 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Fri, Mar 31, 2017 at 10:24 AM, Pavan Deolasee > <pavan.deola...@gmail.com> wrote: > > Having worked on it for some time now, I can say that WARM uses pretty > much > > the same infrastructure that HOT uses for cleanup/pruning tuples from the > > heap. So the risk of having a bug which can eat your data from the heap > is > > very low. Sure, it might mess up with indexes, return duplicate keys, not > > return a row when it should have. Not saying they are not bad bugs, but > > probably much less severe than someone removing live rows from the heap. > > Yes, that's true. If there's nothing wrong with the way pruning > works, then any other problem can be fixed by reindexing, I suppose. >

Yeah, I think so.

> I'm not generally a huge fan of on-off switches for things like this, > but I know Simon likes them. I think the question is how much they > really insulate us from bugs. For the hash index patch, for example, > the only way to really get insulation from bugs added in this release > would be to ship both the old and the new code in separate index AMs > (hash, hash2). The code has been restructured so much in the process > of doing all of this that any other form of on-off switch would be > pretty hit-or-miss whether it actually provided any protection. > > Now, I am less sure about this case, but my guess is that you can't > really have this be something that can be flipped on and off for a > table. Once a table has any WARM updates in it, the code that knows > how to cope with that has to be enabled, and it'll work as well or > poorly as it does.

That's correct. Once enabled, we will need to handle the case of two index pointers pointing to the same root. The only way to get rid of that is probably to do a complete rewrite/reindex, I suppose. But I was mostly talking about an immutable flag at table creation time, as you rightly guessed.
> Now, I understand you to be suggesting a flag at > table-creation time that would, maybe, be immutable after that, but > even then - are we going to run completely unmodified 9.6 code for > tables where that's not enabled, and only go through any of the WARM > logic when it is enabled? Doesn't sound likely. The commits already > made from this patch series certainly affect everybody, and I can't > see us adding switches that bypass > ce96ce60ca2293f75f36c3661e4657a3c79ffd61 for example.

I don't think I am going to claim that either. But probably only 5% of the new code would then be involved, which is a lot less and much more manageable. Having said that, I think if we do this at all, we should only do it based on our experiences in the beta cycle, as a last resort.

Based on my own experiences during HOT development, long-running pgbench tests with several concurrent clients, subjected to multiple AV cycles and periodic consistency checks, usually bring up issues related to heap corruption. So my confidence level is relatively high on that part of the code. That's not to suggest that there can't be any bugs.

Obviously then there are other things, such as regression in some workloads or additional work required by vacuum etc. And I think we should address them, and I'm fairly certain we can do that. It may not happen immediately, but if we provide the right knobs, maybe those who are affected can fall back to the old behaviour or not use the new code at all while we improve things for them. Some of these things I could have already implemented, but without a clear understanding of whether the feature will get in or not, it's hard to keep putting infinite effort into the patch. All non-committers go through that dilemma all the time, I'm sure.

Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Mar 31, 2017 at 6:47 PM, Robert Haas <robertmh...@gmail.com> wrote: > > > 2. WARM is a non-optional feature which touches the on-disk format. > There is nothing more dangerous than that. If hash indexes have bugs, > people can avoid those bugs by not using them; there are good reasons > to suppose that hash indexes have very few existing users. The > expression evaluation changes, IMHO, are much more dangerous because > everyone will be exposed to them, but they will not likely corrupt > your data because they don't touch the on-disk format. WARM is even a > little more dangerous than that; everyone is exposed to those bugs, > and in the worst case they could eat your data. >

Having worked on it for some time now, I can say that WARM uses pretty much the same infrastructure that HOT uses for cleanup/pruning tuples from the heap. So the risk of having a bug which can eat your data from the heap is very low. Sure, it might mess up with indexes, return duplicate keys, not return a row when it should have. Not saying they are not bad bugs, but probably much less severe than someone removing live rows from the heap.

And we can make it a table-level property, keep it off by default, turn it off on system tables in this release, and change the defaults only when we get more confidence, assuming people use it by explicitly turning it on. Now maybe that's not the right approach, and keeping it off by default will mean it receives much less testing than we would like. So we can keep it on in the beta cycle and then take a call.

I went to great lengths to make it work on system tables because during HOT development, Tom told me that it had better work for everything or it doesn't work at all. But with WARM it works for system tables and there are no known bugs; still, if we don't want to risk system tables, we might want to turn it off there (maybe just prior to release).
Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Mar 30, 2017 at 7:27 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Thu, Mar 30, 2017 at 6:37 AM, Pavan Deolasee > <pavan.deola...@gmail.com> wrote: > > I think we can fix that by comparing compressed values. I know you had > > raised concerns, but Robert confirmed that (IIUC) it's not a problem > today. > > I'm not sure that's an entirely fair interpretation of what I said. > My point was that, while it may not be broken today, it might not be a > good idea to rely for correctness on it always being true. >

I take that point. We have a choice of fixing it today, or whenever we add support for multiple compression techniques. We don't even know what that will look like, and whether we will be able to look at compressed data and tell whether two values were compressed in exactly the same way.

Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Mar 30, 2017 at 5:27 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > How have you verified that? Have you checked that in > heap_prepare_insert it has called toast_insert_or_update() and then > returned a tuple different from what the input tup is? Basically, I > am easily able to see it and even the reason why the heap and index > tuples will be different. Let me try to explain, > toast_insert_or_update returns a new tuple which contains compressed > data and this tuple is inserted in heap where as slot still refers to > original tuple (uncompressed one) which is passed to heap_insert. > Now, ExecInsertIndexTuples and the calls under it like FormIndexDatum > will refer to the tuple in slot which is uncompressed and form the > values[] using uncompressed value.

Ah, yes. You're right. Not sure why I saw things differently. That doesn't change anything though, because during recheck we'll get the compressed value and not do anything with it. In the index we already have the compressed value, and we can compare them. Even if we decide to decompress everything and do the comparison, that should be possible. So I don't see a problem as far as correctness goes.

> > So IIUC, in above test during initialization you have one WARM update > and then during actual test all are HOT updates, won't in such a case > the WARM chain will be converted to HOT by vacuum and then all updates > from thereon will be HOT and probably no rechecks? >

There is no AV; just 1 tuple being HOT-updated out of 100 tuples. Confirmed by looking at pg_stat_user_tables. Also made sure that the tuple doesn't get non-HOT updated in between, thus breaking the WARM chain.

> > > > > > I then also repeated the tests, but this time using compressible values. > The > > regression in this case is much higher, may be 15% or more. > > > > Sounds on higher side. >

Yes, definitely. If we can't reduce that, we might want to provide a table-level option to explicitly turn WARM off on such tables.
> IIUC, by the time you are comparing tuple attrs to check for modified > columns, you don't have the compressed values for new tuple. >

I think it depends. If the value is not being modified, then we will get both values as compressed. At least I confirmed that with your example, running an update which only changes c1. Don't know if that holds for all cases.

> > I know you had > > raised concerns, but Robert confirmed that (IIUC) it's not a problem > today. > > > > Yeah, but I am not sure if we can take Robert's statement as some sort > of endorsement for what the patch does. >

Sure.

> > We will figure out how to deal with it if we ever add support for > different > > compression algorithms or compression levels. And I also think this is > kinda > > synthetic use case and the fact that there is not much regression with > > indexes as large as 2K bytes seems quite comforting to me. > > > > I am not sure if we can consider it as completely synthetic because we > might see some similar cases for json datatypes. Can we once try to > see the impact when the same test runs from multiple clients?

Ok. It might become hard to control HOT behaviour though, or we will need to do a mix of WARM/HOT updates. Will see if this is something easily doable by setting a high FF etc.

> For > your information, I am also trying to setup some tests along with one > of my colleague and we will report the results once the tests are > complete. >

That'll be extremely helpful, especially if it's something close to a real-world scenario. Thanks for doing that.

Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Mar 29, 2017 at 4:42 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Wed, Mar 29, 2017 at 1:10 PM, Pavan Deolasee > <pavan.deola...@gmail.com> wrote: > > > > On Wed, Mar 29, 2017 at 12:02 PM, Amit Kapila <amit.kapil...@gmail.com> > > wrote: > >> > >> On Wed, Mar 29, 2017 at 11:52 AM, Amit Kapila <amit.kapil...@gmail.com> > >> wrote: > > > > Then during recheck, we pass already compressed values to > > index_form_tuple(). But my point is, the following code will ensure that > we > > don't compress it again. My reading is that the first check for > > !VARATT_IS_EXTENDED will return false if the value is already compressed. > > > > You are right. I was confused with previous check of VARATT_IS_EXTERNAL. >

Ok, thanks.

> > > TBH I couldn't find why the original index insertion code will always > supply > > uncompressed values. > > > > Just try by inserting large value of text column ('aa.bbb') > upto 2.5K. Then have a breakpoint in heap_prepare_insert and > index_form_tuple, and debug both the functions, you can find out that > even though we compress during insertion in heap, the index will > compress the original value again. >

Ok, tried that. AFAICS index_form_tuple gets compressed values.

> > Yeah probably you are right, but I am not sure if it is good idea to > compare compressed values. >

Again, I don't see a problem there.

> I think with this new changes in btrecheck, it would appear to be much > costlier as compare to what you have few versions back. I am afraid > that it can impact performance for cases where there are few WARM > updates in chain and many HOT updates as it will run recheck for all > such updates.

My feeling is that the recheck could be costly for very fat indexes, but not doing WARM could be costly too for such indexes. We can possibly construct a worst case where we:

1. set up a table with a fat index
2. do a WARM update to a tuple
3. then do several HOT updates to the same tuple
4. query the row via the fat index
Initialisation:

-- Adjust parameters to force index scans
-- enable_seqscan to false
-- seq_page_cost = 1

DROP TABLE IF EXISTS pgbench_accounts;

CREATE TABLE pgbench_accounts (
    aid      text,
    bid      bigint,
    abalance bigint,
    filler1  text DEFAULT md5(random()::text),
    filler2  text DEFAULT md5(random()::text),
    filler3  text DEFAULT md5(random()::text),
    filler4  text DEFAULT md5(random()::text),
    filler5  text DEFAULT md5(random()::text),
    filler6  text DEFAULT md5(random()::text),
    filler7  text DEFAULT md5(random()::text),
    filler8  text DEFAULT md5(random()::text),
    filler9  text DEFAULT md5(random()::text),
    filler10 text DEFAULT md5(random()::text),
    filler11 text DEFAULT md5(random()::text),
    filler12 text DEFAULT md5(random()::text)
) WITH (fillfactor=90);

\set end 0
\set start (:end + 1)
\set end (:start + (:scale * 100))

INSERT INTO pgbench_accounts
SELECT generate_series(:start, :end)::text || <2300 chars string>,
       (random()::bigint) % :scale, 0;

CREATE UNIQUE INDEX pgb_a_aid ON pgbench_accounts(aid);
CREATE INDEX pgb_a_filler1 ON pgbench_accounts(filler1);
CREATE INDEX pgb_a_filler2 ON pgbench_accounts(filler2);
CREATE INDEX pgb_a_filler3 ON pgbench_accounts(filler3);
CREATE INDEX pgb_a_filler4 ON pgbench_accounts(filler4);

-- Force a WARM update on one row
UPDATE pgbench_accounts SET filler1 = 'X'
 WHERE aid = '100' || repeat('abcdefghij', 2);

Test:

-- Fetch the row using the fat index. Since the row contains a
BEGIN;
SELECT substring(aid, 1, 10) FROM pgbench_accounts
 WHERE aid = '100' || <2300 chars string> ORDER BY aid;
UPDATE pgbench_accounts SET abalance = abalance + 100
 WHERE aid = '100' || <2300 chars string>;
END;

I did four 5-minute runs with master and WARM, and there is probably a 2-3% regression.
(Results with 5 mins tests, txns is total for 5 mins, idx_scan is number of scans on the fat index)

master:
 txns    idx_scan
 414117  828233
 411109  822217
 411848  823695
 408424  816847

WARM:
 txns    idx_scan
 404139  808277
 398880  797759
 399949  799897
 397927  795853

==

I then also repeated the tests, but this time using compressible values. The regression in this case is much higher, maybe 15% or more.

INSERT INTO pgbench_accounts
SELECT generate_series(:start, :end)::text || repeat('abcdefghij', 2),
       (random()::bigint) % :scale, 0;

-- Fetch the row using the fat index. Since the row contains a
BEGIN;
SELECT substring(aid, 1, 10) FROM pgbench_accounts
 WHERE aid = '100' || repeat('abcdefghij', 2) ORDER BY aid;
UPDATE pgbench_accounts SET abalance = abalance + 100
 WHERE aid = '100' || repeat('abcdefghij', 2);
END;

(Results with 5 mins tests, txns is total for 5 mins, idx_scan is number of scans on the fat index)

master:
 txns   idx_scan
 56976  113953
 56822  113645
 56915  113831
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Mar 29, 2017 at 12:02 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Wed, Mar 29, 2017 at 11:52 AM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > > On Tue, Mar 28, 2017 at 10:35 PM, Pavan Deolasee > > <pavan.deola...@gmail.com> wrote: > >> > >> On Tue, Mar 28, 2017 at 7:04 PM, Amit Kapila <amit.kapil...@gmail.com> > >> wrote: > >>> > >>> For such an heap insert, we will pass > >>> the actual value of column to index_form_tuple during index insert. > >>> However during recheck when we fetch the value of c2 from heap tuple > >>> and pass it index tuple, the value is already in compressed form and > >>> index_form_tuple might again try to compress it because the size will > >>> still be greater than TOAST_INDEX_TARGET and if it does so, it might > >>> make recheck fail. > >>> > >> > >> Would it? I thought "if > >> (!VARATT_IS_EXTENDED(DatumGetPointer(untoasted_values[i]))" check > should > >> prevent that. But I could be reading those macros wrong. They are > probably > >> heavily uncommented and it's not clear what each of those VARATT_* > macro do. > >> > > > > That won't handle the case where it is simply compressed. You need > > check like VARATT_IS_COMPRESSED to take care of compressed heap > > tuples, but then also it won't work because heap_tuple_fetch_attr() > > doesn't handle compressed tuples. You need to use > > heap_tuple_untoast_attr() to handle the compressed case. Also, we > > probably need to handle other type of var attrs. Now, If we want to > > do all of that index_form_tuple() might not be the right place, we > > probably want to handle it in caller or provide an alternate API. > > > > Another related, index_form_tuple() has a check for VARATT_IS_EXTERNAL > not VARATT_IS_EXTENDED, so may be that is cause of confusion for you, > but as I mentioned even if you change the check heap_tuple_fetch_attr > won't suffice the need. 
> >

I am confused :-( Assuming a big-endian machine:

VARATT_IS_4B_U - !toasted && !compressed
VARATT_IS_4B_C - compressed (may or may not be toasted)
VARATT_IS_4B   - !toasted (may or may not be compressed)
VARATT_IS_1B_E - toasted

#define VARATT_IS_EXTERNAL(PTR)            VARATT_IS_1B_E(PTR)
#define VARATT_IS_EXTENDED(PTR)            (!VARATT_IS_4B_U(PTR))

So VARATT_IS_EXTENDED means that the value is (toasted || compressed). If we are looking at a value from the heap (untoasted) then it implies in-heap compression. If we are looking at a just-untoasted value, then it means compression in the toast.

index_form_tuple() first checks if the value is externally toasted and fetches the untoasted value if so. After that it checks !VARATT_IS_EXTENDED, i.e. whether the value is (!toasted && !compressed), and only then tries to apply compression to it. It can't be a toasted value because if it was, we just untoasted it. But it can be compressed either in the heap or in the toast, in which case we don't try to compress it again. That makes sense because if the value is already compressed there is no point applying compression again.

Now what you're suggesting (it seems) is that when in-heap compression is used and ExecInsertIndexTuples calls FormIndexDatum to create index tuple values, it always passes uncompressed heap values. So when the index tuple is originally inserted, index_form_tuple() will try to compress it and see if it fits in the index. Then during recheck, we pass already-compressed values to index_form_tuple(). But my point is, the following code will ensure that we don't compress it again. My reading is that the first check for !VARATT_IS_EXTENDED will return false if the value is already compressed.

    /*
     * If value is above size target, and is of a compressible datatype,
     * try to compress it in-line.
     */
    if (!VARATT_IS_EXTENDED(DatumGetPointer(untoasted_values[i])) &&
        VARSIZE(DatumGetPointer(untoasted_values[i])) > TOAST_INDEX_TARGET &&
        (att->attstorage == 'x' || att->attstorage == 'm'))
    {
        Datum       cvalue = toast_compress_datum(untoasted_values[i]);

        if (DatumGetPointer(cvalue) != NULL)
        {
            /* successful compression */
            if (untoasted_free[i])
                pfree(DatumGetPointer(untoasted_values[i]));
            untoasted_values[i] = cvalue;
            untoasted_free[i] = true;
        }
    }

TBH I couldn't find why the original index insertion code will always supply uncompressed values. But even if it does, and even if the recheck gets the value in compressed form, I don't see how we will double-compress it. As far as,
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 28, 2017 at 8:34 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Mon, Mar 27, 2017 at 10:25 PM, Pavan Deolasee > <pavan.deola...@gmail.com> wrote: > > On Tue, Mar 28, 2017 at 1:59 AM, Robert Haas <robertmh...@gmail.com> > wrote: > >> On Thu, Mar 23, 2017 at 2:47 PM, Pavan Deolasee > >> <pavan.deola...@gmail.com> wrote: > >> > It's quite hard to say that until we see many more benchmarks. As > author > >> > of > >> > the patch, I might have got repetitive with my benchmarks. But I've > seen > >> > over 50% improvement in TPS even without chain conversion (6 indexes > on > >> > a 12 > >> > column table test). > >> > >> This seems quite mystifying. What can account for such a large > >> performance difference in such a pessimal scenario? It seems to me > >> that without chain conversion, WARM can only apply to each row once > >> and therefore no sustained performance improvement should be possible > >> -- unless rows are regularly being moved to new blocks, in which case > >> those updates would "reset" the ability to again perform an update. > >> However, one would hope that most updates get done within a single > >> block, so that the row-moves-to-new-block case wouldn't happen very > >> often. > > > > I think you're confusing between update chains that stay within a block > vs > > HOT/WARM chains. Even when the entire update chain stays within a block, > it > > can be made up of multiple HOT/WARM chains and each of these chains offer > > ability to do one WARM update. So even without chain conversion, every > > alternate update will be a WARM update. So the gains are perpetual. > > You're right, I had overlooked that. But then I'm confused: how does > the chain conversion stuff help as much as it does? You said that you > got a 50% improvement from WARM, because we got to skip half the index > updates. But then you said with chain conversion you got an > improvement of more like 100%. 
> However, I would think that on this workload, chain conversion shouldn't save much. If you're sweeping through the database constantly performing updates, the updates ought to be a lot more frequent than the vacuums.
>
> No?

These tests were done on a very large table of 80M rows. The table itself was wide with 15 columns and a few indexes. So in an 8hr test, master could do only 55M updates whereas WARM did 105M updates. There were 4 autovacuum cycles in both these runs. So while there were many updates, I am sure autovacuum must have helped to increase the percentage of WARM updates (from ~50% after steady state to ~67% after steady state). Also I said more than 50%, but it was probably close to 65%. Unfortunately these tests were done on different hardware, with different settings and even slightly different scale factors. So they may not be exactly comparable. But there is no doubt chain conversion will help to some extent. I'll repeat the benchmark with chain conversion turned off and report the exact difference. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 28, 2017 at 7:04 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > For such an heap insert, we will pass > the actual value of column to index_form_tuple during index insert. > However during recheck when we fetch the value of c2 from heap tuple > and pass it index tuple, the value is already in compressed form and > index_form_tuple might again try to compress it because the size will > still be greater than TOAST_INDEX_TARGET and if it does so, it might > make recheck fail. > > Would it? I thought the "if (!VARATT_IS_EXTENDED(DatumGetPointer(untoasted_values[i]))" check should prevent that. But I could be reading those macros wrong. They are largely uncommented, and it's not clear what each of those VARATT_* macros does. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
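The flag logic under discussion can be modelled in a few lines (a toy sketch, not PostgreSQL code; `is_extended` stands in for the VARATT_IS_EXTENDED macro, i.e. "toasted or compressed"):

```python
# Toy model of the varlena flags: EXTENDED means "toasted or compressed",
# which is exactly !VARATT_IS_4B_U in PostgreSQL's terms.
def is_extended(toasted, compressed):
    return toasted or compressed

# A value compressed in the heap (or in the toast) is EXTENDED, so the
# !VARATT_IS_EXTENDED test in index_form_tuple() skips re-compression.
assert is_extended(toasted=False, compressed=True)

# Only a plain, uncompressed inline value passes the check and may then
# be compressed for the index.
assert not is_extended(toasted=False, compressed=False)
```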
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 28, 2017 at 4:07 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > > Noted few cosmetic issues in 0005_warm_updates_v21: > > 1. > pruneheap.c(939): warning C4098: 'heap_get_root_tuples' : 'void' > function returning a value > Thanks. Will fix. > > 2. > + * HCWC_WARM_UPDATED_TUPLE - a tuple with HEAP_WARM_UPDATED is found > somewhere > + *in the chain. Note that when a tuple is WARM > + *updated, both old and new versions are marked > + *with this flag/ > + * > + * HCWC_WARM_TUPLE - a tuple with HEAP_WARM_TUPLE is found somewhere in > + * the chain. > + * > + * HCWC_CLEAR_TUPLE - a tuple without HEAP_WARM_TUPLE is found somewhere > in > + * the chain. > > Description of all three flags is same. > > Well the description is different (and correct), but given that it confused you, I think I should rewrite those comments. Will do. > 3. > + * HCWC_WARM_UPDATED_TUPLE - a tuple with HEAP_WARM_UPDATED is found > somewhere > + *in the chain. Note that when a tuple is WARM > + *updated, both old and new versions are marked > + * with this flag/ > > Spurious '/' at end of line. > > Thanks. Will fix. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 28, 2017 at 4:05 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > As asked previously, can you explain me on what basis are you > considering it robust? The comments on top of datumIsEqual() clearly > indicates the danger of using it for toasted values (Also, it will > probably not give the answer you want if either datum has been > "toasted".). Hmm. I don't see why the new code in recheck is unsafe. The index values themselves can't be toasted (IIUC), but they can be compressed. index_form_tuple() already untoasts any toasted heap attributes and compresses them if needed. So once we pass heap values via index_form_tuple() we should have exactly the same index values as when they were inserted. Or am I missing something obvious here? If you think that because we are using it during > heap_update to find modified columns, then I think that is not right > comparison, because there we are comparing compressed value (of old > tuple) with uncompressed value (of new tuple) which should always give > result as false. > > Hmm, this seems like a problem. While HOT could tolerate occasional false results (i.e. reporting a heap column as modified even though it is not), WARM assumes that if the heap has reported different values, then they had better be different and should result in different index values. Because that's how recheck works later. Since index expressions are not supported, I wonder if toasted heap values are the only remaining problem in this area. So heap_tuple_attr_equals() should first detoast the heap values and then do the comparison. I already have a test case that fails for this reason, so let me try this approach. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 28, 2017 at 1:32 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Is the WARM tap test suite supposed to work when applied without all the > other patches? I just tried applied that one and running "make check -C > src/test/modules", and it seems to hang after giving "ok 5" for > t/002_warm_stress.pl. (I had to add a Makefile too, attached.) > > Yeah, sorry. Looks like I forgot to git add the Makefile. BTW just tested on Ubuntu, and it works fine on that too. FWIW I'm using perl v5.22.1 and IPC::Run 0.94 (assuming I got the versions correctly). $ make -C src/test/modules/warm/ prove-check make: Entering directory '/home/ubuntu/postgresql/src/test/modules/warm' rm -rf /home/ubuntu/postgresql/src/test/modules/warm/tmp_check/log cd . && TESTDIR='/home/ubuntu/postgresql/src/test/modules/warm' PATH="/home/ubuntu/postgresql/tmp_install/home/ubuntu/pg-master-install/bin:$PATH" LD_LIBRARY_PATH="/home/ubuntu/postgresql/tmp_install/home/ubuntu/pg-master-install/lib" PGPORT='65432' PG_REGRESS='/home/ubuntu/postgresql/src/test/modules/warm/../../../../src/test/regress/pg_regress' prove -I ../../../../src/test/perl/ -I . --verbose t/*.pl t/001_recovery.pl . 1..2 ok 1 - balanace matches after recovery ok 2 - sum(delta) matches after recovery ok t/002_warm_stress.pl .. 1..10 ok 1 - dummy test passed ok 2 - Fine match ok 3 - psql exited normally ok 4 - psql exited normally ok 5 - psql exited normally ok 6 - psql exited normally ok 7 - psql exited normally ok 8 - psql exited normally ok 9 - psql exited normally ok 10 - Fine match ok All tests successful. Files=2, Tests=12, 22 wallclock secs ( 0.03 usr 0.00 sys + 7.94 cusr 2.41 csys = 10.38 CPU) Result: PASS make: Leaving directory '/home/ubuntu/postgresql/src/test/modules/warm' Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Mon, Mar 27, 2017 at 4:45 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Sat, Mar 25, 2017 at 1:24 PM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > > On Fri, Mar 24, 2017 at 11:49 PM, Pavan Deolasee > > <pavan.deola...@gmail.com> wrote: > >> > >> On Fri, Mar 24, 2017 at 6:46 PM, Amit Kapila <amit.kapil...@gmail.com> > >> wrote: > >>> > > > >> While looking at this problem, it occurred to me that the assumptions made > >> for hash indexes are also wrong :-( Hash index has the same problem as > >> expression indexes have. A change in heap value may not necessarily cause a > >> change in the hash key. If we don't detect that, we will end up having two > >> hash identical hash keys with the same TID pointer. This will cause the > >> duplicate key scans problem since hashrecheck will return true for both the > >> hash entries. > > Isn't it possible to detect duplicate keys in hashrecheck if we > compare both hashkey and tid stored in index tuple with the > corresponding values from heap tuple? > > Hmm, I don't think that will work. For example, say we have a tuple (X, Y, Z) in the heap with a btree index on X and a hash index on Y. If that is updated to (X, Y', Z), say we do a WARM update and insert a new entry in the hash index. Now if Y and Y' both generate the same hashkey, we will have identical <hashkey, TID> tuples in the hash index, leading to duplicate key scans. I think one way to solve this is to pass both old and new heap values to amwarminsert and expect each AM to detect duplicates and avoid creating a WARM pointer if the index keys are exactly the same (we can do that since there already exists another index tuple with the same keys pointing to the same root TID). Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
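The duplicate-entry hazard can be seen with a toy hash function (purely illustrative; this is not PostgreSQL's hash opclass):

```python
# Two different heap values can produce the same hash key, so after a
# WARM update the hash index would hold two identical <hashkey, root-TID>
# entries -- the duplicate key scans problem described above.
def toy_hash(value):
    return sum(map(ord, value)) % 16

root_tid = (1, 5)           # (block, offset) of the root line pointer
y_old, y_new = "ab", "ba"   # different values, same toy hash key

old_entry = (toy_hash(y_old), root_tid)
new_entry = (toy_hash(y_new), root_tid)

# Identical index tuples: a recheck based only on <hashkey, TID> would
# accept both entries and return the same heap row twice.
assert y_old != y_new and old_entry == new_entry
```

This is why comparing hashkey and TID alone cannot distinguish the two entries; the AM has to detect the duplicate at insert time.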
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 28, 2017 at 7:49 AM, Bruce Momjian <br...@momjian.us> wrote: > On Mon, Mar 27, 2017 at 04:29:56PM -0400, Robert Haas wrote: > > On Thu, Mar 23, 2017 at 2:47 PM, Pavan Deolasee > > <pavan.deola...@gmail.com> wrote: > > > It's quite hard to say that until we see many more benchmarks. As > author of > > > the patch, I might have got repetitive with my benchmarks. But I've > seen > > > over 50% improvement in TPS even without chain conversion (6 indexes > on a 12 > > > column table test). > > > > This seems quite mystifying. What can account for such a large > > performance difference in such a pessimal scenario? It seems to me > > that without chain conversion, WARM can only apply to each row once > > and therefore no sustained performance improvement should be possible > > -- unless rows are regularly being moved to new blocks, in which case > > those updates would "reset" the ability to again perform an update. > > However, one would hope that most updates get done within a single > > block, so that the row-moves-to-new-block case wouldn't happen very > > often. > > > > I'm perplexed. > > Yes, I asked the same question in this email: > > https://www.postgresql.org/message-id/2017032119. > ge16...@momjian.us > > And I've answered it so many times by now :-) Just to add more to what I just said in another email, note that HOT/WARM chains are created when a new root line pointer is created in the heap (a line pointer that has an index pointing to it). And a new root line pointer is created when a non-HOT/non-WARM update is performed. As soon as you do a non-HOT/non-WARM update, the next update can again be a WARM update even when everything fits in a single block. That's why for a workload which doesn't do HOT updates and where not all index keys are updated, you'll find every alternate update to a row to be a WARM update, even when there is no chain conversion. That itself can save lots of index bloat, reduce IO on the index and WAL. 
Let me know if it's still not clear, and I can draw some diagrams to explain it. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 28, 2017 at 1:59 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Thu, Mar 23, 2017 at 2:47 PM, Pavan Deolasee > <pavan.deola...@gmail.com> wrote: > > It's quite hard to say that until we see many more benchmarks. As author > of > > the patch, I might have got repetitive with my benchmarks. But I've seen > > over 50% improvement in TPS even without chain conversion (6 indexes on > a 12 > > column table test). > > This seems quite mystifying. What can account for such a large > performance difference in such a pessimal scenario? It seems to me > that without chain conversion, WARM can only apply to each row once > and therefore no sustained performance improvement should be possible > -- unless rows are regularly being moved to new blocks, in which case > those updates would "reset" the ability to again perform an update. > However, one would hope that most updates get done within a single > block, so that the row-moves-to-new-block case wouldn't happen very > often. > >

I think you're confusing update chains that stay within a block with HOT/WARM chains. Even when the entire update chain stays within a block, it can be made up of multiple HOT/WARM chains, and each of these chains offers the ability to do one WARM update. So even without chain conversion, every alternate update will be a WARM update. So the gains are perpetual.

For example, take a simplistic case of a table with just one tuple and four indexes, where every update updates just one of the indexes. Assuming no WARM chain conversion, this is what would happen for every update:

1. WARM update, new entry in just one index
2. Regular update, new entries in all indexes
3. WARM update, new entry in just one index
4. Regular update, new entries in all indexes

At the end of N updates (assuming all fit in the same block), one index will have N entries and the rest will have N/2 entries. Compare that against master:

1. Regular update, new entries in all indexes
2. Regular update, new entries in all indexes
3. Regular update, new entries in all indexes
4. Regular update, new entries in all indexes

At the end of N updates (assuming all fit in the same block), all indexes will have N entries. So with WARM we reduce bloat in 3 indexes. And WARM works almost in a perpetual way even without chain conversion. If you see the graph I shared earlier (attached again), without WARM chain conversion the rate of WARM updates settles down to 50%, which is not surprising given what I explained above.

Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
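The alternating pattern described in that message can be simulated in a few lines (a sketch of the bookkeeping, under the stated assumption that every update changes a key in only the first index and everything stays in one block):

```python
# One tuple, four indexes; without chain conversion WARM and regular
# updates alternate.  Count index entries added over N updates.
N = 8
warm_entries = [1] * 4      # entries per index after the initial insert
master_entries = [1] * 4

for i in range(N):
    if i % 2 == 0:                      # WARM update
        warm_entries[0] += 1            # new entry only in the changed index
    else:                               # regular (non-HOT) update
        warm_entries = [e + 1 for e in warm_entries]
    master_entries = [e + 1 for e in master_entries]  # master: always all

print(warm_entries)    # with WARM: index 0 gains N entries, the rest N/2
print(master_entries)  # on master: every index gains N entries
```

With N = 8 the simulation ends with [9, 5, 5, 5] under WARM versus [9, 9, 9, 9] on master, matching the "one index gets N entries, the rest N/2" claim.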
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 28, 2017 at 1:32 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Is the WARM tap test suite supposed to work when applied without all the > other patches? I just tried applied that one and running "make check -C > src/test/modules", and it seems to hang after giving "ok 5" for > t/002_warm_stress.pl. (I had to add a Makefile too, attached.) > > These tests should run without WARM. I wonder though if IPC::Run's start/pump/finish facility is fully portable. Andrew, in an off-list conversation, reminded me that there are no (or maybe one) tests currently using that in Postgres. I've run these tests on OSX, and will try on some Linux platform too. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Stale comments in vacuumlazy.c
I happened to notice a stale comment at the very beginning of vacuumlazy.c. ISTM we forgot to fix that when we introduced the FSM. With the FSM, vacuum no longer needed to track per-page free space info. I propose the attached fix. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

[Attachment: vacuumlazy_comment_fixes.patch]
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Sat, 25 Mar 2017 at 11:03 PM, Peter Geoghegan <p...@bowt.ie> wrote: > On Sat, Mar 25, 2017 at 12:54 AM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > > I am not sure how do you want to binary compare two datums, if you are > > thinking datumIsEqual(), that won't work. I think you need to use > > datatype specific compare function something like what we do in > > _bt_compare(). > > How will that interact with types like numeric, that have display > scale or similar? > > I wonder why Amit thinks that datumIsEqual won't work once we convert the heap values to an index tuple and then fetch using index_get_attr. After all, that's how the current index tuple was constructed when it was inserted. In fact, we must not rely on _bt_compare because that might return a "false positive" even for two different heap binary values (I think). To decide whether to do a WARM update or not, heap_update relies only on binary comparison. Could it happen that for two different binary heap values, we still compute the same index attribute? Even when expression indexes are not supported? Thanks, Pavan > -- > Peter Geoghegan > -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Mar 24, 2017 at 6:46 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > I was worried for the case if the index is created non-default > collation, will the datumIsEqual() suffice the need. Now again > thinking about it, I think it will because in the index tuple we are > storing the value as in heap tuple. However today it occurred to me > how will this work for toasted index values (index value > > TOAST_INDEX_TARGET). It is mentioned on top of datumIsEqual() that it > probably won't work for toasted values. Have you considered that > point? > > No, I haven't, and thanks for bringing that up. Now that I think more about it and look at the code, I think the naive way of just comparing an index attribute value against heap values is probably wrong. The example of TOAST_INDEX_TARGET is one such case, but I wonder if there are other varlena attributes that we might store differently in heap and index. Like index_form_tuple() -> heap_fill_tuple seems to do some churning for varlenas. It's not clear to me if index_get_attr will return values that are binary comparable to heap values. I wonder if calling index_form_tuple on the heap values, fetching attributes via index_get_attr on both index tuples and then doing a binary compare is a more robust idea. Or maybe that's just duplicating effort. While looking at this problem, it occurred to me that the assumptions made for hash indexes are also wrong :-( Hash index has the same problem as expression indexes have. A change in heap value may not necessarily cause a change in the hash key. If we don't detect that, we will end up having two identical hash keys with the same TID pointer. This will cause the duplicate key scans problem since hashrecheck will return true for both the hash entries. That's a bummer as far as supporting WARM for hash indexes is concerned, unless we find a way to avoid duplicate index entries.
Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Mar 24, 2017 at 4:04 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Fri, Mar 24, 2017 at 12:25 AM, Pavan Deolasee > <pavan.deola...@gmail.com> wrote: > > > > > > On Thu, Mar 23, 2017 at 7:53 PM, Amit Kapila <amit.kapil...@gmail.com> > > > > The general sense I've got > > here is that we're ok to push some work in background if it helps the > > real-time queries, and I kinda agree with that. > > > > I don't think we can define this work as "some" work, it can be a lot > of work depending on the number of indexes. Also, I think for some > cases it will generate maintenance work without generating benefit. > For example, when there is one index on a table and there are updates > for that index column. > > That's a fair point. I think we can address this, though. At the end of the first index scan we would know how many WARM pointers the index has and whether it's worth doing a second scan. For the case you mentioned, we will do a second scan just on that one index, skip all other indexes, and still achieve the same result. On the other hand, if one index receives many updates and other indexes are rarely updated, then we might leave behind a few WARM chains and won't be able to do IOS on those pages. But given the premise that other indexes are receiving rare updates, it may not be a problem. Note: the code is not currently written that way, but it should be a fairly small change. The other thing that we didn't talk about is that vacuum will need to track dead tuples and WARM candidate chains separately, which increases memory overhead. So for very large tables, and for the same amount of maintenance_work_mem, one round of vacuum will be able to clean fewer pages. We can work out a more compact representation, but that is not done currently. > > > But this is clearly not > > PG 10 material. > > > > I don't see much discussion about this aspect of the patch, so not > sure if it is acceptable to increase the cost of vacuum.
Now, I don't > know if your idea of GUC's make it such that the additional cost will > occur seldom and this additional pass has a minimal impact which will > make it acceptable. Yeah, I agree. I'm trying to schedule some more benchmarks, but any help is appreciated. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Mar 23, 2017 at 7:53 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Thu, Mar 23, 2017 at 3:44 PM, Pavan Deolasee > > > > > Yes, this is a very fair point. The way I proposed to address this upthread > > is by introducing a set of threshold/scale GUCs specific to WARM. So users > > can control when to invoke WARM cleanup. Only if the WARM cleanup is > > required, we do 2 index scans. Otherwise vacuum will work the way it works > > today without any additional overhead. > > > > I am not sure on what basis user can set such parameters, it will be > quite difficult to tune those parameters. I think the point is > whatever threshold we keep, once that is crossed, it will perform two > scans for all the indexes. Well, that applies even to vacuum parameters, no? The general sense I've got here is that we're ok to push some work into the background if it helps the real-time queries, and I kinda agree with that. If WARM improves things in a significant manner even with this additional maintenance work, it's still worth doing. Having said that, I see many ways we can improve on this later. Like we can track, somewhere else, information about tuples that may have received WARM updates (I think it will need to be a per-index bitmap or so) and use that to do WARM chain conversion in a single index pass. But this is clearly not PG 10 material. > IIUC, this conversion of WARM chains is > required so that future updates can be WARM or is there any other > reason? I see this as a big penalty for future updates. > It's also necessary for index-only scans. But I don't see this as a big penalty for future updates, because if there are indeed significant WARM updates, then not preparing for future updates will result in write amplification, something we are trying to solve here and something which seems to be showing good gains. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
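For illustration, such threshold/scale GUCs could gate the second index pass the same way autovacuum's threshold and scale factor gate a vacuum. A minimal sketch of the trigger condition (the GUC names below are invented for the example and do not come from the patch):

```python
# Hypothetical trigger for WARM chain cleanup, modelled on
# autovacuum_vacuum_threshold / autovacuum_vacuum_scale_factor.
def needs_warm_cleanup(warm_chains, reltuples,
                       warm_cleanup_threshold=50,
                       warm_cleanup_scale_factor=0.1):
    # Skip the extra index scans entirely unless enough WARM chains
    # have accumulated to make chain conversion worthwhile.
    return warm_chains > (warm_cleanup_threshold +
                          warm_cleanup_scale_factor * reltuples)

# With the defaults, a 10000-tuple table triggers cleanup only once
# more than 1050 WARM chains have accumulated.
assert not needs_warm_cleanup(warm_chains=100, reltuples=10000)
assert needs_warm_cleanup(warm_chains=2000, reltuples=10000)
```

When the condition is false, vacuum would run exactly as it does today, with no additional index scans.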
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Mar 23, 2017 at 11:44 PM, Mithun Cy <mithun...@enterprisedb.com> wrote: > Hi Pavan, > On Thu, Mar 23, 2017 at 12:19 AM, Pavan Deolasee > <pavan.deola...@gmail.com> wrote: > > Ok, no problem. I did some tests on AWS i2.xlarge instance (4 vCPU, 30GB > > RAM, attached SSD) and results are shown below. But I think it is > important > > to get independent validation from your side too, just to ensure I am not > > making any mistake in measurement. I've attached naively put together > > scripts which I used to run the benchmark. If you find them useful, > please > > adjust the paths and run on your machine. > > I did a similar test appears. Your v19 looks fine to me, it does not > cause any regression, On the other hand, I also ran tests reducing > table fillfactor to 80 there I can see a small regression 2-3% in > average when updating col2 and on updating col9 again I do not see any > regression. > > Thanks Mithun for repeating the tests and confirming that the v19 patch looks ok. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Mar 23, 2017 at 4:08 PM, Pavan Deolasee <pavan.deola...@gmail.com> wrote: > > > On Thu, Mar 23, 2017 at 3:02 PM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > >> >> >> That sounds like you are dodging the actual problem. I mean you can >> put that same PageIsFull() check in master code as well and then you >> will most probably again see the same regression. > > > Well I don't see it that way. There was a specific concern about a > specific workload that WARM might regress. I think this change addresses > that. Sure if you pick that one piece, put it in master first and then > compare against rest of the WARM code, you will see a regression. > BTW the PageIsFull() check may not help as much in master as it does with WARM. In master we anyway bail out early after a couple of column checks. In master it may help to reduce the 10% drop that we see while updating the last index column, but if we compare master and WARM with the patch applied, the regression should be quite nominal. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Mar 23, 2017 at 3:02 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > That sounds like you are dodging the actual problem. I mean you can > put that same PageIsFull() check in master code as well and then you > will most probably again see the same regression. Well I don't see it that way. There was a specific concern about a specific workload that WARM might regress. I think this change addresses that. Sure if you pick that one piece, put it in master first and then compare against rest of the WARM code, you will see a regression. But I thought what we were worried is WARM causing regression to some existing user, who might see her workload running 10% slower, which this change mitigates. > Also, I think if we > test at fillfactor 80 or 75 (which is not unrealistic considering an > update-intensive workload), then we might again see regression. Yeah, we might, but it will be lesser than before, may be 2% instead of 10%. And by doing this we are further narrowing an already narrow test case. I think we need to see things in totality and weigh in costs-benefits trade offs. There are numbers for very common workloads, where WARM may provide 20, 30 or even more than 100% improvements. Andres and Alvaro already have other ideas to address this problem even further. And as I said, we can pass-in index specific information and make that routine bail-out even earlier. We need to accept that WARM will need to do more attr checks than master, especially when there are more than 1 indexes on the table, and sometimes those checks will go waste. I am ok if we want to provide table-specific knob to disable WARM, but not sure if others would like that idea. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
Thanks Amit. v19 addresses some of the comments below. On Thu, Mar 23, 2017 at 10:28 AM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Wed, Mar 22, 2017 at 4:06 PM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > > On Tue, Mar 21, 2017 at 6:47 PM, Pavan Deolasee > > <pavan.deola...@gmail.com> wrote: > >> > >>> > >> > >> Please find attached rebased patches. > >> > > > > Few comments on 0005_warm_updates_v18.patch: > > > > Few more comments on 0005_warm_updates_v18.patch: > 1. > @@ -234,6 +241,25 @@ index_beginscan(Relation heapRelation, > scan->heapRelation = heapRelation; > scan->xs_snapshot = snapshot; > > + /* > + * If the index supports recheck, > make sure that index tuple is saved > + * during index scans. Also build and cache IndexInfo which is used by > + * amrecheck routine. > + * > + * XXX Ideally, we should look at > all indexes on the table and check if > + * WARM is at all supported on the base table. If WARM is not supported > + * then we don't need to do any recheck. > RelationGetIndexAttrBitmap() does > + * do that and sets rd_supportswarm after looking at all indexes. But we > + * don't know if the function was called earlier in the > session when we're > + * here. We can't call it now because there exists a risk of causing > + * deadlock. > + */ > + if (indexRelation->rd_amroutine->amrecheck) > + { > +scan->xs_want_itup = true; > + scan->indexInfo = BuildIndexInfo(indexRelation); > + } > + > > Don't we need to do this rechecking during parallel scans? Also what > about bitmap heap scans? > > Yes, we need to handle parallel scans. Bitmap scans are not a problem because it can never return the same TID twice. I fixed this though by moving this inside index_beginscan_internal. > 2. > +++ b/src/backend/access/nbtree/nbtinsert.c > - > typedef struct > > Above change is not require. > > Sure. Fixed. > 3. 
> +_bt_clear_items(Page page, OffsetNumber *clearitemnos, uint16 nclearitems)
> +void _hash_clear_items(Page page, OffsetNumber *clearitemnos,
> +                       uint16 nclearitems)
>
> Both the above functions look exactly the same; isn't it better to have a
> single function like page_clear_items? If you want separation for
> different index types, then we can have one common function which can
> be called from different index types.

Yes, makes sense. Moved that to bufpage.c. The reason I originally had index-specific versions is that I started by putting the WARM flag in the IndexTuple header. But since the hash index does not have that bit free, I moved everything to a TID bit-flag. I still left index-specific wrappers, but they just call PageIndexClearWarmTuples.

> 4.
> -    if (callback(htup, callback_state))
> +    flags = ItemPointerGetFlags(&htup->t_tid);
> +    is_warm = ((flags & BTREE_INDEX_WARM_POINTER) != 0);
> +
> +    if (is_warm)
> +        stats->num_warm_pointers++;
> +    else
> +        stats->num_clear_pointers++;
> +
> +    result = callback(htup, is_warm, callback_state);
> +    if (result == IBDCR_DELETE)
> +    {
> +        if (is_warm)
> +            stats->warm_pointers_removed++;
> +        else
> +            stats->clear_pointers_removed++;
>
> The patch looks to be inconsistent in collecting stats for btree and
> hash. I don't see the above stats getting updated in the hash index code.

Fixed. The hashbucketcleanup signature is just getting a bit too long. Maybe we should move some of these counters into a structure and pass that around. Not done here though.

> 5.
> +btrecheck(Relation indexRel, IndexInfo *indexInfo, IndexTuple indexTuple,
> +          Relation heapRel, HeapTuple heapTuple)
> {
> ..
> +        if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
> +                          att->attlen))
> ..
> }
>
> Will this work if the index is using a non-default collation?

Not sure I understand that. Can you please elaborate?

> 6.
> +++ b/src/backend/access/nbtree/nbtxlog.c > @@ -390,83 +390,9 @@ btree_xlog_vacuum(XLogReaderState *record) > -#ifdef UNUSED > xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record); > > /* > - * This section of code is thought to be no longer needed, after analysis > - * of the calling paths. It is retained to allow the code to be reinstated > - * if a flaw is revealed in that thinking. > - * > .. > > Why does this patch need to remove the above code under #ifdef UNUSED > > Yeah, it isn't strictly necessary. But that dead code was coming in the way and hence I decided to strip it out. I can put it back if it's an issue or remove that as a separate commit first. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
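The collation question in review item 5 above is about the semantics of a byte-wise comparison. The following standalone sketch (plain C, not PostgreSQL code; a case-insensitive comparison merely stands in for a collation-aware comparator) illustrates how `datumIsEqual()`-style byte equality and collation-aware equality can disagree:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>

/* Byte-wise equality: the semantics of a datumIsEqual()-style check. */
static bool bytes_equal(const char *a, const char *b)
{
    return strlen(a) == strlen(b) && memcmp(a, b, strlen(a)) == 0;
}

/* Stand-in for a collation-aware comparison; here a case-insensitive
 * "collation" is simulated with POSIX strcasecmp. */
static bool collation_equal(const char *a, const char *b)
{
    return strcasecmp(a, b) == 0;
}
```

Byte-equal values compare equal under any collation, but the converse does not hold: two values a collation considers equal may still differ byte-wise, which is presumably the scenario the reviewer is asking about.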
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Mar 22, 2017 at 4:53 PM, Mithun Cy <mithun...@enterprisedb.com> wrote: > On Wed, Mar 22, 2017 at 3:44 PM, Pavan Deolasee > <pavan.deola...@gmail.com> wrote: > > > > This looks quite weird to me. Obviously these numbers are completely > > non-comparable. Even the time for VACUUM FULL goes up with every run. > > > > May be we can blame it on AWS instance completely, but the pattern in > your > > tests looks very similar where the number slowly and steadily keeps going > > up. If you do complete retest but run v18/v19 first and then run master, > may > > be we'll see a complete opposite picture? > > > > For those tests I have done tests in the order --- <Master, patch18, > patch18, Master> both the time numbers were same. Hmm, interesting. > One different thing > I did was I was deleting the data directory between tests and creating > the database from scratch. Unfortunately the machine I tested this is > not available. I will test same with v19 once I get the machine and > report you back. Ok, no problem. I did some tests on AWS i2.xlarge instance (4 vCPU, 30GB RAM, attached SSD) and results are shown below. But I think it is important to get independent validation from your side too, just to ensure I am not making any mistake in measurement. I've attached naively put together scripts which I used to run the benchmark. If you find them useful, please adjust the paths and run on your machine. I reverted back to UNLOGGED table because with WAL the results looked very weird (as posted earlier) even when I was taking a CHECKPOINT before each set and had set max_wal_size and checkpoint_timeout high enough to avoid any checkpoint during the run. Anyways, that's a matter of separate investigation and not related to this patch. I did two kinds of tests. a) update last column of the index b) update second column of the index v19 does considerably better than even master for the last column update case and pretty much inline for the second column update test. 
The reason is very clear: v19 determines early in the cycle that the buffer is already full and there is very little chance of doing a HOT update on the page. In that case, it does not check any columns for modification. Master, on the other hand, will scan through all 9 columns (for the last-column-update case) and incur the same kind of overhead doing wasteful work.

In each table below, the first, second and fourth columns show response time in ms (master, v18, v19) and the third and fifth columns show the percentage difference over master. (I hope the tables look fine, trying some text-table generator tool :-). Apologies if they look messed up.)

Second column update:

+------------+------------+--------+------------+--------+
|   Master   |    v18     |  diff  |    v19     |  diff  |
+------------+------------+--------+------------+--------+
|  96657.681 | 108122.868 | 11.86% |  96873.642 |  0.22% |
|  98546.35  | 110021.27  | 11.64% |  97569.187 | -0.99% |
|  99297.231 | 110550.054 | 11.33% | 100241.261 |  0.95% |
|  97196.556 | 110235.808 | 13.42% |  97216.207 |  0.02% |
|  99072.432 | 110477.327 | 11.51% |  97950.687 | -1.13% |
|  96730.056 | 109217.939 | 12.91% |  96929.617 |  0.21% |
+------------+------------+--------+------------+--------+

Last column update:

+------------+------------+--------+------------+--------+
|   Master   |    v18     |  diff  |    v19     |  diff  |
+------------+------------+--------+------------+--------+
| 112545.537 | 116563.608 |  3.57% | 103067.276 | -8.42% |
| 110169.224 | 115753.991 |  5.07% | 104411.44  | -5.23% |
| 112280.874 | 116437.11  |  3.70% | 104868.98  | -6.60% |
| 113171.556 | 116813.262 |  3.22% | 103907.012 | -8.19% |
| 110721.881 | 117442.709 |  6.07% | 104124.131 | -5.96% |
| 112138.601 | 115834.549 |  3.30% | 104858.624 | -6.49% |
+------------+------------+--------+------------+--------+

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

interest_attrs_tests.tar.gz
Description: GNU Zip compressed data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
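For reference, the percentage columns in the tables above are simply each patched run relative to master. A tiny helper (illustrative only, not part of the benchmark scripts) reproduces them:

```c
#include <assert.h>

/* Percentage difference of a patched run over master, as shown in the
 * tables above; positive means the patch is slower than master. */
static double pct_diff(double master_ms, double patched_ms)
{
    return (patched_ms - master_ms) / master_ms * 100.0;
}
```

For example, the first "Second column update" row gives (108122.868 - 96657.681) / 96657.681 * 100 ≈ 11.86%.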
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Mar 22, 2017 at 3:51 AM, Mithun Cy <mithun...@enterprisedb.com> wrote: > > CREATE INDEX testindx ON testtab (col1, col2, col3, col4, col5, col6, > col7, col8, col9); > Performance measurement tests: Ran12 times to eliminate run to run > latencies. > == > VACUUM FULL; > BEGIN; > UPDATE testtab SET col2 = md5(random()::text); > ROLLBACK; > > Response time recorded shows there is a much higher increase in response > time from 10% to 25% after the patch. > > After doing some tests on my side, I now think that there is something else going on, unrelated to the patch. I ran the same benchmark on AWS i2.xlarge machine with 32GB RAM. shared_buffers set to 16GB, max_wal_size to 256GB, checkpoint_timeout to 60min and autovacuum off. I compared master and v19, every time running 6 runs of the test. The database was restarted whenever changing binaries, tables dropped/recreated and checkpoint taken after each restart (but not between 2 runs, which I believe what you did too.. but correct me if that's a wrong assumption). Instead of col2, I am updating col9, but that's probably not too much relevant. VACUUM FULL; BEGIN; UPDATE testtab SET col9 = md5(random()::text); ROLLBACK; First set of 6 runs with master: 163629.8 181183.8 194788.1 194606.1 194589.9 196002.6 (database restart, table drop/create, checkpoint) First set of 6 runs with v19: 190566.55 228274.489 238110.202 239304.681 258748.189 284882.4 (database restart, table drop/create, checkpoint) Second set of 6 runs with master: 232267.5 298259.6 312315.1 341817.3 360729.2 385210.7 This looks quite weird to me. Obviously these numbers are completely non-comparable. Even the time for VACUUM FULL goes up with every run. May be we can blame it on AWS instance completely, but the pattern in your tests looks very similar where the number slowly and steadily keeps going up. If you do complete retest but run v18/v19 first and then run master, may be we'll see a complete opposite picture? 
Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Mar 22, 2017 at 8:43 AM, Pavan Deolasee <pavan.deola...@gmail.com> wrote: > > > BTW may I request another test with the attached patch? In this patch, we > check if the PageIsFull() even before deciding which attributes to check > for modification. If the page is already full, there is hardly any chance > of doing a HOT update (there could be a corner case where the new tuple is > smaller than the tuple used in previous UPDATE and we have just enough > space to do HOT update this time, but I can think that's too narrow). > > I would also request you to do a slightly different test where instead of updating the second column, we update the last column of the index i.e. col9. Would really appreciate if you share results with both master and v19 patch. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Mar 22, 2017 at 3:51 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> On Tue, Mar 21, 2017 at 8:10 PM, Robert Haas <robertmh...@gmail.com> wrote:
> > If the WAL writing hides the loss, then I agree that's not a big
> > concern. But if the loss is still visible even when WAL is written,
> > then I'm not so sure.
>
> The test table schema was taken from the earlier tests Pavan posted [1],
> hence it is UNLOGGED; all I tried was to stress the tests. Instead of
> updating 1 row at a time through pgbench (for which neither I nor Pavan
> saw any regression), I tried to update all the rows in a single
> statement.

Sorry, I did not mean to suggest that you set it up wrongly; I was just trying to point out that the test case itself may not be very practical. But given your recent numbers, the regression is clearly non-trivial and something we must address.

> I have changed the settings as recommended and did a quick test as above
> on our machine by removing the UNLOGGED word in the create table statement.
>
> Patch tested: only 0001_interesting_attrs_v18.patch in [2]
>
> The response time recorded shows a much higher increase in response
> time, from 10% to 25%, after the patch.

Thanks for repeating the tests. They are very useful. It might make sense to reverse the order, or do 6 tests each and alternate between patched and unpatched master, just to get rid of any other anomaly.

BTW, may I request another test with the attached patch? In this patch, we check PageIsFull() even before deciding which attributes to check for modification. If the page is already full, there is hardly any chance of doing a HOT update (there could be a corner case where the new tuple is smaller than the tuple used in the previous UPDATE and we have just enough space to do a HOT update this time, but I think that's too narrow).
Thanks,
Pavan

-- 
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

0001_interesting_attrs_v19.patch
Description: Binary data
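The v19 ordering described above (check PageIsFull() before any per-column comparison) can be sketched in miniature as follows. This is a toy model with invented names, not the actual heap_update() code; the point is only that a full page short-circuits the attribute checks entirely:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a "page" with a fullness hint and an instrumentation counter
 * recording how many columns were actually compared. */
typedef struct
{
    bool    page_is_full;   /* analogous to a PD_PAGE_FULL-style hint */
    int     ncols_checked;  /* columns compared so far */
} ToyPage;

/* Stand-in for the per-column modified-attribute check. */
static bool column_modified(int col)
{
    (void) col;
    return false;           /* pretend no indexed column changed */
}

/* Returns true if a HOT update can still be attempted. With v19's
 * ordering, a full page bails out before touching any column. */
static bool hot_update_possible(ToyPage *page, int natts)
{
    if (page->page_is_full)
        return false;       /* skip all column checks: HOT is unlikely */

    for (int i = 0; i < natts; i++)
    {
        page->ncols_checked++;
        if (column_modified(i))
            return false;
    }
    return true;
}
```

With a 9-column index like the one in the benchmark, the full-page path compares zero columns instead of nine, which is where the saving comes from.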
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 21, 2017 at 10:34 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Mar 21, 2017 at 12:49 PM, Bruce Momjian <br...@momjian.us> wrote:
> > On Tue, Mar 21, 2017 at 09:25:49AM -0400, Robert Haas wrote:
> >> On Tue, Mar 21, 2017 at 8:41 AM, Pavan Deolasee
> >> > TBH I see many artificial scenarios here. It will be very useful if he can
> >> > rerun the query with some of these restrictions lifted. I'm all for
> >> > addressing whatever we can, but I am not sure if this test demonstrates a
> >> > real world usage.
> >>
> >> That's a very fair point, but if these patches - or some of them - are
> >> going to get committed then these things need to get discussed. Let's
> >> not just have nothing-nothing-nothing, giant unagreed code drop.
> >
> > First, let me say I love this feature for PG 10, along with
> > multi-variate statistics.
> >
> > However, not to be a bummer on this, but the persistent question I have
> > is whether we are locking ourselves into a feature that can only do
> > _one_ index-change per WARM chain before a lazy vacuum is required. Are
> > we ever going to figure out how to do more changes per WARM chain in the
> > future, and is our use of so many bits for this feature going to
> > restrict our ability to do that in the future.
> >
> > I know we have talked about it, but not recently, and if everyone else
> > is fine with it, I am too, but I have to ask these questions.
>
> I think that's a good question. I previously expressed similar
> concerns. On the one hand, it's hard to ignore the fact that, in the
> cases where this wins, it already buys us a lot of performance
> improvement. On the other hand, as you say (and as I said), it eats
> up a lot of bits, and that limits what we can do in the future.

I think we can save a few bits, at some additional cost and/or complexity. It all depends on what matters to us more.
For example, we can choose not to use HEAP_LATEST_TUPLE bit and instead always find the root tuple the hard way. Since only WARM would ever need to find that information, may be it's ok since WARM's other advantage will justify that. Or we cache the information computed during earlier heap_prune_page call and use that (just guessing that we can make it work, no concrete idea at this moment). We can also save HEAP_WARM_UPDATED flag since this is required only for abort-handling case. We can find a way to push that information down to the old tuple if UPDATE aborts and we detect the broken chain. Again, not fully thought through, but doable. Of course, we will have to carefully evaluate all code paths and make sure that we don't lose that information ever. If the consumption of bits become a deal breaker then I would first trade the HEAP_LATEST_TUPLE bit and then HEAP_WARM_UPDATED just from correctness perspective. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 21, 2017 at 10:47 PM, Bruce Momjian <br...@momjian.us> wrote: > On Tue, Mar 21, 2017 at 01:04:14PM -0400, Robert Haas wrote: > > > I know we have talked about it, but not recently, and if everyone else > > > is fine with it, I am too, but I have to ask these questions. > > > > I think that's a good question. I previously expressed similar > > concerns. On the one hand, it's hard to ignore the fact that, in the > > cases where this wins, it already buys us a lot of performance > > improvement. On the other hand, as you say (and as I said), it eats > > up a lot of bits, and that limits what we can do in the future. On > > the one hand, there is a saying that a bird in the hand is worth two > > in the bush. On the other hand, there is also a saying that one > > should not paint oneself into the corner. > > > > I'm not sure we've had any really substantive discussion of these > > issues. Pavan's response to my previous comments was basically "well, > > I think it's worth it", which is entirely reasonable, because he > > presumably wouldn't have written the patch that way if he thought it > > sucked. But it might not be the only opinion. > > Early in the discussion we talked about allowing multiple changes per > WARM chain if they all changed the same index and were in the same > direction so there were no duplicates, but it was complicated. There > was also discussion about checking the index during INSERT/UPDATE to see > if there was a duplicate. However, those ideas never led to further > discussion. > Well, once I started thinking about how to do vacuum etc, I realised that any mechanism which allows unlimited (even handful) updates per chain is going to be very complex and error prone. But if someone has ideas to do that, I am open. I must say though, it will make an already complex problem even more complex. > > I know the current patch yields good results, but only on a narrow test > case, Hmm. 
I am kinda surprised you say that, because I never thought it was a narrow test case that we are targeting here. But maybe I'm wrong.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 21, 2017 at 6:55 PM, Robert Haas <robertmh...@gmail.com> wrote:
>
> I think that very wide columns and highly indexed tables are not
> particularly unrealistic, nor do I think updating all the rows is
> particularly unrealistic.

Ok. But would those who update 10M rows in a single transaction really notice a 5-10% variation? I think it probably makes sense to run those updates in smaller transactions and see if the regression is still visible (otherwise tweaking synchronous_commit is moot anyway).

> Sure, it's not everything, but it's
> something. Now, I would agree that all of that PLUS unlogged tables
> with fsync=off is not too realistic. What kind of regression would we
> observe if we eliminated those last two variables?

Hard to say. I didn't find any regression on the machines available to me even with the original test case that I used, which was a pretty bad case to start with (sure, Mithun tweaked it further to create an even worse scenario). Maybe on the kind of machines he has access to it might show up even with those changes.

> I think that whether the code ends up getting contorted is an
> important consideration here. For example, if the first of the things
> you mention can be done without making the code ugly, then I think
> that would be worth doing; it's likely to help fairly often in
> real-world cases. The problem with making the code contorted and
> ugly, as you say that the second idea would require, is that it can
> easily mask bugs.

Agreed. That's probably one reason why Alvaro wrote the patch to start with. I'll give the first of those two options a try.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 21, 2017 at 5:34 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Tue, Mar 21, 2017 at 6:56 AM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > >> Hmm, that test case isn't all that synthetic. It's just a single > >> column bulk update, which isn't anything all that crazy, and 5-10% > >> isn't nothing. > >> > >> I'm kinda surprised it made that much difference, though. > >> > > > > I think it is because heap_getattr() is not that cheap. We have > > noticed the similar problem during development of scan key push down > > work [1]. > > Yeah. So what's the deal with this? Is somebody working on figuring > out a different approach that would reduce this overhead? Are we > going to defer WARM to v11? Or is the intent to just ignore the 5-10% > slowdown on a single column update and commit everything anyway? I think I should clarify something. The test case does a single column update, but it also has columns which are very wide, has an index on many columns (and it updates a column early in the list). In addition, in the test Mithun updated all 10million rows of the table in a single transaction, used UNLOGGED table and fsync was turned off. TBH I see many artificial scenarios here. It will be very useful if he can rerun the query with some of these restrictions lifted. I'm all for addressing whatever we can, but I am not sure if this test demonstrates a real world usage. Having said that, may be if we can do a few things to reduce the overhead. - Check if the page has enough free space to perform a HOT/WARM update. If not, don't look for all index keys. - Pass bitmaps separately for each index and bail out early if we conclude neither HOT nor WARM is possible. In this case since there is just one index and as soon as we check the second column we know neither HOT nor WARM is possible, we will return early. It might complicate the API a lot, but I can give it a shot if that's what is needed to make progress. Any other ideas? 
Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
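The second idea above, passing per-index attribute bitmaps and bailing out as soon as neither HOT nor WARM is possible, might look roughly like this. This is a deliberately simplified, self-contained sketch with invented names: attributes are bits in a `uint32_t`, and the WARM criteria are reduced to "HOT if no index is touched, regular update if every index is touched" purely to show the early-exit shape, not the real rules:

```c
#include <assert.h>
#include <stdint.h>

typedef enum { UPDATE_REGULAR, UPDATE_WARM, UPDATE_HOT } UpdateKind;

/* modified and each index_attrs[i] are bitmasks over attribute numbers
 * (bit i set = attribute i touched / part of index i). Hypothetical
 * helper, not the real PostgreSQL API. */
static UpdateKind classify_update(uint32_t modified,
                                  const uint32_t *index_attrs, int nindexes)
{
    int nhit = 0;

    for (int i = 0; i < nindexes; i++)
    {
        if (index_attrs[i] & modified)
        {
            nhit++;
            /* Early bail-out: once every index needs a new entry,
             * neither HOT nor WARM can apply; stop checking. */
            if (nhit == nindexes)
                return UPDATE_REGULAR;
        }
    }
    if (nhit == 0)
        return UPDATE_HOT;      /* no indexed column modified */
    return UPDATE_WARM;         /* some, but not all, indexes affected */
}
```

With a single index whose key is modified, the loop exits on the very first hit, which mirrors the "bail out early" behaviour described above.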
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 14, 2017 at 7:17 AM, Alvaro Herrera wrote:
> > @@ -234,6 +236,21 @@ index_beginscan(Relation heapRelation,
> >     scan->heapRelation = heapRelation;
> >     scan->xs_snapshot = snapshot;
> >
> > +   /*
> > +    * If the index supports recheck, make sure that index tuple is saved
> > +    * during index scans.
> > +    *
> > +    * XXX Ideally, we should look at all indexes on the table and check if
> > +    * WARM is at all supported on the base table. If WARM is not supported
> > +    * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does
> > +    * do that and sets rd_supportswarm after looking at all indexes. But we
> > +    * don't know if the function was called earlier in the session when we're
> > +    * here. We can't call it now because there exists a risk of causing
> > +    * deadlock.
> > +    */
> > +   if (indexRelation->rd_amroutine->amrecheck)
> > +       scan->xs_want_itup = true;
> > +
> >     return scan;
> > }
>
> I didn't like this comment very much. But it's not necessary: you have
> already given relcache responsibility for setting rd_supportswarm. The
> only problem seems to be that you set it in RelationGetIndexAttrBitmap
> instead of RelationGetIndexList, but it's not clear to me why. I think
> if the latter function is in charge, then we can trust the flag more
> than the current situation.

I looked at this today. AFAICS we don't have access to rd_amroutine in RelationGetIndexList since we don't actually call index_open() in that function. Would it be safe to do that? I'll give it a shot, but thought of asking here first.

Thanks,
Pavan
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Mar 15, 2017 at 12:46 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Pavan Deolasee wrote: > > On Tue, Mar 14, 2017 at 7:17 AM, Alvaro Herrera < > alvhe...@2ndquadrant.com> > > wrote: > > > > I have already commented about the executor involvement in btrecheck(); > > > that doesn't seem good. I previously suggested to pass the EState down > > > from caller, but that's not a great idea either since you still need to > > > do the actual FormIndexDatum. I now think that a workable option would > > > be to compute the values/isnulls arrays so that btrecheck gets them > > > already computed. > > > > I agree with your complaint about modularity violation. What I am unclear > > is how passing values/isnulls array will fix that. The way code is > > structured currently, recheck routines are called by index_fetch_heap(). > So > > if we try to compute values/isnulls in that function, we'll still need > > access EState, which AFAIU will lead to similar violation. Or am I > > mis-reading your idea? > > You're right, it's still a problem. BTW I realised that we don't really need those executor bits in recheck routines. We don't support WARM when attributes in index expressions are modified. So we really don't need to do any comparison for those attributes. I've written a separate form of FormIndexDatum() which will only return basic index attributes and comparing them should be enough. Will share rebased and updated patch soon. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
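The restricted FormIndexDatum() variant described above, which returns only plain-column index attributes and skips expressions, can be sketched as follows. This is a toy model, not the real routine: heap attribute values are plain ints, the names are invented, and the zero-means-expression convention merely mirrors how PostgreSQL's IndexInfo marks expression columns:

```c
#include <assert.h>

#define SKIP_KEY (-1)

/* heap_values: 1-based attribute values of the heap tuple (toy ints).
 * keyattrs:    per index column, the heap attno (> 0) or 0 for an
 *              expression column.
 * out:         receives the value for plain columns, SKIP_KEY for
 *              expression columns (not compared, since WARM is disabled
 *              when expression attributes change).
 * Returns the number of plain keys copied. */
static int form_plain_index_datums(const int *heap_values,
                                   const int *keyattrs, int nkeys, int *out)
{
    int ncopied = 0;

    for (int i = 0; i < nkeys; i++)
    {
        if (keyattrs[i] > 0)
        {
            out[i] = heap_values[keyattrs[i] - 1];
            ncopied++;
        }
        else
            out[i] = SKIP_KEY;  /* expression column: skipped */
    }
    return ncopied;
}
```

The key observation from the email is that the skipped expression columns never need executor machinery (an EState) at all, which is what removes the modularity violation.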
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Mon, Mar 20, 2017 at 8:11 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sun, Mar 19, 2017 at 3:05 AM, Pavan Deolasee <pavan.deola...@gmail.com> wrote:
> > On Thu, Mar 16, 2017 at 12:53 PM, Robert Haas <robertmh...@gmail.com> wrote:
> >>
> >> /me scratches head.
> >>
> >> Aren't pre-warm and post-warm just (better) names for blue and red?
> >
> > Yeah, sounds better.
>
> My point here wasn't really about renaming, although I do think
> renaming is something that should get done. My point was that you
> were saying we need to mark index pointers as common, pre-warm, and
> post-warm. But you're pretty much already doing that, I think. I
> guess you don't have "common", but you do have "pre-warm" and
> "post-warm".

Ah, I misread that. Strictly speaking, we already have common (blue) and post-warm (red), and I just finished renaming them to CLEAR (of the WARM bit) and WARM. Maybe it's still not the best name, but I think it looks better than before.

But the larger point is that we don't have an easy way to know whether an index pointer which was inserted with the original heap tuple (i.e. pre-WARM update) should only return pre-WARM tuples or should also return post-WARM tuples. Right now we make that decision by looking at the index keys and discarding any pointer whose index key does not match the one created from the heap keys. If we need to change that, then at every WARM update we will have to go back to the original pointer and change its state to pre-warm. That looks more invasive and requires additional index management.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Mar 16, 2017 at 12:53 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Wed, Mar 15, 2017 at 3:44 PM, Pavan Deolasee > <pavan.deola...@gmail.com> wrote: > > I couldn't find a better way without a lot of complex infrastructure. > Even > > though we now have ability to mark index pointers and we know that a > given > > pointer either points to the pre-WARM chain or post-WARM chain, this does > > not solve the case when an index does not receive a new entry. In that > case, > > both pre-WARM and post-WARM tuples are reachable via the same old index > > pointer. The only way we could deal with this is to mark index pointers > as > > "common", "pre-warm" and "post-warm". But that would require us to update > > the old pointer's state from "common" to "pre-warm" for the index whose > keys > > are being updated. May be it's doable, but might be more complex than the > > current approach. > > /me scratches head. > > Aren't pre-warm and post-warm just (better) names for blue and red? > > Yeah, sounds better. Just to make it clear, the current design sets the following information: HEAP_WARM_TUPLE - When a row gets WARM updated, both old and new versions of the row are marked with HEAP_WARM_TUPLE flag. This allows us to remember that a certain row was WARM-updated, even if the update later aborts and we cleanup the new version and truncate the chain. All subsequent tuple versions will carry this flag until a non-HOT updates happens, which breaks the HOT chain. HEAP_WARM_RED - After first WARM update, the new version of the tuple is marked with this flag. This flag is also carried forward to all future HOT updated tuples. So the only tuple that has HEAP_WARM_TUPLE but not HEAP_WARM_RED is the old version before the WARM update. Also, all tuples marked with HEAP_WARM_RED flag satisfies HOT property (i.e. all index key columns share the same value). Similarly, all tuples NOT marked with HEAP_WARM_RED also satisfy HOT property. 
I've so far called them Red and Blue chains respectively. In addition, in the current patch, the new index pointers resulting from WARM updates are marked BTREE_INDEX_RED_POINTER/HASH_INDEX_RED_POINTER. I think per your suggestion we can change HEAP_WARM_RED to HEAP_WARM_TUPLE and similarly rename the index pointers to BTREE/HASH_INDEX_WARM_POINTER, and replace HEAP_WARM_TUPLE with something like HEAP_WARM_UPDATED_TUPLE to signify that this or some previous version of this chain was once WARM-updated. Does that sound ok? I can change the patch accordingly. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
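The flag semantics described above lend themselves to a quick sanity check. Below is a minimal Python simulation (not PostgreSQL code; the chain model and function names are invented for illustration) of the stated invariant: the only version carrying HEAP_WARM_TUPLE without HEAP_WARM_RED is the one just before the WARM update.

```python
HEAP_WARM_TUPLE = 0x1  # set on both old and new versions at the first WARM update
HEAP_WARM_RED   = 0x2  # set on the new version, carried to later HOT versions

def warm_update(chain):
    """WARM update: old version gets WARM_TUPLE; the new one gets both flags."""
    chain[-1] |= HEAP_WARM_TUPLE
    chain.append(HEAP_WARM_TUPLE | HEAP_WARM_RED)

def hot_update(chain):
    """HOT update: the new version simply inherits its predecessor's flags."""
    chain.append(chain[-1])

chain = [0]            # initial tuple, no flags
hot_update(chain)      # plain HOT update
warm_update(chain)     # first (and only allowed) WARM update
hot_update(chain)      # later HOT updates stay "red"

# Only the version just before the WARM update is WARM_TUPLE-but-not-RED.
pre_warm = [i for i, f in enumerate(chain)
            if f & HEAP_WARM_TUPLE and not f & HEAP_WARM_RED]
print(pre_warm)  # -> [1]
```

A second WARM update is never applied to such a chain in the design above, so the simulation only flips the flags once.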
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 14, 2017 at 7:16 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Pavan Deolasee wrote: > > On Tue, Mar 14, 2017 at 7:17 AM, Alvaro Herrera < > alvhe...@2ndquadrant.com> > > wrote: > > > > I have already commented about the executor involvement in btrecheck(); > > > that doesn't seem good. I previously suggested to pass the EState down > > > from caller, but that's not a great idea either since you still need to > > > do the actual FormIndexDatum. I now think that a workable option would > > > be to compute the values/isnulls arrays so that btrecheck gets them > > > already computed. > > > > I agree with your complaint about modularity violation. What I am unclear > > is how passing values/isnulls array will fix that. The way code is > > structured currently, recheck routines are called by index_fetch_heap(). > So > > if we try to compute values/isnulls in that function, we'll still need > > access EState, which AFAIU will lead to similar violation. Or am I > > mis-reading your idea? > > You're right, it's still a problem. (Honestly, I think the whole idea > of trying to compute a fake index tuple starting from a just-read heap > tuple is a problem in itself; Why do you think so? > I just wonder if there's a way to do the > recheck that doesn't involve such a thing.) > I couldn't find a better way without a lot of complex infrastructure. Even though we now have ability to mark index pointers and we know that a given pointer either points to the pre-WARM chain or post-WARM chain, this does not solve the case when an index does not receive a new entry. In that case, both pre-WARM and post-WARM tuples are reachable via the same old index pointer. The only way we could deal with this is to mark index pointers as "common", "pre-warm" and "post-warm". But that would require us to update the old pointer's state from "common" to "pre-warm" for the index whose keys are being updated. May be it's doable, but might be more complex than the current approach. 
> > > I wonder if we should instead invent something similar to IndexRecheck(), > > but instead of running ExecQual(), this new routine will compare the > index > > values by the given HeapTuple against given IndexTuple. ISTM that for > this > > to work we'll need to modify all callers of index_getnext() and teach > them > > to invoke the AM specific recheck method if xs_tuple_recheck flag is set > to > > true by index_getnext(). > > Yeah, grumble, that idea does sound intrusive, but perhaps it's > workable. What about bitmap indexscans? AFAICS we already have a > recheck there natively, so we only need to mark the page as lossy, which > we're already doing anyway. > Yeah, bitmap indexscans should be ok. We need recheck logic only to avoid duplicate scans and since a TID can only occur once in the bitmap, there is no risk for duplicate results. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
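The AM-specific recheck mooted here boils down to: recompute the index column values from the fetched heap tuple and compare them with the values stored in the index tuple, following the pointer only on a match. A hedged Python sketch of that comparison (the real btrecheck() deals in Datums and per-type equality; everything here is simplified and the names are invented):

```python
def recheck(index_cols, heap_tuple, index_tuple_values):
    """Return True only if the index entry matches what this heap tuple
    would produce for the indexed columns (i.e. the pointer is current)."""
    recomputed = tuple(heap_tuple[c] for c in index_cols)
    return recomputed == tuple(index_tuple_values)

# Index on column 1 (b).  Heap tuple after a WARM update of b:
heap_tuple = (1, 200, 'bar')
print(recheck([1], heap_tuple, (200,)))  # new entry matches -> True
print(recheck([1], heap_tuple, (100,)))  # stale entry -> False: skip it
```

This also shows why duplicate results cannot occur: of the two index pointers reaching the same chain, at most one passes the comparison for a given heap tuple.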
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 14, 2017 at 7:19 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Pavan Deolasee wrote: > > > BTW I wanted to share some more numbers from a recent performance test. I > > thought it's important because the latest patch has fully functional > chain > > conversion code as well as all WAL-logging related pieces are in place > > too. I ran these tests on a box borrowed from Tomas (thanks!). This has > > 64GB RAM and 350GB SSD with 1GB on-board RAM. I used the same test setup > > that I used for the first test results reported on this thread i.e. a > > modified pgbench_accounts table with additional columns and additional > > indexes (one index on abalance so that every UPDATE is a potential WARM > > update). > > > > In a test where table + indexes exceeds RAM, running for 8hrs and > > auto-vacuum parameters set such that we get 2-3 autovacuums on the table > > during the test, we see WARM delivering more than 100% TPS as compared to > > master. In this graph, I've plotted a moving average of TPS and the > spikes > > that we see coincides with the checkpoints (checkpoint_timeout is set to > > 20mins and max_wal_size large enough to avoid any xlog-based > checkpoints). > > The spikes are more prominent on WARM but I guess that's purely because > it > > delivers much higher TPS. I haven't shown here but I see WARM updates > close > > to 65-70% of the total updates. Also there is significant reduction in > WAL > > generated per txn. > > Impressive results. Labels on axes would improve readability of the chart > :-) > > Sorry about that. I was desperately searching for Undo button after hitting "send" for the very same reason :-) Looks like I used gnuplot after a few years. Just to make it clear, the X-axis is duration of tests in seconds and Y-axis is 450s moving average of TPS. BTW 450 is no magic figure. I collected stats every 15s and took a moving average of last 30 samples. 
Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Mar 14, 2017 at 7:17 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > > @@ -234,6 +236,21 @@ index_beginscan(Relation heapRelation, > > scan->heapRelation = heapRelation; > > scan->xs_snapshot = snapshot; > > > > + /* > > + * If the index supports recheck, make sure that index tuple is > saved > > + * during index scans. > > + * > > + * XXX Ideally, we should look at all indexes on the table and > check if > > + * WARM is at all supported on the base table. If WARM is not > supported > > + * then we don't need to do any recheck. > RelationGetIndexAttrBitmap() does > > + * do that and sets rd_supportswarm after looking at all indexes. > But we > > + * don't know if the function was called earlier in the session > when we're > > + * here. We can't call it now because there exists a risk of > causing > > + * deadlock. > > + */ > > + if (indexRelation->rd_amroutine->amrecheck) > > + scan->xs_want_itup = true; > > + > > return scan; > > } > > I didn't like this comment very much. But it's not necessary: you have > already given relcache responsibility for setting rd_supportswarm. The > only problem seems to be that you set it in RelationGetIndexAttrBitmap > instead of RelationGetIndexList, but it's not clear to me why. Hmm. I think you're right. Will fix that way and test. > > I noticed that nbtinsert.c and nbtree.c have a bunch of new includes > that they don't actually need. Let's remove those. nbtutils.c does > need them because of btrecheck(). Right. It's probably a left over from the way I wrote the first version. Will fix. Speaking of which: > > I have already commented about the executor involvement in btrecheck(); > that doesn't seem good. I previously suggested to pass the EState down > from caller, but that's not a great idea either since you still need to > do the actual FormIndexDatum. I now think that a workable option would > be to compute the values/isnulls arrays so that btrecheck gets them > already computed. 
I agree with your complaint about modularity violation. What I am unclear about is how passing the values/isnulls array will fix that. The way the code is structured currently, recheck routines are called by index_fetch_heap(). So if we try to compute values/isnulls in that function, we'll still need access to the EState, which AFAIU will lead to a similar violation. Or am I mis-reading your idea? I wonder if we should instead invent something similar to IndexRecheck(), but instead of running ExecQual(), this new routine will compare the index values computed from the given HeapTuple against the given IndexTuple. ISTM that for this to work we'll need to modify all callers of index_getnext() and teach them to invoke the AM-specific recheck method if the xs_tuple_recheck flag is set to true by index_getnext(). Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Skip all-visible pages during second HeapScan of CIC
On Wed, Mar 8, 2017 at 8:08 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Tue, Mar 7, 2017 at 9:12 PM, Andres Freund <and...@anarazel.de> wrote: > > > > > I wonder however, if careful snapshot management couldn't solve this as > > well. I have *not* thought a lot about this, but afaics we can easily > > prevent all-visible from being set in cases it'd bother us by having an > > "appropriate" xmin / registered snapshot. > > Yeah, but that's a tax on the whole system. > > Maybe we can do that only when patched CIC (or any other operation which may not like concurrent VM set) is in progress. We could use what Stephen suggested upthread to find that state. But right now it's hard to think because there is nothing on either side, so we don't know what gets impacted by aggressive VM set and how. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Skip all-visible pages during second HeapScan of CIC
On Wed, Mar 8, 2017 at 7:33 AM, Robert Haas wrote: > On Tue, Mar 7, 2017 at 4:26 PM, Stephen Frost wrote: > > Right, that's what I thought he was getting at and my general thinking > > was that we would need a way to discover if a CIC is ongoing on the > > relation and therefore heap_page_prune(), or anything else, would know > > that it can't twiddle the bits in the VM due to the ongoing CIC. > > Perhaps a lock isn't the right answer there, but it would have to be > > some kind of cross-process communication that operates at a relation > > level.. > > Well, I guess that's one option. I lean toward the position already > taken by Andres and Peter, namely, that it's probably not a great idea > to pursue this optimization. Fair point. I'm not going to "persist" with the idea too long. It seemed like a good, low-risk feature to me which can benefit certain use cases quite reasonably. It's not uncommon to create indexes (or reindex existing indexes to remove index bloat) on extremely large tables, and avoiding a second heap scan can hugely benefit such cases. Holding up the patch for something for which we don't even have a proposal yet seemed a bit strange at first, but I see the point. Anyways, for a recently vacuumed table of twice the size of RAM and on a machine with SSDs, the patched CIC version runs about 25% faster. That's probably the best-case scenario. I also realised that I'd attached a slightly older version of the patch. So new version attached, but I'll remove the patch from CF if I don't see any more votes in favour of the patch. Thanks, Pavan cic_skip_all_visible_v4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cleanup: avoid direct use of ip_posid/ip_blkid
On Thu, Mar 2, 2017 at 9:55 PM, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2/22/17 08:38, Pavan Deolasee wrote: > > One reason why these macros are not always used is because they > > typically do assert-validation to ensure ip_posid has a valid value. > > There are a few places in code, especially in GIN and also when we are > > dealing with user-supplied TIDs when we might get a TID with invalid > > ip_posid. I've handled that by defining and using separate macros which > > skip the validation. This doesn't seem any worse than what we are > > already doing. > > I wonder why we allow that. Shouldn't the tid type reject input that > has ip_posid == 0? > Yes, I think it seems sensible to disallow InvalidOffsetNumber (or > MaxOffsetNumber) in user-supplied value. But there are places in GIN and with INSERT ON CONFLICT where we seem to use special values in ip_posid to mean different things. So we might still need some way to accept invalid values there. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Skip all-visible pages during second HeapScan of CIC
Hello All, During the second heap scan of CREATE INDEX CONCURRENTLY, we're only interested in the tuples which were inserted after the first scan was started. All such tuples can only exist in pages which have their VM bit unset. So I propose the attached patch which consults the VM during the second scan and skips all-visible pages. We do the same trick of skipping pages only if a certain threshold of pages can be skipped, to ensure the OS's read-ahead is not disturbed. The patch obviously shows a significant reduction in time for building an index concurrently on very large tables which are not being updated frequently and which were vacuumed recently (so that VM bits are set). I can post performance numbers if there is interest. For tables that are being updated heavily, the threshold skipping was indeed useful and without that we saw a slight regression. Since VM bits are only set during VACUUM, which conflicts with CIC on the relation lock, I don't see any risk of incorrectly skipping pages that the second scan should have scanned. Comments? Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services cic_skip_all_visible_v3.patch Description: Binary data
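The skip-with-threshold policy described above (skip a run of all-visible pages only when the run is long enough not to defeat the OS read-ahead) can be sketched as follows. This is a Python illustration, not the patch itself; SKIP_THRESHOLD is an assumed constant, analogous in spirit to vacuum's SKIP_PAGES_THRESHOLD.

```python
SKIP_THRESHOLD = 32  # assumed value; the real patch picks its own threshold

def blocks_to_scan(all_visible):
    """Return the block numbers the second CIC heap scan must read,
    skipping runs of all-visible blocks only when the run is long enough."""
    scan, n, b = [], len(all_visible), 0
    while b < n:
        if all_visible[b]:
            run = b
            while run < n and all_visible[run]:
                run += 1            # measure the all-visible run
            if run - b >= SKIP_THRESHOLD:
                b = run             # long run: skip it entirely
                continue
            # short run: fall through and scan it to keep read-ahead intact
        scan.append(b)
        b += 1
    return scan

vm = [True] * 40 + [False] * 2 + [True] * 5 + [False]
print(len(blocks_to_scan(vm)))  # -> 8  (the 40-block run is skipped)
```

A freshly vacuumed, mostly static table (long all-visible runs) skips nearly everything; a heavily updated one (short runs) degrades gracefully to a full scan.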
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 9:47 PM, Robert Haas <robertmh...@gmail.com> wrote: > > And there are no bugs, right? :-) Yeah yeah absolutely nothing. Just like any other feature committed to Postgres so far ;-) I need to polish the chain conversion patch a bit and also add missing support for redo, hash indexes etc. Support for hash indexes will need overloading of ip_posid bits in the index tuple (since there are no free bits left in hash tuples). I plan to work on that next and submit a fully functional patch, hopefully before the commit-fest starts. (I have mentioned the idea of overloading ip_posid bits a few times now and haven't heard any objection so far. Well, that could either mean that nobody has read those emails seriously or there is general acceptance to that idea.. I am assuming latter :-)) Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 3:42 PM, Robert Haas <robertmh...@gmail.com> wrote: > > > Wow, OK. In my view, that makes the chain conversion code pretty much > essential, because if you had WARM without chain conversion then the > visibility map gets more or less irrevocably less effective over time, > which sounds terrible. Yes. I decided to complete the chain conversion patch when I realised that index-only scans will otherwise become completely useless if a large percentage of rows are updated just once. So I agree. It's not an optional patch and should get in with the main WARM patch. > But it sounds to me like even with the chain > conversion, it might take multiple vacuum passes before all visibility > map bits are set, which isn't such a great property (thus e.g. > fdf9e21196a6f58c6021c967dc5776a16190f295). > > The chain conversion algorithm first converts the chains during vacuum and then checks if the page can be set all-visible. So I'm not sure why it would take multiple vacuums before a page is set all-visible. The commit you quote was written to ensure that we make another attempt to set the page all-visible after all dead tuples are removed from the page. Similarly, we will convert all WARM chains to HOT chains and then check for all-visibility of the page. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 3:23 PM, Robert Haas <robertmh...@gmail.com> wrote: > > I don't immediately see how this will work with index-only scans. If > the tuple is HOT updated several times, HOT-pruned back to a single > version, and then the page is all-visible, the index entries are > guaranteed to agree with the remaining tuple, so it's fine to believe > the data in the index tuple. But with WARM, that would no longer be > true, unless you have some trick for that... > > Well the trick is to not allow index-only scans on such pages by not marking them all-visible. That's why when a tuple is WARM updated, we carry that information in the subsequent versions even when later updates are HOT updates. The chain conversion algorithm will handle this by clearing those bits and thus allowing index-only scans again. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 12:28 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Bruce Momjian wrote: > > On Thu, Feb 23, 2017 at 03:45:24PM -0300, Alvaro Herrera wrote: > > > > > and potentially trim the first HOT chain as those tuples become > > > > invisible. > > > > > > That can already happen even without WARM, no? > > > > Uh, the point is that with WARM those four early tuples can be removed > > via a prune, rather than requiring a VACUUM. Without WARM, the fourth > > tuple can't be removed until the index is cleared by VACUUM. > > I *think* that the WARM-updated one cannot be pruned either, because > it's pointed to by at least one index (otherwise it'd have been a HOT > update). The ones prior to that can be removed either way. > > No, even the WARM-updated one can be pruned, and if there are further HOT updates, those can be pruned too. All indexes, and even multiple pointers from the same index, are always pointing to the root of the WARM chain, and that line pointer does not go away unless the entire chain becomes dead. The only material difference between HOT and WARM is that since there are two index pointers from the same index to the same root line pointer, we must do a recheck. But HOT-pruning and all such things remain the same. Let's take an example. Say we have a table (a int, b int, c text) and two indexes on the first two columns.

        H                    W                    H
(1, 100, 'foo') -> (1, 100, 'bar') --> (1, 200, 'bar') -> (1, 200, 'foo')

The first update will be a HOT update, the second update will be a WARM update and the third update will again be a HOT update. The first and third updates do not create any new index entry, though the second update will create a new index entry in the second index. Any further WARM updates to this chain are not allowed, but further HOT updates are ok. If all but the last version become DEAD, HOT-prune will remove all of them and turn the first line pointer into a REDIRECT line pointer.
At this point, the first index has one index pointer and the second index has two index pointers, but all pointing to the same root line pointer, which has now become a REDIRECT line pointer.

Redirect o---> (1, 200, 'foo')

I think the part you want (be able to prune the WARM updated tuple) is > part of what Pavan calls "turning the WARM chain into a HOT chain", so > not part of the initial patch. Pavan can explain this part better, and > also set me straight in case I'm wrong in the above :-) > > Umm.. it's a bit different. Without chain conversion, we still don't allow further WARM updates to the above chain because that might create a third index pointer and our recheck logic can't cope with duplicate scans. HOT updates are allowed though. The latest patch that I proposed will handle this case and convert such chains into regular HOT-pruned chains. To do that, we must remove the duplicate (and now wrong) index pointer to the chain. Once we do that and change the state on the heap tuple, we can once again do a WARM update to this tuple. Note that in this example the chain has just one tuple, which will be the case typically, but the algorithm can deal with the case where there are multiple tuples but with matching index keys. Hope this helps. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
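The example chain above can be replayed with a small classifier: an update is HOT when no indexed column changes, WARM when one does and the chain has not yet seen a WARM update, and a regular chain-breaking update otherwise. A Python sketch under exactly those rules (column numbers and all names are invented for illustration):

```python
INDEXED = (0, 1)   # indexes on columns a and b, as in the example

def classify(old, new, chain_has_warm):
    """HOT if no indexed column changed; WARM if one did and the chain has
    no prior WARM update; otherwise a regular (chain-breaking) update."""
    changed = any(old[c] != new[c] for c in INDEXED)
    if not changed:
        return 'HOT'
    return 'WARM' if not chain_has_warm else 'NON-HOT'

chain = [(1, 100, 'foo')]
warm_done = False
for nxt in [(1, 100, 'bar'), (1, 200, 'bar'), (1, 200, 'foo')]:
    kind = classify(chain[-1], nxt, warm_done)
    warm_done = warm_done or kind == 'WARM'
    chain.append(nxt)
    print(kind)   # -> HOT, WARM, HOT
```

This reproduces the H, W, H labelling of the three updates; a fourth update that changed an indexed column would come out NON-HOT, until chain conversion resets the state.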
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 2:13 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Thu, Feb 23, 2017 at 9:21 PM, Bruce Momjian <br...@momjian.us> wrote: > > > > > I have what might be a stupid question. As I remember, WARM only allows > > a single index-column change in the chain. Why are you seeing such a > > large performance improvement? I would have thought it would be that > > high if we allowed an unlimited number of index changes in the chain. > > I'm not sure how the test case is set up. If the table has multiple > indexes, each on a different column, and only one of the indexes is > updated, then you figure to win because now the other indexes need > less maintenance (and get less bloated). If you have only a single > index, then I don't see how WARM can be any better than HOT, but maybe > I just don't understand the situation. > > That's correct. If you have just one index and the UPDATE modifies the indexed column, the UPDATE won't be a WARM update and the patch gives you no benefit. OTOH if the UPDATE doesn't modify any indexed columns, then it will be a HOT update and again the patch gives you no benefit. It might be worthwhile to see if the patch causes any regression in these scenarios, though I think it will be minimal or zero. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Feb 23, 2017 at 11:53 PM, Bruce Momjian <br...@momjian.us> wrote: > On Thu, Feb 23, 2017 at 03:03:39PM -0300, Alvaro Herrera wrote: > > Bruce Momjian wrote: > > > > > As I remember, WARM only allows > > > a single index-column change in the chain. Why are you seeing such a > > > large performance improvement? I would have thought it would be that > > > high if we allowed an unlimited number of index changes in the chain. > > > > The second update in a chain creates another non-warm-updated tuple, so > > the third update can be a warm update again, and so on. > > Right, before this patch they would be two independent HOT chains. It > still seems like an unexpectedly-high performance win. Are two > independent HOT chains that much more expensive than joining them via > WARM? In these tests, there are zero HOT updates, since every update modifies some index column. With WARM, we could reduce regular updates to half, even when we allow only one WARM update per chain (chain really has a single tuple for this discussion). IOW approximately half updates insert new index entry in *every* index and half updates insert new index entry *only* in affected index. That itself does a good bit for performance. So to answer your question: yes, joining two HOT chains via WARM is much cheaper because it results in creating new index entries just for affected indexes. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
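The claimed saving is easy to put in numbers. Assuming, as in the test described above, that every update changes exactly one indexed column and every other update can be WARM (one WARM update per chain), the index entries inserted work out as follows. A back-of-the-envelope Python sketch, not a measurement:

```python
def index_entries(updates, num_indexes, warm=False):
    """Index entries inserted when every update changes one indexed column
    and, with WARM, every other update can be a WARM update."""
    if not warm:
        return updates * num_indexes       # every update hits every index
    warm_updates = updates // 2            # one WARM update per chain
    regular = updates - warm_updates
    # regular updates hit every index; WARM updates hit only the affected one
    return regular * num_indexes + warm_updates * 1

print(index_entries(1000, 5))             # -> 5000
print(index_entries(1000, 5, warm=True))  # -> 3000
```

With five indexes that is a 40% cut in index insertions, which is consistent with the sizeable TPS and WAL improvements reported on this thread.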
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Feb 23, 2017 at 11:30 PM, Bruce Momjian <br...@momjian.us> wrote: > On Wed, Feb 1, 2017 at 10:46:45AM +0530, Pavan Deolasee wrote: > > > contains a WARM tuple. Alternate ideas/suggestions and review of > the > > design > > > are welcome! > > > > t_infomask2 contains one last unused bit, > > > > > > Umm, WARM is using 2 unused bits from t_infomask2. You mean there is > another > > free bit after that too? > > We are obviously going to use several heap or item pointer bits for > WARM, and once we do that it is going to be hard to undo that. Pavan, > are you saying you could do more with WARM if you had more bits? Are we > sure we have given you all the bits we can? Do we want to commit to a > lesser feature because the bits are not available? > > The btree implementation is as complete as I would like (there are a few TODOs, but no show stoppers), at least for the first release. There is a free bit in the btree index tuple header that I could use for chain conversion. In the heap tuples, I can reuse HEAP_MOVED_OFF because that bit will only be set along with the HEAP_WARM_TUPLE bit. Since none of the upgraded clusters can have the HEAP_WARM_TUPLE bit set, I think we are safe. WARM currently also supports hash indexes, but there is no free bit left in the hash index tuple header. I think I can work around that by using a bit from ip_posid (not yet implemented/tested, but seems doable). IMHO if we can do that, i.e. support btree and hash indexes to start with, we should be good to go for the first release. We can try to support other popular index AMs in the subsequent release. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Cleanup: avoid direct use of ip_posid/ip_blkid
Hello All, I would like to propose the attached patch which removes all direct references to the ip_posid and ip_blkid members of the ItemPointerData struct and instead uses the ItemPointerGetOffsetNumber and ItemPointerGetBlockNumber macros respectively to access these members. My motivation is to try to free up some bits from the ip_posid field, given that valid values need at most 13 bits. But even without that, it seems like a good cleanup to me. One reason why these macros are not always used is because they typically do assert-validation to ensure ip_posid has a valid value. There are a few places in the code, especially in GIN and also when we are dealing with user-supplied TIDs, where we might get a TID with an invalid ip_posid. I've handled that by defining and using separate macros which skip the validation. This doesn't seem any worse than what we are already doing. make check-world with --enable-tap-tests passes. Comments/suggestions? Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services remove_ip_posid_blkid_ref_v3.patch Description: Binary data
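The validating/non-validating accessor split, and the follow-on idea of freeing the high bits of ip_posid, can be modelled like this. A Python sketch only; the masks and names here are invented and are not PostgreSQL's (the point is that a page offset needs at most 13 of the field's 16 bits, which is what leaves room for flags):

```python
OFFSET_MASK = 0x1FFF          # low 13 bits: the offset number proper
FLAG_SHIFT  = 13              # top 3 bits become available for flags

def get_offset(posid):
    """Validating accessor: reject an invalid (zero) offset."""
    off = posid & OFFSET_MASK
    assert off != 0, "invalid ip_posid"
    return off

def get_offset_no_check(posid):
    """Raw accessor for callers (GIN, user-supplied TIDs) that legitimately
    see special or invalid values."""
    return posid & OFFSET_MASK

def set_flags(posid, flags):
    """Pack flag bits above the offset without disturbing it."""
    return (posid & OFFSET_MASK) | (flags << FLAG_SHIFT)

p = set_flags(42, 0b101)
print(get_offset(p), p >> FLAG_SHIFT)  # -> 42 5
```

Once every reader goes through an accessor that masks off the high bits, repurposing those bits becomes a local change instead of a tree-wide one, which is the point of the cleanup.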
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
Hi Tom, On Wed, Feb 1, 2017 at 3:51 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > > (I'm a little more concerned by Alvaro's apparent position that WARM > is a done deal; I didn't think so. Are there any specific aspects of the design that you're not comfortable with? I'm sure there could be some rough edges in the implementation that I'm hoping will get handled during the further review process. But if there are some obvious things I'm overlooking please let me know. Probably the same question to Andres/Robert who has flagged concerns. On my side, I've run some very long tests with data validation and haven't found any new issues with the most recent patches. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 19, 2017 at 3:43 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > > > Ah, nah, scratch that. If any post-index-build pruning had occurred on a > > page, the evidence would be gone --- the non-matching older tuples would > > be removed and what remained would look consistent. But it wouldn't > > match the index. You might be able to find problems if you were willing > > to do the expensive check on *every* HOT chain, but that seems none too > > practical. > > If the goal is just to detect tuples that aren't indexed and should > be, what about scanning each index and building a TIDBitmap of all of > the index entries, and then scanning the heap for non-HOT tuples and > probing the TIDBitmap for each one? If you find a heap entry for > which you didn't find an index entry, go and recheck the index and see > if one got added concurrently. If not, whine. > > This particular case of corruption results in a heap tuple getting indexed by a wrong key (or to be precise, indexed by its old value). So the only way to detect the corruption is to look at each index key and check if it matches with the corresponding heap tuple. We could write some kind of self join that can use a sequential scan and an index-only scan (David Rowley had suggested something of that sort internally here), but we can't guarantee index-only scan on a table which is being concurrently updated. So not sure even that will catch every possible case. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
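As the reply notes, a pure presence check cannot catch this corruption, because the stale entry exists but carries the wrong key; it only finds missing entries. For completeness, the shape of the presence check Robert sketched is simple. A Python illustration with invented names (the real thing would use an actual TIDBitmap and an index rescan for the concurrent-insert case):

```python
def find_unindexed(index_tids, heap_tids, recheck_index):
    """Report non-HOT heap tuples with no index entry; recheck the index
    once per miss in case an entry was added concurrently."""
    indexed = set(index_tids)
    missing = []
    for tid in heap_tids:
        if tid not in indexed and not recheck_index(tid):
            missing.append(tid)
    return missing

heap = [(0, 1), (0, 2), (0, 3)]      # (block, offset) pairs
index = [(0, 1), (0, 3)]             # (0, 2) never got indexed
print(find_unindexed(index, heap, lambda tid: False))  # -> [(0, 2)]
```

Catching wrong-key entries instead would require comparing each index key with its heap tuple, which is the key-matching check discussed above.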
Re: [HACKERS] [COMMITTERS] pgsql: Release note updates.
On Tue, Feb 7, 2017 at 9:29 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Alvaro Herrera wrote: > > > > > Hmm. Now that I think about it, it is probably possible to have a > > transaction started before CIC that inserted a bunch of rows, and then > > runs the UPDATE during the CIC race window. Maybe there's a reason the > > bug wouldn't hit in that case but I don't see it, and I'm not able to > > test it right now to verify. > > Pavan adds that it's possible to have a transaction do INSERT while CIC > is running, then some other transaction executes the UPDATE. > > Just to elaborate on that, once a backend ends up with stale cached bitmaps, AFAICS only a second relcache flush can correct that. This may not come for a long time. In fact, since CIC is already holding a lock on the table, I think the second relcache flush will only happen at the end of phase 2. This can take a long time, especially for very large tables. In the meanwhile, the compromised backend can run many transactions with the stale information. Those transactions can see not only the existing rows, but also new rows inserted by other new but now committed-good transactions. It's all theory and I haven't had time to try this out. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [COMMITTERS] pgsql: Release note updates.
On Tue, Feb 7, 2017 at 5:23 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Tom Lane wrote: > > Release note updates. > > > > Add item for last-minute CREATE INDEX CONCURRENTLY fix. > > Hi, > > Sorry for not noticing earlier, but there is a bug in the notes: > > + If CREATE INDEX CONCURRENTLY was used to build an index > + that depends on a column not previously indexed, then rows inserted > + or updated by transactions that ran concurrently with > + the CREATE INDEX command could have received incorrect > + index entries. > > CIC bug does not affect inserted rows, only updated rows, since the > bogus bitmap is only used to compute whether to omit index tuples for > HOT considerations. > That's correct. > > Also, the bollixed rows do not receive incorrect index entries -- they > just do not receive any index entry at all. > > I think it's somewhat both. While it's true that the updated rows may not get new index entries as they deserve, they will be reachable from the older index entries. So while a query such as "SELECT * FROM tab WHERE key = newval" may not return any result, "SELECT * FROM tab WHERE key = oldval" may actually return the updated (and wrong) tuple. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 11:54 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > After some discussion among the release team, we've concluded that the > best thing to do is to push Pavan's/my patch into today's releases. > This does not close the matter by any means: we should continue to > study whether there are related bugs or whether there's a more principled > way of fixing this bug. But that patch clearly makes things better, > and we shouldn't let worries about whether there are more bugs stop > us from providing some kind of fix to users. > > I've made the push, and barring negative reports from the buildfarm, > it will be in today's releases. > Thank you for taking care of it. Buildfarm is looking green until now. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee <pavan.deola...@gmail.com> wrote: > >> > I like this approach. I'll run the patch on a build with > CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. > While it looks certain that the fix will miss this point release, FWIW I ran this patch with CLOBBER_CACHE_ALWAYS and it seems to be working as expected. It's hard to run all the tests, but a small subset of the regression and test_decoding tests seems ok. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > Hmm. Consider that the first time relcache invalidation occurs while > computing id_attrs, so now the retry logic will compute the correct > set of attrs (considering two indexes, if we take the example given by > Alvaro above.). I don't quite get that. Since rd_idattr must already be cached at that point and we don't expect to process a relcache flush between successive calls to RelationGetIndexAttrBitmap(), we should return a consistent copy of rd_idattr. But maybe I am missing something. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund <and...@anarazel.de> wrote: > On 2017-02-05 12:51:09 -0500, Tom Lane wrote: > > Michael Paquier <michael.paqu...@gmail.com> writes: > > > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule <pavel.steh...@gmail.com> > wrote: > > >> I agree with Pavan - a release with known important bug is not good > idea. > > > > > This bug has been around for some time, so I would recommend taking > > > the time necessary to make the best fix possible, even if it means > > > waiting for the next round of minor releases. > > > > I think the way to think about this sort of thing is, if we had found > > this bug when a release wasn't imminent, would we consider it bad enough > > to justify an unscheduled release cycle? I have to think the answer for > > this one is "probably not". > > +1. I don't think we serve our users by putting out a nontrivial bugfix > hastily. Nor do I think we serve them in this instance by delaying the > release to cover this fix; there's enough other fixes in the release > imo. > > I'm a bit surprised by this position. What do we tell our users now that we know this bug exists? They can't fully trust CIC and they don't have any mechanism to confirm whether the newly built index is corruption-free or not. I'm not suggesting that we should hastily refactor the code, but at the very least some sort of band-aid fix which helps the situation, yet has minimal side-effects, is warranted. I believe the proposed retry patch does exactly that. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee <pavan.deola...@gmail.com> > wrote: > > > > > > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> > >> > >> > >> > 2. In the second patch, we tried to recompute attribute lists if a > >> > relcache > >> > flush happens in between and index list is invalidated. We've seen > >> > problems > >> > with that, especially it getting into an infinite loop with > >> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently > relcache > >> > flushes keep happening. > >> > >> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache > flush > >> happen whenever it possibly could. > > > > > > Ah, ok. That explains why the retry logic as originally proposed was > causing > > an infinite loop. > > > >> > >> The way to deal with that without > >> looping is to test whether the index set *actually* changed, not whether > >> it just might have changed. > > > > > > I like this approach. I'll run the patch on a build with > > CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the > > fact that retry logic is now limited to only when index set changes, > which > > fits in the situation we're dealing with. > > > > Don't you think it also has the same problem as mentioned by me in > Alvaro's patch and you also agreed on it? No, I don't think so. There can't be any more relcache flushes occurring between the first RelationGetIndexAttrBitmap() and the subsequent RelationGetIndexAttrBitmap() calls. The problem with the earlier proposed approach was that we were not caching, yet returning stale information. That implied the next call would again recompute a possibly different value. The retry logic is trying to close a small yet important race condition where index set invalidation happens, but we cache stale information. 
> Do you see any need to fix > it locally in > RelationGetIndexAttrBitmap(), why can't we clear at transaction end? > > We're trying to fix it in RelationGetIndexAttrBitmap() because that's where the race exists. I'm not saying there aren't other, better ways of handling it, but I don't know yet (I haven't studied Andres's proposal). Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan <p...@bowt.ie> wrote: > On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas <robertmh...@gmail.com> wrote: > > I don't think this kind of black-and-white thinking is very helpful. > > Obviously, data corruption is bad. However, this bug has (from what > > one can tell from this thread) been with us for over a decade; it must > > necessarily be either low-probability or low-severity, or somebody > > would've found it and fixed it before now. Indeed, the discovery of > > this bug was driven by new feature development, not a user report. It > > seems pretty clear that if we try to patch this and get it wrong, the > > effects of our mistake could easily be a lot more serious than the > > original bug. > > +1. The fact that it wasn't driven by a user report convinces me that > this is the way to go. > > I'm not sure that the bug not being reported by a user makes it any less critical. As Tomas pointed out down-thread, the nature of the bug is such that users may not discover it very easily, but that doesn't mean it couldn't be affecting them all the time. We can now correlate many past reports of index corruption to this bug, but we don't have evidence to prove that. Lack of any good tool or built-in checks probably makes it even harder. The reason why I discovered this with WARM is because it now has built-in recheck logic, which was discarding some tuples returned by the index scan. It took me lots of code review and inspection to finally conclude that this must be an existing bug. Even then, for lack of any utility, I could not detect this easily with master. That doesn't mean I wasn't able to reproduce it, but detection was a problem. If you look at the reproduction setup, one in every 10, if not 5, CREATE INDEX CONCURRENTLY runs ends up creating a corrupt index. That's a good 10% probability. And this is on a low end laptop, with just 5-10 concurrent clients running. 
Probability of hitting the problem will be much higher on a bigger machine, with many users (on a decent AWS machine, I would find every third CIC to be corrupt). Moreover the setup is not doing anything extraordinary. Just concurrent updates which change between HOT to non-HOT and a CIC. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > > > > 2. In the second patch, we tried to recompute attribute lists if a > relcache > > flush happens in between and index list is invalidated. We've seen > problems > > with that, especially it getting into an infinite loop with > > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache > > flushes keep happening. > > Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush > happen whenever it possibly could. Ah, ok. That explains why the retry logic as originally proposed was causing an infinite loop. > The way to deal with that without > looping is to test whether the index set *actually* changed, not whether > it just might have changed. > I like this approach. I'll run the patch on a build with CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the fact that the retry logic is now limited to only when the index set changes, which fits the situation we're dealing with. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > > Based on Pavan's comments, I think trying to force this into next week's > releases would be extremely unwise. If the bug went undetected this long, > it can wait for a fix for another three months. Yes, I think the bug has existed ever since the code was written and went unnoticed. One reason could be that the race happens only when the new index turns HOT updates into non-HOT updates. Another reason could be that we don't have checks in place to catch these kinds of corruption. Having said that, since we have discovered the bug, at least many 2ndQuadrant customers have expressed worry and want to know if the fix will be available in 9.6.2 and other minor releases. Since the bug can lead to data corruption, the worry is justified. Until we fix the bug, there will be a constant worry about using CIC. If we can have some kind of band-aid fix to plug the hole, that might be enough as well. I tested my first patch (which will need some polishing) and that works well AFAIK. I was worried about prepared queries and all, but that seems ok too. RelationGetIndexList() always gets called within ExecInitModifyTable. The fix seems quite unlikely to cause any other side effects. Another possible band-aid is to throw another relcache invalidation in CIC, like adding a dummy index_set_state_flags() within yet another transaction. Seems ugly, but it should fix the problem for now and not cause any impact on the relcache mechanism whatsoever. > That seems better than > risking new breakage when it's barely 48 hours to the release wrap > deadline. We do not have time to recover from any mistakes. > I'm not sure what the project policies are, but can we consider delaying the release by a week for issues like these? Or do you think it will be hard to come up with a proper fix for the issue and it will need some serious refactoring? Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sat, Feb 4, 2017 at 12:10 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > If we do above, then I think primary key attrs won't be returned > because for those we are using relation copy rather than an original > working copy of attrs. See code below: > > switch (attrKind) > { > .. > case INDEX_ATTR_BITMAP_PRIMARY_KEY: > return bms_copy(relation->rd_pkattr); > > I don't think that would be a problem, since if relation->rd_indexattr is NULL, primary key attrs will be recomputed and returned. > > Apart from above, I think after the proposed patch, it will be quite > possible that in heap_update(), hot_attrs, key_attrs and id_attrs are > computed using different index lists (consider relcache flush happens > after computation of hot_attrs or key_attrs) which was previously not > possible. I think in the later code we assume that all the three > should be computed using the same indexlist. For ex. see the below > code: > This seems like a real problem to me. I don't know what the consequences are, but definitely having the various attribute lists reflect different views of the set of indexes doesn't seem right. The problem we are trying to fix is very clear by now. A relcache flush clears rd_indexvalid and rd_indexattr together, but because of the race, rd_indexattr is recomputed with the old information and gets cached again, while rd_indexvalid (and rd_indexlist) remains unset. That leads to the index corruption in CIC and may cause other issues too that we're not aware of right now. We want cached attribute lists to become invalid whenever the index list is invalidated, and that's not happening right now. So we have tried three approaches so far. 1. Simply reset rd_indexattr whenever we detect rd_indexvalid has been reset. This was the first patch. It's a very simple patch and has worked for me with CLOBBER_CACHE_ALWAYS and even fixes the CIC bug. The only downside, it seems, is that we're invalidating cached attribute lists outside the normal relcache invalidation path. 
It also works on the premise that RelationGetIndexList() will be called every time before RelationGetIndexAttrBitmap() is called, otherwise we could still end up using stale cached copies. I wonder if cached plans can be a problem here. 2. In the second patch, we tried to recompute attribute lists if a relcache flush happens in between and the index list is invalidated. We've seen problems with that, especially it getting into an infinite loop with CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache flushes keep happening. 3. We tried the third approach where we don't cache attribute lists if we know that the index set has changed, and hope that the next caller gets the correct info. But Amit rightly identified problems with that approach too. So we either need to find a 4th approach or stay with the first patch unless we see a problem there too (cached plans, anything else). Or maybe fix CREATE INDEX CONCURRENTLY so that at least the first phase, which adds the index entry to the catalog, happens with a strong lock. This will lead to momentary blocking of write queries, but protect us from the race. I'm assuming this is the only code that can cause the problem, and I haven't checked other potential hazards. Ideas/suggestions? Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Thu, Feb 2, 2017 at 10:14 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > > > I'm going to study the bug a bit more, and put in some patch before the > upcoming minor tag on Monday. > > Looking at the history and some past discussions, it seems Tomas reported a somewhat similar problem and Andres proposed a patch here https://www.postgresql.org/message-id/20140514155204.ge23...@awork2.anarazel.de which got committed via b23b0f5588d2e2. Not exactly the same issue, but related to a relcache flush happening in index_open(). I wonder if we should just add recheck logic in RelationGetIndexAttrBitmap() such that if rd_indexvalid is seen as 0 at the end, we go back and start again from RelationGetIndexList(). Or is there any code path that relies on finding the old information? The following comment at the top of RelationGetIndexAttrBitmap() also seems like a problem to me. CIC works with a lower-strength lock, and hence it can change the set of indexes even though the caller is holding RowExclusiveLock, no? The comment seems to suggest otherwise. 4748 * Caller had better hold at least RowExclusiveLock on the target relation 4749 * to ensure that it has a stable set of indexes. This also makes it safe 4750 * (deadlock-free) for us to take locks on the relation's indexes. I've attached a revised patch based on these thoughts. It looks a bit more invasive than the earlier one-liner, but looks better to me. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services invalidate_indexattr_v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Thu, Feb 2, 2017 at 6:14 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > > /* > + * If the index list was invalidated, we better also invalidate the index > + * attribute list (which should automatically invalidate other attributes > + * such as primary key and replica identity) > + */ > > + relation->rd_indexattr = NULL; > + > > I think setting directly to NULL will leak the memory. > Good catch, thanks. Will fix. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Jan 30, 2017 at 7:20 PM, Pavan Deolasee <pavan.deola...@gmail.com> wrote: > > Based on my investigation so far and the evidence that I've collected, > what probably happens is that after a relcache invalidation arrives at the > new transaction, it recomputes the rd_indexattr but based on the old, > cached rd_indexlist. So the rd_indexattr does not include the new columns. > Later rd_indexlist is updated, but the rd_indexattr remains what it was. > > I was kinda puzzled this report did not get any attention, though it's clearly a data corruption issue. Anyways, now I know why this is happening and can reproduce this very easily via a debugger and two sessions. The offending code is RelationGetIndexAttrBitmap(). Here is the sequence of events: Taking the same table as in the report: CREATE TABLE cic_tab_accounts ( aid bigint UNIQUE, abalance bigint, aid1 bigint ); 1. Session S1 calls DROP INDEX pgb_a_aid1 and completes 2. Session S2 is in the process of rebuilding its rd_indexattr bitmap (because of the previous relcache invalidation caused by DROP INDEX). To be precise, assume that the session has finished rebuilding its index list. Since DROP INDEX has just happened, 4793 indexoidlist = RelationGetIndexList(relation); 3. S1 then starts CREATE INDEX CONCURRENTLY pgb_a_aid1 ON cic_tab_accounts(aid1) 4. S1 creates the catalog entry for the new index, commits the phase 1 transaction, builds the MVCC snapshot for the second phase and also finishes the initial index build 5. S2 now proceeds. Now notice that S2 had computed the index list based on the old view, i.e. just one index. The following comments in relcache.c are now significant: 4799 /* 4800 * Copy the rd_pkindex and rd_replidindex value computed by 4801 * RelationGetIndexList before proceeding. This is needed because a 4802 * relcache flush could occur inside index_open below, resetting the 4803 * fields managed by RelationGetIndexList. 
(The values we're computing 4804 * will still be valid, assuming that caller has a sufficient lock on 4805 * the relation.) 4806 */ That's precisely what goes wrong. 6. When index_open is called, the relcache invalidation generated by the first transaction commit in CIC gets accepted and processed. As part of this invalidation, rd_indexvalid is set to 0, which invalidates rd_indexlist too. 7. But S2 is currently using the index list obtained at step 2 above (which has only the old index). It goes ahead and computes rd_indexattr based on just the old index. 8. While the relcache invalidation processing at step 6 cleared rd_indexattr (along with rd_indexvalid), it is again set at 4906 relation->rd_indexattr = bms_copy(indexattrs); But what gets set at step 8 is based on the old view. There is no way rd_indexattr will be invalidated until S2 receives another relcache invalidation. This will happen when the CIC proceeds. But until then, every update done by S2 will generate broken HOT chains, leading to data corruption. I can reproduce this entire scenario using gdb sessions. This also explains why the patch I sent earlier helps to solve the problem. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Feb 2, 2017 at 3:49 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvhe...@2ndquadrant.com> writes: > > Tom Lane wrote: > >> I think what we ought to do about this is invent additional API > >> functions, say > >> > >> Oid CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup, > >> CatalogIndexState indstate); > >> void CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid, > >> HeapTuple tup, CatalogIndexState indstate); > >> > >> and use these in place of simple_heap_foo plus CatalogIndexInsert > >> in the places where this optimization had been applied. > > > This looks reasonable enough to me. > > Done. > > Thanks for taking care of this. Shame that I missed this because I'd specifically noted the special casing for large objects etc. But looks like while changing 180+ call sites, I forgot my notes. Thanks again, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Jan 31, 2017 at 7:21 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > > > > +#define HeapTupleHeaderGetNextTid(tup, next_ctid) \ > > > +do { \ > > > + AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \ > > > + ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \ > > > +} while (0) > > > > > Actually, I think this macro could just return the TID so that it can > be > > > used as struct assignment, just like ItemPointerCopy does internally -- > > > callers can do > > >ctid = HeapTupleHeaderGetNextTid(tup); > > > > While I agree with your proposal, I wonder why we have ItemPointerCopy() > in > > the first place because we freely copy TIDs as struct assignment. Is > there > > a reason for that? And if there is, does it impact this specific case? > > I dunno. This macro is present in our very first commit d31084e9d1118b. > Maybe it's an artifact from the Lisp to C conversion. Even then, we had > some cases of iptrs being copied by struct assignment, so it's not like > it didn't work. Perhaps somebody envisioned that the internal details > could change, but that hasn't happened in two decades so why should we > worry about it now? If somebody needs it later, it can be changed then. > > May I suggest in that case that we apply the attached patch which removes all references to ItemPointerCopy and its definition as well? This will avoid confusion in the future too. No issues noticed in regression tests. > > There is one issue that bothers me. The current implementation lacks > the > > ability to convert WARM chains into HOT chains. The README.WARM has some > > proposal to do that. But it requires an additional free bit in the tuple > header > > (which we don't have) and of course, it needs to be vetted and > implemented. > > If the heap ends up with many WARM tuples, then index-only scans will > > become ineffective, because an index-only scan cannot skip a heap page if > it > > contains a WARM tuple. Alternate ideas/suggestions and review of the > design > > are welcome! 
> > t_infomask2 contains one last unused bit, Umm, WARM is using 2 unused bits from t_infomask2. You mean there is another free bit after that too? > and we could reuse vacuum > full's bits (HEAP_MOVED_OUT, HEAP_MOVED_IN), but that will need some > thinking ahead. Maybe now's the time to start versioning relations so > that we can ensure clusters upgraded to pg10 do not contain any of those > bits in any tuple headers. > Yeah, IIRC old VACUUM FULL was removed in 9.0, which is a good six years old. Obviously, there is still a chance that a pre-9.0 binary-upgraded cluster exists and upgrades to 10. So we still need to do something about those bits if we reuse them. I'm surprised to see that we don't have any mechanism in place to clear those bits. So maybe we should add something to do that. I had some other ideas (and a patch too) to reuse bits from t_ctid.ip_posid, given that offset numbers can be represented in just 13 bits, even with the maximum block size. I can look at that if it comes to finding more bits. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services remove_itempointercopy.patch Description: Binary data
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Jan 31, 2017 at 7:37 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Pavan Deolasee wrote: > > > > > Sounds good. Should I submit that as a separate patch on current master? > > Yes, please. > > Attached. Two new APIs added. - CatalogInsertHeapAndIndex, which does a simple_heap_insert followed by catalog index updates - CatalogUpdateHeapAndIndex, which does a simple_heap_update followed by catalog index updates Only a handful of callers of simple_heap_insert/update remain after this patch. They are typically working with already opened indexes and hence I left them unchanged. make check-world passes with the patch. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services catalog_update.patch Description: Binary data
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Tue, Jan 31, 2017 at 7:21 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Pavan Deolasee wrote: > > On Thu, Jan 26, 2017 at 2:38 AM, Alvaro Herrera < > alvhe...@2ndquadrant.com> > > wrote: > > > > The simple_heap_update + CatalogUpdateIndexes pattern is getting > > > obnoxious. How about creating something like catalog_heap_update which > > > does both things at once, and stop bothering each callsite with the > WARM > > > stuff? > > > > What I realised that there are really 2 patterns: > > 1. simple_heap_insert, CatalogUpdateIndexes > > 2. simple_heap_update, CatalogUpdateIndexes > > > > There are only couple of places where we already have indexes open or > have > > more than one tuple to update, so we call CatalogIndexInsert directly. > What > > I ended up doing in the attached patch is add two new APIs which combines > > the two steps of each of these patterns. It seems much cleaner to me and > > also less buggy for future users. I hope I am not missing a reason not to > > do combine these steps. > > CatalogUpdateIndexes was just added as a convenience function on top of > a very common pattern. If we now have a reason to create a second one > because there are now two very common patterns, it seems reasonable to > have two functions. I think I would commit the refactoring to create > these functions ahead of the larger WARM patch, since I think it'd be > bulky and largely mechanical. (I'm going from this description; didn't > read your actual code.) > Sounds good. Should I submit that as a separate patch on current master? Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Jan 26, 2017 at 2:38 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Looking at your 0002 patch now. It no longer applies, but the conflicts > are trivial to fix. Please rebase and resubmit. > > Thanks. > > Maybe I will spend some time trying to > convert it to Perl using PostgresNode. > > Agreed. I put together a test harness to hammer the WARM code as much as we can. This harness has already discovered some bugs, especially around the index creation part. It also discovered one outstanding bug in master, so it's been useful. But I agree it should be rewritten using Perl. > > I think having the "recheck" index methods create an ExecutorState looks > out of place. How difficult is it to pass the estate from the calling > code? > I couldn't find an easy way given the place where recheck is required. Can you suggest something? > > IMO heap_get_root_tuple_one should be called just heap_get_root_tuple(). > That function and its plural sibling heap_get_root_tuples() should > indicate in their own comments what the expectations are regarding the > root_offsets output argument, rather than deferring to the comments in > the "internal" function, since they differ on that point; for the rest > of the invariants I think it makes sense to say "Also see the comment > for heap_get_root_tuples_internal". I wonder if heap_get_root_tuple > should just return the ctid instead of assigning the value to a > passed-in pointer, i.e. > OffsetNumber > heap_get_root_tuple(Page page, OffsetNumber target_offnum) > { > OffsetNumber off; > heap_get_root_tuples_internal(page, target_offnum, &off); > return off; > } > > Yes, all of that makes sense. Will fix. > > The simple_heap_update + CatalogUpdateIndexes pattern is getting > obnoxious. How about creating something like catalog_heap_update which > does both things at once, and stop bothering each callsite with the WARM > stuff? 
In fact, given that CatalogUpdateIndexes is used in other > places, maybe we should leave its API alone and create another function, > so that we don't have to change the many places that only do > simple_heap_insert. (Places like OperatorCreate which do either insert > or update could just move the index update call into each branch.) > > What I ended up doing is adding two new APIs: - CatalogUpdateHeapAndIndex - CatalogInsertHeapAndIndex I could replace almost all occurrences of simple_heap_update + CatalogUpdateIndexes with the first API and simple_heap_insert + CatalogUpdateIndexes with the second API. This looks like a good improvement to me anyways, since there are about 180 places where these functions are called in almost the same pattern. Maybe it will also avoid a bug when someone forgets to update the index after inserting/updating the heap. > . > I wonder if heap_hot_search_buffer() and heap_hot_search() should return > a tri-valued enum instead of boolean; that idea looks reasonable in > theory but callers have to do more work afterwards, so maybe not. > > Ok. I'll try to rearrange it a bit. Maybe we should just have one API after all? There are only a few callers of these APIs. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Jan 25, 2017 at 10:06 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > Reading 0001_track_root_lp_v9.patch again: > > Thanks for the review. > > +/* > > + * We use the same HEAP_LATEST_TUPLE flag to check if the tuple's > t_ctid field > > + * contains the root line pointer. We can't use the same > > + * HeapTupleHeaderIsHeapLatest macro because that also checks for > TID-equality > > + * to decide whether a tuple is at the end of the chain > > + */ > > +#define HeapTupleHeaderHasRootOffset(tup) \ > > +( \ > > + ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \ > > +) > > > > +#define HeapTupleHeaderGetRootOffset(tup) \ > > +( \ > > + AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \ > > + ItemPointerGetOffsetNumber(&(tup)->t_ctid) \ > > +) > > Interesting stuff; it took me a bit to see why these macros are this > way. I propose the following wording which I think is clearer: > > Return whether the tuple has a cached root offset. We don't use > HeapTupleHeaderIsHeapLatest because that one also considers the slow > case of scanning the whole block. > Umm, not scanning the whole block: HeapTupleHeaderIsHeapLatest compares t_ctid with the passed-in TID and returns true if they match. To know if the root lp is cached, we only rely on the HEAP_LATEST_TUPLE flag. Though if the flag is set, then it implies the latest tuple too. > > Please flag the macros that have multiple evaluation hazards -- there > are a few of them. > Can you please give me an example? I must be missing something. > > > > +/* > > + * Get TID of next tuple in the update chain. Caller should have > checked that > > + * we are not already at the end of the chain because in that case > t_ctid may > > + * actually store the root line pointer of the HOT chain whose member > this > > + * tuple is. 
> > + */
> > +#define HeapTupleHeaderGetNextTid(tup, next_ctid) \
> > +do { \
> > +  AssertMacro(!((tup)->t_infomask2 & HEAP_LATEST_TUPLE)); \
> > +  ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
> > +} while (0)
>
> Actually, I think this macro could just return the TID so that it can be
> used as struct assignment, just like ItemPointerCopy does internally --
> callers can do
>    ctid = HeapTupleHeaderGetNextTid(tup);

Yes, makes sense. Will fix.

> The API of RelationPutHeapTuple appears a bit contorted, where
> root_offnum is both input and output. I think it's cleaner to have the
> argument be the input, and have the output offset be the return value --
> please check whether that simplifies things; for example I think this:
>
> > +  root_offnum = InvalidOffsetNumber;
> > +  RelationPutHeapTuple(relation, buffer, heaptup, false,
> > +                       &root_offnum);
>
> becomes
>
>    root_offnum = RelationPutHeapTuple(relation, buffer, heaptup, false,
>                                       InvalidOffsetNumber);

Makes sense. Will fix.

> Many comments lack finishing periods in complete sentences, which looks
> odd. Please fix.

Sorry, not sure where I picked that style from. I see that the existing
code has both styles, though I will add finishing periods because I like
that style too.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
relcache invalidation correctly, and a third (PID 73091) which got it
wrong. Specifically, look at line #77 where the snapshot is obtained for
the second phase:

2017-01-29 10:44:31.238 UTC [73119] [xid=0]LOG: CIC (pgb_a_aid1) building first phase with snapshot (4670891:4670893)

At line 108, you'll see a new transaction with XID 4670908 still using the
old attribute list:

2017-01-29 10:44:31.889 UTC [73091] [xid=4670908]LOG: heap_update (cic_tab_accounts), hot_attrs ((b 9))

The other session, which I included for comparison, sees the new index and
recomputes its rd_indexattr in time, and hence that update works as
expected.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:
- cic_bug.tar.gz (GNU Zip compressed data)
- relcache_issue_2.log
- invalidate_indexattr.patch
Re: [HACKERS] Failure in commit_ts tap tests
On Sat, Jan 21, 2017 at 9:39 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Pavan Deolasee <pavan.deola...@gmail.com> writes:
> > On Sat, Jan 21, 2017 at 9:09 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> >> Hm, but what of the "null" value? Also, I get
> >>
> >> $ perl -e 'use warnings; use Test::More; ok("2017-01-01" != "null", "ok");'
> >> Argument "null" isn't numeric in numeric ne (!=) at -e line 1.
> >> Argument "2017-01-01" isn't numeric in numeric ne (!=) at -e line 1.
> >> ok 1 - ok
>
> > It declares the test as "passed", right?
>
> Oh! So it does. That is one darn weird behavior of the != operator.

Indeed! See this:

# leading numbers match; nothing beyond them is checked
$ perl -e 'if ("2017-23" != "2017-24") {print "Not equal\n"} else {print "Equal\n"}'
Equal

# leading numbers differ; the operator works as expected
$ perl -e 'if ("2017-23" != "2018-24") {print "Not equal\n"} else {print "Equal\n"}'
Not equal

# comparison of numeric with non-numeric works
$ perl -e 'if ("2017-23" != "Foo") {print "Not equal\n"} else {print "Equal\n"}'
Not equal

# numeric on the RHS works too
$ perl -e 'if ("Foo" != "2018-24") {print "Not equal\n"} else {print "Equal\n"}'
Not equal

These tests show that the operator returns the correct result when it
finds a numeric value at the start of the string, on either the LHS or
the RHS. Also, it compares the numeric values only up to the first
non-numeric character.

# no leading number on either side
$ perl -e 'if ("Fri 2017-23" != "Fri 2017-23") {print "Not equal\n"} else {print "Equal\n"}'
Equal

# no leading number on either side; arbitrary strings declared equal
$ perl -e 'if ("Fri 2017-23" != "Foo") {print "Not equal\n"} else {print "Equal\n"}'
Equal

These two tests show why we saw no failure earlier. If neither the LHS
nor the RHS string starts with a numeric value, the operator declares the
arguments equal, irrespective of their contents. I tested the same with
the == operator and it exhibits the same behaviour.
Weird, and I wonder how it's not a source of constant bugs in Perl code.
(I don't use Perl a lot, so maybe those who do are used to turning
warnings on, or already know this.)

> There's still the point that we're not actually exercising this script
> in the buildfarm ...

Yes indeed.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services