Re: testing concurrency (was Re: [HACKERS] subtransaction assert failure)
Neil Conway <[EMAIL PROTECTED]> writes:
> On Fri, 2004-09-17 at 01:43, Tom Lane wrote:
>> PS: this points up once again that the regression tests do not do very
>> well at testing concurrent behavior.  We need to think about some sort
>> of parallel-test harness that would let us write tests that have better
>> odds of catching bugs like this.

> One incremental improvement to the existing RT system would be to make
> each test self-contained (i.e. as far as possible it would assume
> nothing about DBMS state before it ran, and not leave the DBMS state
> modified when it finished running).  If writing completely self-contained
> tests isn't possible, we could specify a dependency graph for the tests.
> Then we could have the regression test system generate a randomized
> grouping of tests to be executed.

This is pretty much what we have now, and it's quite useless, because the
tests that are run concurrently are very careful to stay out of each
others' way.  There are certainly no tests that deliberately exercise
concurrent-update scenarios, and AFAICS the scheme you are suggesting
would not do that either.

There are some speculations in the archives about writing a test engine
that could fire up multiple database connections and then issue specific
command sequences across the set of connections.  It's clear that you
could test questions like does-this-lock-really-block-that-operation that
way.  I'm not sure that this approach does anything much to catch problems
with small windows-of-opportunity though :-(

Mulling it over ... it seems the real issue is that the regression tests
are designed for reproducibility, whereas what we'd really like to probe
for this sort of bug is *non* reproducible behavior.  Not sure how to
approach that, except that maybe we shouldn't think in terms of tests that
every Joe Bob and Harry can run and not need any intelligence to interpret
the results of ...

			regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faqs/FAQ.html
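[Editorial sketch] The multi-connection test engine Tom speculates about can be sketched independently of a live server: each session is a named list of SQL steps, and a schedule says in what global order those steps are issued. The harness below is a hypothetical Python illustration — `StubConnection`, `run_spec`, and the session names are all invented here; a real engine would issue the steps over live libpq connections and compare the observed results against an expected file.

```python
# Minimal sketch of a multi-session test harness, assuming stubbed
# connections.  Each "session" is a list of SQL steps; the schedule
# interleaves them deterministically, which is how one would script a
# does-this-lock-really-block-that-operation test.

class StubConnection:
    """Stand-in for a real database connection; records issued SQL."""
    def __init__(self, name, log):
        self.name, self.log = name, log

    def execute(self, sql):
        self.log.append((self.name, sql))

def run_spec(sessions, schedule):
    """sessions: {name: [sql, ...]}; schedule: list of session names,
    each occurrence meaning 'issue that session's next step now'."""
    log = []
    conns = {n: StubConnection(n, log) for n in sessions}
    cursors = {n: iter(steps) for n, steps in sessions.items()}
    for name in schedule:
        conns[name].execute(next(cursors[name]))
    return log

sessions = {
    "s1": ["BEGIN", "UPDATE t SET x = 1", "COMMIT"],
    "s2": ["BEGIN", "UPDATE t SET x = 2", "COMMIT"],
}
# Interleave so s2's UPDATE is issued while s1's transaction is open:
log = run_spec(sessions, ["s1", "s1", "s2", "s2", "s1", "s2"])
```

As Tom notes, this kind of scripted interleaving can verify blocking behavior but does little for narrow race windows, which depend on timing the scheduler cannot control.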
testing concurrency (was Re: [HACKERS] subtransaction assert failure)
On Fri, 2004-09-17 at 01:43, Tom Lane wrote:
> PS: this points up once again that the regression tests do not do very
> well at testing concurrent behavior.  We need to think about some sort
> of parallel-test harness that would let us write tests that have better
> odds of catching bugs like this.

I agree -- something like this would be very useful. I'm a little stumped
about how to actually write such a thing, though...

One incremental improvement to the existing RT system would be to make
each test self-contained (i.e. as far as possible it would assume nothing
about DBMS state before it ran, and not leave the DBMS state modified
when it finished running). If writing completely self-contained tests
isn't possible, we could specify a dependency graph for the tests. Then
we could have the regression test system generate a randomized grouping
of tests to be executed.

In order to have reproducible results, we could associate each randomized
grouping with a numeric key (there should be some way to produce a 1-to-1
mapping). So someone reporting a regression test problem would say: "I
see the following regression test failures, the test key is xyz", and a
developer could run the regression tests specifying that key and have the
RT system run the same combination of tests in the same order that
produced the problem the user saw. This would be a complement to the
existing method of running the RTs.

I still think we need a separate way to test parallel behavior though, in
addition to the above idea. Can anyone think of a good way to do this?

-Neil
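[Editorial sketch] Neil's key-to-grouping idea can be sketched with a seeded PRNG. A seed is not a strict 1-to-1 mapping onto orderings (different keys can yield the same shuffle), but it gives the property that matters for bug reports: the same key always reproduces the same test order. The function name and test list below are invented for illustration:

```python
import random

def schedule_for_key(tests, key):
    """Derive a reproducible test ordering from a numeric key: the same
    key always yields the same shuffle, so a developer can rerun the
    exact combination a user reported."""
    order = list(tests)
    random.Random(key).shuffle(order)  # seeded, hence deterministic
    return order

tests = ["boolean", "char", "int4", "join", "subtrans"]
key = 20040917  # the "test key" a user would include in a bug report
run_order = schedule_for_key(tests, key)
```

A reported failure with "test key 20040917" would then be replayed by calling `schedule_for_key` with that same key; the ordering comes out identical, and every test is still run exactly once.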
Re: [HACKERS] subtransaction assert failure
On Thu, 16 Sep 2004, Tom Lane wrote:

> Gavin Sherry <[EMAIL PROTECTED]> writes:
> > Interestingly, I *cannot* recreate on the single CPU system and I cannot
> > get abort() to generate a core.
>
> By that do you mean that you don't see any corefile in the DB directory
> when you look after the dust settles?

Right. I was actually just doing a find . -name core and came up with no
results.

> I ran into the same problem yesterday in another connection, and
> eventually realized that the corefile is getting removed because of the
> logic I added recently to do WAL replay of CREATE/DROP DATABASE.  The
> regression test sequence is short enough (on modern machines) that there
> may not be any checkpoint between its start and the point where you have
> a crash, so that the initial "CREATE DATABASE regression" operation is
> still in the range of WAL entries to be replayed.  In dbcommands.c
> it sez:
>
> 	/*
> 	 * Our theory for replaying a CREATE is to forcibly drop the
> 	 * target subdirectory if present, then re-copy the source data.
> 	 * This may be more work than needed, but it is simple to
> 	 * implement.
> 	 */
>
> So what's happening is that WAL replay is wiping the database directory
> (including the core file).
>
> I don't really want to change the CREATE DATABASE replay logic, so I was
> thinking of suggesting that we hack around this by modifying pg_regress
> to force a checkpoint right after its CREATE DATABASE.  Then any crashes
> during the regression tests wouldn't cause a replay of the CREATE.  This
> is mighty ugly though :-(

Yes, a bit ugly. Not as ugly as my putting a for(;;) in
SubTransGetTopmostTransaction() when the Assert condition would fail so
that I could get a useful backtrace.

Gavin
Re: [HACKERS] subtransaction assert failure
Neil Conway <[EMAIL PROTECTED]> writes:
> I'm seeing an intermittent assertion failure while running "make check"
> with current sources.

> TRAP: FailedAssertion("!(TransactionIdFollowsOrEquals(xid,
> RecentXmin))", File:
> "/Users/neilc/pgsql/src/backend/access/transam/subtrans.c", Line: 146)

Hmm, I haven't seen this myself but I see what's causing it --- sloppy
thinking about the role of RecentXmin.

We need another variable named something like TransactionXmin, which
would be the RecentXmin first computed during the current transaction.
(This would be a copy of MyProc->xmin; we could just look directly at
that but I think I prefer a separate backend-local variable.)  The tests
that protect access to pg_subtrans need to use this, not RecentXmin, as
the cutoff point.

The problem of course is that RecentXmin is the xmin associated with the
most recently computed snapshot, but there may be older snapshots still
in use in the transaction.  The time qual test code needs to be able to
get answers for transactions that could be present in any of the
transaction's snapshots, not only the most recent.

Mea culpa, will fix.

			regards, tom lane

PS: this points up once again that the regression tests do not do very
well at testing concurrent behavior.  We need to think about some sort
of parallel-test harness that would let us write tests that have better
odds of catching bugs like this.
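[Editorial sketch] The RecentXmin/TransactionXmin distinction can be shown with a toy model. The pg_subtrans check guards lookups against a cutoff xid; using the *latest* snapshot's xmin as that cutoff is unsafe because an older snapshot taken earlier in the same transaction may still need to resolve older xids. The values below are illustrative, chosen to mirror the single-xid gap seen in the gdb output reported in this thread:

```python
# Toy model of the pg_subtrans cutoff bug.  A transaction takes several
# snapshots; each snapshot has an xmin, and later snapshots can have
# larger xmins as old transactions finish.

snapshot_xmins = [6384, 6385]          # snapshots taken, oldest first
recent_xmin = snapshot_xmins[-1]       # xmin of the most recent snapshot
transaction_xmin = snapshot_xmins[0]   # first xmin computed in the xact
                                       # (the proposed TransactionXmin)

xid = 6384  # an xid the *older* snapshot still needs to resolve

# The failing Assert effectively tested xid >= RecentXmin, which trips
# here because the cutoff moved past an xid still of interest:
assert not (xid >= recent_xmin)        # the reported TRAP condition

# With the proposed TransactionXmin as the cutoff, the check holds for
# every snapshot the transaction could still be using:
assert xid >= transaction_xmin
```

This is a model of the reasoning only; the actual fix lives in the backend's snapshot bookkeeping, not in any script.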
Re: [HACKERS] subtransaction assert failure
Michael Glaesemann wrote:

> On Sep 16, 2004, at 8:48 PM, Andrew Dunstan wrote:
> > I am also seeing regular "make check" failures on my test buildfarm
> > system (FC1, 2.4.22 kernel, cassert turned on):
>
> Mmm :) buildfarm! I like the sound of that! Let us know how it goes, eh?

As you saw, the actual running of tests and upload of results is
functional. It's designed as a distributed farm, somewhat along the lines
of what Samba uses. In a few days when I have the registration and
display pieces done I will be calling for a handful of volunteers to help
me test it. Farm members will need internet access to pull from CVS and
to push test results, plus Perl with the LWP, HTTP::Request::Common,
MIME::Base64, Digest::SHA1 and File::Find modules (most of these come
with a base Perl install these days). The whole thing runs from cron - no
user action required. Configuration for a farm member would be trivial.

cheers

andrew
Re: [HACKERS] subtransaction assert failure
Gavin Sherry <[EMAIL PROTECTED]> writes:
> Interestingly, I *cannot* recreate on the single CPU system and I cannot
> get abort() to generate a core.

By that do you mean that you don't see any corefile in the DB directory
when you look after the dust settles?

I ran into the same problem yesterday in another connection, and
eventually realized that the corefile is getting removed because of the
logic I added recently to do WAL replay of CREATE/DROP DATABASE.  The
regression test sequence is short enough (on modern machines) that there
may not be any checkpoint between its start and the point where you have
a crash, so that the initial "CREATE DATABASE regression" operation is
still in the range of WAL entries to be replayed.  In dbcommands.c
it sez:

	/*
	 * Our theory for replaying a CREATE is to forcibly drop the
	 * target subdirectory if present, then re-copy the source data.
	 * This may be more work than needed, but it is simple to
	 * implement.
	 */

So what's happening is that WAL replay is wiping the database directory
(including the core file).

I don't really want to change the CREATE DATABASE replay logic, so I was
thinking of suggesting that we hack around this by modifying pg_regress
to force a checkpoint right after its CREATE DATABASE.  Then any crashes
during the regression tests wouldn't cause a replay of the CREATE.  This
is mighty ugly though :-(

			regards, tom lane
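[Editorial sketch] The disappearing-corefile mechanism and the proposed workaround can be illustrated with a toy replay model: recovery replays WAL from the last checkpoint onward, and a replayed CREATE DATABASE forcibly drops and re-copies the directory, taking any crash-time core file with it. Everything below (the `replay` function, the record strings) is invented for illustration, not pg_regress or backend code:

```python
def replay(wal_records, start_after_checkpoint, files):
    """Replay WAL records; if start_after_checkpoint names a record,
    replay only what follows it (i.e. recovery starts at the last
    checkpoint).  None means no checkpoint was taken, so replay all."""
    replaying = start_after_checkpoint is None
    for rec in wal_records:
        if rec == start_after_checkpoint:
            replaying = True          # recovery starts after this point
            continue
        if replaying and rec == "CREATE DATABASE regression":
            # Per dbcommands.c: forcibly drop the target subdirectory,
            # then re-copy the source data -- wiping the core file too.
            files.clear()
            files.add("template files")
    return files

wal = ["CREATE DATABASE regression", "CHECKPOINT", "some updates"]

# Without a checkpoint after the CREATE, replay re-runs it and the
# core file written at crash time is wiped:
wiped = replay(["CREATE DATABASE regression", "some updates"],
               None, {"template files", "core"})

# With pg_regress forcing a checkpoint right after CREATE DATABASE,
# replay starts past the CREATE and the core file survives:
kept = replay(wal, "CHECKPOINT", {"template files", "core"})
```

The model shows why the hack works: the checkpoint merely moves the replay start point past the CREATE, at the cost of an extra checkpoint in every regression run.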
Re: [HACKERS] subtransaction assert failure
On Thu, 16 Sep 2004, Dennis Bjorklund wrote:

> > Interestingly, I *cannot* recreate on the single CPU system and I cannot
>
> I've been trying to reproduce it on a single cpu system with fedora 2,
> but no luck.

I spoke too soon. After having run it even more times I finally got a
failure as well.

-- 
/Dennis Björklund
Re: [HACKERS] subtransaction assert failure
On Sep 16, 2004, at 8:48 PM, Andrew Dunstan wrote:

> I am also seeing regular "make check" failures on my test buildfarm
> system (FC1, 2.4.22 kernel, cassert turned on):

Mmm :) buildfarm! I like the sound of that! Let us know how it goes, eh?

Michael Glaesemann
grzm myrealbox com
Re: [HACKERS] subtransaction assert failure
On Thu, 16 Sep 2004, Gavin Sherry wrote:

> On Thu, 16 Sep 2004, Neil Conway wrote:
> > I can consistently repro this, provided I have the patience to run "make
> > check" enough times. Can anyone else repro the problem?
>
> I've managed to recreate this.
>
> I've tried on two linux 2.4 systems. One way and four way systems.
> Interestingly, I *cannot* recreate on the single CPU system and I cannot
> get abort() to generate a core. Neil has found the same problem under
> Linux.

Attached is a bt which does not appear to be corrupted:

#0  0x080adb70 in SubTransGetTopmostTransaction (xid=6384) at subtrans.c:149
#1  0x08201984 in HeapTupleSatisfiesSnapshot (tuple=0x4074eba8, snapshot=0x8388908) at tqual.c:798
#2  0x08083f17 in heapgettup (relation=0x40b17ba0, dir=1, tuple=0x83c09ec, buffer=0x83c0a04, snapshot=0x8388908, nkeys=0, key=0x0, pages=2) at heapam.c:312
#3  0x08084db2 in heap_getnext (scan=0x83c09d8, direction=ForwardScanDirection) at heapam.c:830
#4  0x0812518a in SeqNext (node=0x0) at nodeSeqscan.c:102
#5  0x0811d664 in ExecScan (node=0x83c0594, accessMtd=0x8125110) at execScan.c:98
#6  0x081251af in ExecSeqScan (node=0x83c0594) at nodeSeqscan.c:139
#7  0x0811904e in ExecProcNode (node=0x83c0594) at execProcnode.c:303
#8  0x08117a89 in ExecutePlan (estate=0x83bfd44, planstate=0x83c0594, operation=CMD_UPDATE, numberTuples=0, direction=137262828, dest=0x82a1020) at execMain.c:1060
#9  0x08116eeb in ExecutorRun (queryDesc=0x8388934, direction=ForwardScanDirection, count=0) at execMain.c:226
#10 0x0812abbe in _SPI_pquery (queryDesc=0x8388934, tcount=0) at spi.c:1498
#11 0x0812a978 in _SPI_execute_plan (plan=0x83d620c, Values=0x83840dc, Nulls=0x838448c " ~", '\177' ..., snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\0', tcount=0) at spi.c:1427
#12 0x081290b4 in SPI_execute_plan (plan=0x83d620c, Values=0x83840dc, Nulls=0x838448c " ~", '\177' ..., read_only=0 '\0', tcount=0) at spi.c:309
#13 0x40b48d69 in exec_stmt_execsql (estate=0xbfffcee0, stmt=0x82e76ec) at pl_exec.c:2155
#14 0x40b47782 in exec_stmt (estate=0xbfffcee0, stmt=0x838f988) at pl_exec.c:1108
#15 0x40b47649 in exec_stmts (estate=0xbfffcee0, stmts=0x838fa20) at pl_exec.c:1024
#16 0x40b47a9c in exec_stmt_if (estate=0xbfffcee0, stmt=0x835dbf0) at pl_exec.c:1260
#17 0x40b476ea in exec_stmt (estate=0xbfffcee0, stmt=0x835dbf0) at pl_exec.c:1068
#18 0x40b47649 in exec_stmts (estate=0xbfffcee0, stmts=0x835dc08) at pl_exec.c:1024
#19 0x40b4751d in exec_stmt_block (estate=0xbfffcee0, block=0x835dcb8) at pl_exec.c:979
#20 0x40b46c24 in plpgsql_exec_trigger (func=0x838fa90, trigdata=0xbfffd0f0) at pl_exec.c:655
#21 0x40b4368a in plpgsql_call_handler (fcinfo=0xbfffcfb0) at pl_handler.c:124
#22 0x08104393 in ExecCallTriggerFunc (trigdata=0xbfffd0f0, finfo=0x8338dd4, per_tuple_context=0x832fc74) at trigger.c:1149
#23 0x08105368 in AfterTriggerExecute (event=0x83d133c, rel=0x1, trigdesc=0x8338bc4, finfo=0x0, per_tuple_context=0x832fc74) at trigger.c:1964
#24 0x08105746 in afterTriggerInvokeEvents (events=0x83d12f0, firing_id=0, delete_ok=1 '\001') at trigger.c:2160
#25 0x08105996 in AfterTriggerEndQuery () at trigger.c:2327
#26 0x0818a758 in ProcessQuery (parsetree=0x8338b78, plan=0x832e0f0, params=0x0, dest=0xbfffd360, completionTag=0xbfffd360 "UPDATE 1") at pquery.c:215
#27 0x0818b5bf in PortalRunMulti (portal=0x83369ac, dest=0x83adf08, altdest=0x83adf08, completionTag=0xbfffd360 "UPDATE 1") at pquery.c:988
#28 0x0818aedf in PortalRun (portal=0x83369ac, count=2147483647, dest=0x83adf08, altdest=0x83adf08, completionTag=0xbfffd360 "UPDATE 1") at pquery.c:601
#29 0x0818794c in exec_simple_query (query_string=0x832c4fc "update PField set name = 'PF0_2' where name = 'PF0_X';") at postgres.c:924
#30 0x08189c1f in PostgresMain (argc=4, argv=0x82e7a80, username=0x82e791c "swm") at postgres.c:2967
#31 0x08161593 in BackendRun (port=0x82f6958) at postmaster.c:2848
#32 0x08161031 in BackendStartup (port=0x82f6958) at postmaster.c:2470
#33 0x0815f717 in ServerLoop () at postmaster.c:1215
#34 0x0815ec66 in PostmasterMain (argc=4, argv=0x82e74b0) at postmaster.c:898
#35 0x081318eb in main (argc=4, argv=0x82e74b0) at main.c:265
#36 0x42015574 in __libc_start_main () from /lib/tls/libc.so.6

The following may also be helpful:

(gdb) print RecentXmin
$2 = 6385

The xid passed to SubTransGetTopmostTransaction() is 6384.

Gavin
Re: [HACKERS] subtransaction assert failure
I am also seeing regular "make check" failures on my test buildfarm
system (FC1, 2.4.22 kernel, cassert turned on):

 sysname |      snapshot       | status | stage
---------+---------------------+--------+-------
 cat     | 2004-09-15 15:25:59 |      2 | Check
 cat     | 2004-09-15 23:27:37 |      0 | OK
 cat     | 2004-09-16 00:25:55 |      2 | Check

cheers

andrew

Neil Conway wrote:

> I'm seeing an intermittent assertion failure while running "make check"
> with current sources. I can usually reproduce the problem at least once
> out of every 3 or 4 runs of "make check" on each of two different
> machines (one is an 866 mhz iBook running OSX, the other is a dual Xeon
> @ 2.8 ghz running Debian w/ kernel 2.6). The assertion failure is:
>
> TRAP: FailedAssertion("!(TransactionIdFollowsOrEquals(xid,
> RecentXmin))", File:
> "/Users/neilc/pgsql/src/backend/access/transam/subtrans.c", Line: 146)
>
> The backtrace is:
>
> #0  0x900429ac in kill ()
> #1  0x9009eb1c in abort ()
> #2  0x001c3b60 in ExceptionalCondition (...) at
>     /Users/neilc/pgsql/src/backend/utils/error/assert.c:51
> #3  0x000461c8 in SubTransGetTopmostTransaction (xid=6145) at
>     /Users/neilc/pgsql/src/backend/access/transam/subtrans.c:146
> [...]
>
> (captured from the OSX machine; note that certain parts of the
> backtrace seem corrupted, so I wouldn't put _too_ much faith in it.)
>
> I can consistently repro this, provided I have the patience to run
> "make check" enough times. Can anyone else repro the problem?
>
> -Neil
Re: [HACKERS] subtransaction assert failure
On Thu, 16 Sep 2004, Gavin Sherry wrote:

> Interestingly, I *cannot* recreate on the single CPU system and I cannot

I've been trying to reproduce it on a single cpu system with fedora 2,
but no luck.

If someone else wants to try, don't forget the configure flag
--enable-cassert.

-- 
/Dennis Björklund
Re: [HACKERS] subtransaction assert failure
On Thu, 16 Sep 2004, Neil Conway wrote:

> I can consistently repro this, provided I have the patience to run "make
> check" enough times. Can anyone else repro the problem?

I've managed to recreate this.

I've tried on two linux 2.4 systems, a one-way and a four-way.
Interestingly, I *cannot* recreate on the single CPU system and I cannot
get abort() to generate a core. Neil has found the same problem under
Linux.

Gavin
[HACKERS] subtransaction assert failure
I'm seeing an intermittent assertion failure while running "make check"
with current sources. I can usually reproduce the problem at least once
out of every 3 or 4 runs of "make check" on each of two different
machines (one is an 866 mhz iBook running OSX, the other is a dual Xeon
@ 2.8 ghz running Debian w/ kernel 2.6). The assertion failure is:

TRAP: FailedAssertion("!(TransactionIdFollowsOrEquals(xid,
RecentXmin))", File:
"/Users/neilc/pgsql/src/backend/access/transam/subtrans.c", Line: 146)

The backtrace is:

#0  0x900429ac in kill ()
#1  0x9009eb1c in abort ()
#2  0x001c3b60 in ExceptionalCondition (conditionName=0x0, errorType=0x25 "", fileName=0x95 "", lineNumber=1768842554) at /Users/neilc/pgsql/src/backend/utils/error/assert.c:51
#3  0x000461c8 in SubTransGetTopmostTransaction (xid=6145) at /Users/neilc/pgsql/src/backend/access/transam/subtrans.c:146
#4  0x001e266c in HeapTupleSatisfiesSnapshot (tuple=0x1801, snapshot=0x1801) at /Users/neilc/pgsql/src/backend/utils/time/tqual.c:798
#5  0x00016330 in heap_release_fetch (relation=0xae2a40, snapshot=0x203ec1c, tuple=0x2048f6c, userbuf=0x2048f84, keep_buf=1 '\001', pgstat_info=0x2048fa8) at /Users/neilc/pgsql/src/backend/access/heap/heapam.c:973
#6  0x0001efa4 in index_getnext (scan=0x1, direction=11414080) at /Users/neilc/pgsql/src/backend/access/index/indexam.c:524
#7  0x000cd420 in IndexNext (node=0x27615c) at /Users/neilc/pgsql/src/backend/executor/nodeIndexscan.c:316
#8  0x000c6f5c in ExecScan (node=0x20489d8, accessMtd=0xcd114) at /Users/neilc/pgsql/src/backend/executor/execScan.c:98
#9  0x000c1c4c in ExecProcNode (node=0x20489d8) at /Users/neilc/pgsql/src/backend/executor/execProcnode.c:307
#10 0x000ceac8 in ExecMergeJoin (node=0x20489d8) at /Users/neilc/pgsql/src/backend/executor/nodeMergejoin.c:754
#11 0x000c1c88 in ExecProcNode (node=0x2048680) at /Users/neilc/pgsql/src/backend/executor/execProcnode.c:330
#12 0x000cabe8 in agg_retrieve_direct (aggstate=0x2048388) at /Users/neilc/pgsql/src/backend/executor/nodeAgg.c:762
#13 0x000c1cc4 in ExecProcNode (node=0x2048388) at /Users/neilc/pgsql/src/backend/executor/execProcnode.c:353
#14 0x000d0b84 in ExecSort (node=0x20482fc) at /Users/neilc/pgsql/src/backend/executor/nodeSort.c:102
#15 0x000c1cac in ExecProcNode (node=0x20482fc) at /Users/neilc/pgsql/src/backend/executor/execProcnode.c:345
#16 0x000c0294 in ExecutePlan (estate=0x204801c, planstate=0x20482fc, operation=CMD_SELECT, numberTuples=0, direction=1768842554, dest=0xaf2684) at /Users/neilc/pgsql/src/backend/executor/execMain.c:1060
#17 0x000bf4ac in ExecutorRun (queryDesc=0x203ec48, direction=33849372, count=0) at /Users/neilc/pgsql/src/backend/executor/execMain.c:226
#18 0x001521f8 in PortalRunSelect (portal=0x201f5c8, forward=-4 '?', count=33811528, dest=0x0) at /Users/neilc/pgsql/src/backend/tcop/pquery.c:718
[...]

(captured from the OSX machine; note that certain parts of the backtrace
seem corrupted, so I wouldn't put _too_ much faith in it.)

I can consistently repro this, provided I have the patience to run "make
check" enough times. Can anyone else repro the problem?

-Neil