Re: 15beta1 test failure on mips in isolation/expected/stats
Re: Andres Freund > > I'd be good with tweaking this test case as you suggest, and maybe > > revisiting the topic later. > > Pushed the change of the test. Christoph, just to make sure, can you confirm > that this fixes the test instability for you? Unfortunately I could not reproduce the problem on the mipsel porter box I have access to (which is the same box used as mips64el porter box for the other thread). Running the "stats" test 30 times in a loop always made it pass. On the Debian buildds, the build has succeeded now in the 3rd try: https://buildd.debian.org/status/logs.php?pkg=postgresql-15=15%7Ebeta1-1=mipsel Christoph
Re: 15beta1 test failure on mips in isolation/expected/stats
Hi, On 2022-05-20 01:25:10 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2022-05-20 00:22:14 -0400, Tom Lane wrote: > >> There's some fallout in the expected-file, of course, but this > >> does seem to fix it (20 consecutive successful runs now at > >> 100/2). Don't see why though ... > > > I think what might be happening is that the transactional stats updates get > > reported by s2 *before* the non-transactional stats updates come in from > > s1. I.e. the pgstat_report_stat() at the end of s2_commit_prepared_a does a > > report, because the machine is slow enough for it to be "time to reports > > stats > > again". Then s1 reports its non-transactional stats. > > Sounds plausible. And I left the test loop running, and it's now past > 100 consecutive successes, so I think this change definitely "fixes" it. FWIW, the problem can be reliably reproduced by sticking a pgstat_force_next_flush() into pgstat_twophase_postcommit(). This is the only failure when doing so. > > It looks like our stats maintenance around truncation isn't quite > > "concurrency > > safe". That code hasn't meaningfully changed, but it'd not be surprising if > > it's not 100% precise... > > Yeah. Probably not something to try to improve post-beta, especially > since it's not completely clear how transactional and non-transactional > cases *should* interact. Yea. It's also not normally particularly crucial to be accurate down to that degree. > Maybe non-transactional updates should be > pushed immediately? But I'm not sure if that's fully correct, and > it definitely sounds expensive. I think that'd be far too expensive - the majority of stats are non-transactional... I think what we could do is to model truncates as subtracting the number of live/dead rows the truncating backend knows about, rather than setting them to 0. But that of course could incur other inaccuracies. > I'd be good with tweaking this test case as you suggest, and maybe > revisiting the topic later. Pushed the change of the test. Christoph, just to make sure, can you confirm that this fixes the test instability for you? > Kyotaro-san worried about whether any other places in stats.spec > have the same issue. I've not seen any evidence of that in my > tests, but perhaps some other machine with different timing > could find it. I tried to find some by putting in forced flushes in a bunch of places before, and now some more, without finding further cases. Greetings, Andres Freund
Re: 15beta1 test failure on mips in isolation/expected/stats
I wrote: > Andres Freund writes: >> I think what might be happening is that the transactional stats updates get >> reported by s2 *before* the non-transactional stats updates come in from >> s1. I.e. the pgstat_report_stat() at the end of s2_commit_prepared_a does a >> report, because the machine is slow enough for it to be "time to reports >> stats >> again". Then s1 reports its non-transactional stats. > Sounds plausible. And I left the test loop running, and it's now past > 100 consecutive successes, so I think this change definitely "fixes" it. In the light of morning, at least half of the parameter dependency is now obvious: the problematic test case involves a prepared transaction, so it fails completely with max_prepared_transactions = 0. The isolation test harness masks that by matching against stats_1.out, but it's not really a "success". My numbers do still suggest that there's a weak dependence on max_connections, but it's possible that that's a mirage. I did not run enough test cycles to be able to say positively that that's a real effect (and the machine's slow enough that I'm not excited about doing so). regards, tom lane
Re: 15beta1 test failure on mips in isolation/expected/stats
Andres Freund writes: > On 2022-05-20 00:22:14 -0400, Tom Lane wrote: >> There's some fallout in the expected-file, of course, but this >> does seem to fix it (20 consecutive successful runs now at >> 100/2). Don't see why though ... > I think what might be happening is that the transactional stats updates get > reported by s2 *before* the non-transactional stats updates come in from > s1. I.e. the pgstat_report_stat() at the end of s2_commit_prepared_a does a > report, because the machine is slow enough for it to be "time to reports stats > again". Then s1 reports its non-transactional stats. Sounds plausible. And I left the test loop running, and it's now past 100 consecutive successes, so I think this change definitely "fixes" it. > It looks like our stats maintenance around truncation isn't quite "concurrency > safe". That code hasn't meaningfully changed, but it'd not be surprising if > it's not 100% precise... Yeah. Probably not something to try to improve post-beta, especially since it's not completely clear how transactional and non-transactional cases *should* interact. Maybe non-transactional updates should be pushed immediately? But I'm not sure if that's fully correct, and it definitely sounds expensive. I'd be good with tweaking this test case as you suggest, and maybe revisiting the topic later. Kyotaro-san worried about whether any other places in stats.spec have the same issue. I've not seen any evidence of that in my tests, but perhaps some other machine with different timing could find it. regards, tom lane
Re: 15beta1 test failure on mips in isolation/expected/stats
Hi, On 2022-05-20 00:22:14 -0400, Tom Lane wrote: > Andres Freund writes: > > Hm. Does the instability vanish if you switch s2_commit_prepared_a and > > s1_ff? > > Like this? Yea. > diff --git a/src/test/isolation/specs/stats.spec > b/src/test/isolation/specs/stats.spec > index be4ae1f4ff..70be29b207 100644 > --- a/src/test/isolation/specs/stats.spec > +++ b/src/test/isolation/specs/stats.spec > @@ -562,8 +562,9 @@ permutation >s1_table_insert_k1 # should be counted >s1_table_update_k1 # dito >s1_prepare_a > + s1_ff >s2_commit_prepared_a > - s1_ff s2_ff > + s2_ff >s1_table_stats > > # S1 prepares, S1 aborts prepared > > There's some fallout in the expected-file, of course, but this > does seem to fix it (20 consecutive successful runs now at > 100/2). Don't see why though ... I think what might be happening is that the transactional stats updates get reported by s2 *before* the non-transactional stats updates come in from s1. I.e. the pgstat_report_stat() at the end of s2_commit_prepared_a does a report, because the machine is slow enough for it to be "time to reports stats again". Then s1 reports its non-transactional stats. It looks like our stats maintenance around truncation isn't quite "concurrency safe". That code hasn't meaningfully changed, but it'd not be surprising if it's not 100% precise... Greetings, Andres Freund
Re: 15beta1 test failure on mips in isolation/expected/stats
Andres Freund writes: > Hm. Does the instability vanish if you switch s2_commit_prepared_a and s1_ff? Like this? diff --git a/src/test/isolation/specs/stats.spec b/src/test/isolation/specs/stats.spec index be4ae1f4ff..70be29b207 100644 --- a/src/test/isolation/specs/stats.spec +++ b/src/test/isolation/specs/stats.spec @@ -562,8 +562,9 @@ permutation s1_table_insert_k1 # should be counted s1_table_update_k1 # dito s1_prepare_a + s1_ff s2_commit_prepared_a - s1_ff s2_ff + s2_ff s1_table_stats # S1 prepares, S1 aborts prepared There's some fallout in the expected-file, of course, but this does seem to fix it (20 consecutive successful runs now at 100/2). Don't see why though ... regards, tom lane
Re: 15beta1 test failure on mips in isolation/expected/stats
At Thu, 19 May 2022 22:58:14 -0400, Tom Lane wrote in > Kyotaro Horiguchi writes: > Notice that these markers can only delay reporting of the completion > of a step, not the launch of a step. The isolationtester will launch > the next step in a permutation as soon as (A) all prior steps of the > same session are done, and (B) the immediately preceding step in the > permutation is done or deemed blocked. For this purpose, "deemed > blocked" means that it has been seen to be waiting on a database lock, > or that it is complete but the report of its completion is delayed by > one of these markers. > > There's no "waiting..." reports in the test output, nor do we have any > condition markers in stats.spec right now, so I don't think any steps > have been "deemed blocked". Mmm... Thanks. I miunderstood the effect of it.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: 15beta1 test failure on mips in isolation/expected/stats
Hi, On 2022-05-19 22:58:14 -0400, Tom Lane wrote: > What I am wondering about at this point is whether the effects of > pg_stat_force_next_flush() could somehow be delayed until after we > have told the client the command is complete. It shouldn't - it just forces pg_stat_report_stat() to flush (rather than doing so based on the time of the last report). And pg_stat_report_stat() happens before ReadyForQuery(). Hm. Does the instability vanish if you switch s2_commit_prepared_a and s1_ff? Greetings, Andres Freund
Re: 15beta1 test failure on mips in isolation/expected/stats
Andres Freund writes: > On 2022-05-19 21:42:31 -0400, Tom Lane wrote: >> Even more interesting, the repeatability varies with the settings >> of max_connections and max_prepared_transactions. > That is indeed quite odd. >> At low values (resp. 20 and 0) I've not been able to make it happen at all, >> but at 100 and 2 it happens circa three times out of four. > Is this the only permutation that fails? No, but those values definitely seem to affect the probability of failure. I just finished a more extensive series of runs, and got: successes/tries with max_connections/max_prepared_transactions: 5/5 OK with 20/2 5/5 OK with 100/0 3/5 OK with 100/2 4/5 OK with 200/2 5/6 OK with 100/10 5/6 OK with 50/2 6/10 OK with 100/2 It seems like the probability decreases again if I raise either number further. So I'm now guessing that this is purely a timing issue and that somehow 100/2 is near the sweet spot for hitting the timing window. Why those settings would affect pgstats at all is unclear, though. regards, tom lane
Re: 15beta1 test failure on mips in isolation/expected/stats
Hi, On 2022-05-19 21:42:31 -0400, Tom Lane wrote: > > > > seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count > > > > ++-+-+-+--+--+ > > - 3| 9|5|1|0| 1| 1| > > 0 > > + 3| 9|5|1|0| 4| 1| > > 0 > > (1 row) > Even more interesting, the repeatability varies with the settings > of max_connections and max_prepared_transactions. That is indeed quite odd. > At low values (resp. 20 and 0) I've not been able to make it happen at all, > but at 100 and 2 it happens circa three times out of four. Is this the only permutation that fails? Greetings, Andres Freund
Re: 15beta1 test failure on mips in isolation/expected/stats
Kyotaro Horiguchi writes: > At Fri, 20 May 2022 11:02:21 +0900 (JST), Kyotaro Horiguchi > wrote in >> Doesn't the step s1_table_stats needs a blocking condition (s2_ff)? > s/needs/need/; > If I removed the step s2_ff, I see the same difference. So I think it > is that. Some other permutations seem to need the same. Hmm ... it does seem like the answer might be somewhere around there, but it's not there exactly. This won't fix it: - s1_table_stats + s1_table_stats(s2_ff) That sort of marker only stabilizes cases where two different steps might be reported as completing in either order. In this case, that's clearly not the problem. What we do need is to ensure that s1_table_stats doesn't *launch* before s2_ff is done. However, it doesn't look to me like that's what's happening. isolation/README explains that Notice that these markers can only delay reporting of the completion of a step, not the launch of a step. The isolationtester will launch the next step in a permutation as soon as (A) all prior steps of the same session are done, and (B) the immediately preceding step in the permutation is done or deemed blocked. For this purpose, "deemed blocked" means that it has been seen to be waiting on a database lock, or that it is complete but the report of its completion is delayed by one of these markers. There's no "waiting..." reports in the test output, nor do we have any condition markers in stats.spec right now, so I don't think any steps have been "deemed blocked". What I am wondering about at this point is whether the effects of pg_stat_force_next_flush() could somehow be delayed until after we have told the client the command is complete. I've not poked into the code in that area, but if that could happen it'd explain this behavior. regards, tom lane
Re: 15beta1 test failure on mips in isolation/expected/stats
At Fri, 20 May 2022 11:02:21 +0900 (JST), Kyotaro Horiguchi wrote in > Doesn't the step s1_table_stats needs a blocking condition (s2_ff)? s/needs/need/; If I removed the step s2_ff, I see the same difference. So I think it is that. Some other permutations seem to need the same. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/test/isolation/specs/stats.spec b/src/test/isolation/specs/stats.spec index be4ae1f4ff..441dd7d3b8 100644 --- a/src/test/isolation/specs/stats.spec +++ b/src/test/isolation/specs/stats.spec @@ -417,7 +417,7 @@ permutation s2_table_update_k1 s1_table_drop s2_ff - s1_table_stats + s1_table_stats(s2_ff) permutation s1_table_select @@ -461,7 +461,7 @@ permutation s2_table_select s1_track_counts_on s1_ff s2_ff - s1_table_stats + s1_table_stats(s2_ff) # but can count again after s1_table_select s1_table_update_k1 @@ -478,7 +478,7 @@ permutation s2_table_select s1_track_counts_on s1_ff s2_ff - s1_table_stats + s1_table_stats(s2_ff) s1_table_select s1_table_update_k1 s1_ff @@ -509,7 +509,7 @@ permutation s2_commit_prepared_a s1_table_select s1_ff s2_ff - s1_table_stats + s1_table_stats(s2_ff) # S1 prepares, S2 commits prepared permutation @@ -533,7 +533,7 @@ permutation s2_rollback_prepared_a s1_table_select s1_ff s2_ff - s1_table_stats + s1_table_stats(s2_ff) ### 2PC: truncate handling @@ -564,7 +564,7 @@ permutation s1_prepare_a s2_commit_prepared_a s1_ff s2_ff - s1_table_stats + s1_table_stats(s2_ff) # S1 prepares, S1 aborts prepared permutation @@ -592,7 +592,7 @@ permutation s1_prepare_a s2_rollback_prepared_a s1_ff s2_ff - s1_table_stats + s1_table_stats(s2_ff) ### 2PC: rolled back drop maintains live / dead counters @@ -627,7 +627,7 @@ permutation s1_prepare_a s2_rollback_prepared_a s1_ff s2_ff - s1_table_stats + s1_table_stats(s2_ff) ##
Re: 15beta1 test failure on mips in isolation/expected/stats
At Thu, 19 May 2022 21:42:31 -0400, Tom Lane wrote in > Christoph Berg writes: > > Debian unstable mips (the old 32-bit one): > > > --- /<>/src/test/isolation/expected/stats.out 2022-05-16 > > 21:10:42.0 + > > +++ /<>/build/src/test/isolation/output_iso/results/stats.out > > 2022-05-18 23:26:56.573000536 + > > @@ -2854,7 +2854,7 @@ > > > > > seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count > > > > ++-+-+-+--+--+ > > - 3| 9|5|1|0| 1| 1| > > 0 > > + 3| 9|5|1|0| 4| 1| > > 0 > > (1 row) > > I have just discovered that I can reproduce this identical symptom > fairly repeatably on an experimental lashup that I've been running > with bleeding-edge NetBSD on my ancient HPPA box. (You didn't think > I was just going to walk away from that hardware, did you?) > > Even more interesting, the repeatability varies with the settings > of max_connections and max_prepared_transactions. At low values > (resp. 20 and 0) I've not been able to make it happen at all, but > at 100 and 2 it happens circa three times out of four. > > I have no idea where to start looking, but this is clearly an issue > in the new stats code ... or else the hoped-for goal of removing > flakiness from the stats tests is just as far away as ever. Doesn't the step s1_table_stats needs a blocking condition (s2_ff)? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: 15beta1 test failure on mips in isolation/expected/stats
Christoph Berg writes: > Debian unstable mips (the old 32-bit one): > --- /<>/src/test/isolation/expected/stats.out2022-05-16 > 21:10:42.0 + > +++ /<>/build/src/test/isolation/output_iso/results/stats.out > 2022-05-18 23:26:56.573000536 + > @@ -2854,7 +2854,7 @@ > > seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count > > ++-+-+-+--+--+ > - 3| 9|5|1|0| 1| 1| > 0 > + 3| 9|5|1|0| 4| 1| > 0 > (1 row) I have just discovered that I can reproduce this identical symptom fairly repeatably on an experimental lashup that I've been running with bleeding-edge NetBSD on my ancient HPPA box. (You didn't think I was just going to walk away from that hardware, did you?) Even more interesting, the repeatability varies with the settings of max_connections and max_prepared_transactions. At low values (resp. 20 and 0) I've not been able to make it happen at all, but at 100 and 2 it happens circa three times out of four. I have no idea where to start looking, but this is clearly an issue in the new stats code ... or else the hoped-for goal of removing flakiness from the stats tests is just as far away as ever. regards, tom lane
15beta1 test failure on mips in isolation/expected/stats
Debian unstable mips (the old 32-bit one): test vacuum-conflict ... ok 2170 ms test vacuum-skip-locked ... ok 2445 ms test stats... FAILED38898 ms test horizons ... ok 4543 ms test predicate-hash ... ok22419 ms build/src/test/isolation/output_iso/regression.diffs diff -U3 /<>/src/test/isolation/expected/stats.out /<>/build/src/test/isolation/output_iso/results/stats.out --- /<>/src/test/isolation/expected/stats.out 2022-05-16 21:10:42.0 + +++ /<>/build/src/test/isolation/output_iso/results/stats.out 2022-05-18 23:26:56.573000536 + @@ -2854,7 +2854,7 @@ seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count ++-+-+-+--+--+ - 3| 9|5|1|0| 1| 1| 0 + 3| 9|5|1|0| 4| 1| 0 (1 row) Full build log: https://buildd.debian.org/status/fetch.php?pkg=postgresql-15=mipsel=15%7Ebeta1-1=1652916588=0 (I'll try rescheduling this build as well, the last one took 4h before it failed.) Christoph