Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-07-24 Thread Andres Freund
Hi, On 2021-06-21 05:29:19 -0700, Andres Freund wrote: > On 2021-06-16 12:12:23 -0700, Andres Freund wrote: > > Could you share your testcase? I've been working on a series of patches > > to address this (I'll share in a bit), and I've run quite a few tests, > > and didn't hit any infinite loops.

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-07-16 Thread Peter Geoghegan
On Wed, Jun 16, 2021 at 1:21 PM Peter Geoghegan wrote: > Oh yeah, I think that I get it now. Tell me if this sounds right to you: > > It's not so much that HeapTupleSatisfiesVacuum() "disagrees" with > heap_prune_satisfies_vacuum() in a way that actually matters to > VACUUM. While there does seem

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-21 Thread Andres Freund
Hi, On 2021-06-16 12:12:23 -0700, Andres Freund wrote: > Could you share your testcase? I've been working on a series of patches > to address this (I'll share in a bit), and I've run quite a few tests, > and didn't hit any infinite loops. Sorry for not yet doing that. Unfortunately I have an

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-16 Thread Peter Geoghegan
On Wed, Jun 16, 2021 at 12:22 PM Andres Freund wrote: > I think it's more complicated than that - "before" isn't a guarantee when the > horizon can go backwards. Consider the case where a hot_standby_feedback=on > replica without a slot connects - that can result in the xid horizon going >

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-16 Thread Matthias van de Meent
On Wed, 16 Jun 2021 at 21:22, Andres Freund wrote: > > Hi, > > On 2021-06-16 09:46:07 -0700, Peter Geoghegan wrote: > > On Wed, Jun 16, 2021 at 9:03 AM Peter Geoghegan wrote: > > > On Wed, Jun 16, 2021 at 3:59 AM Matthias van de Meent > > > > So the implicit assumption in heap_page_prune that >

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-16 Thread Matthias van de Meent
On Wed, 16 Jun 2021 at 21:12, Andres Freund wrote: > > Hi, > > On 2021-06-16 12:59:33 +0200, Matthias van de Meent wrote: > > PFA my adapted patch that fixes this new-ish issue, and does not > > include the (incorrect) assertions in GlobalVisUpdateApply. I've > > tested this against the

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-16 Thread Andres Freund
Hi, On 2021-06-16 09:46:07 -0700, Peter Geoghegan wrote: > On Wed, Jun 16, 2021 at 9:03 AM Peter Geoghegan wrote: > > On Wed, Jun 16, 2021 at 3:59 AM Matthias van de Meent > > > So the implicit assumption in heap_page_prune that > > > HeapTupleSatisfiesVacuum(OldestXmin) is always consistent

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-16 Thread Andres Freund
Hi, On 2021-06-16 12:59:33 +0200, Matthias van de Meent wrote: > PFA my adapted patch that fixes this new-ish issue, and does not > include the (incorrect) assertions in GlobalVisUpdateApply. I've > tested this against the reproducing case, both with and without the > fix in

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-16 Thread Peter Geoghegan
On Wed, Jun 16, 2021 at 9:03 AM Peter Geoghegan wrote: > On Wed, Jun 16, 2021 at 3:59 AM Matthias van de Meent > > So the implicit assumption in heap_page_prune that > > HeapTupleSatisfiesVacuum(OldestXmin) is always consistent with > > heap_prune_satisfies_vacuum(vacrel) has never been true. In

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-16 Thread Peter Geoghegan
On Wed, Jun 16, 2021 at 3:59 AM Matthias van de Meent wrote: > On Tue, 15 Jun 2021 at 03:22, Andres Freund wrote: > > > @@ -4032,6 +4039,24 @@ GlobalVisTestShouldUpdate(GlobalVisState *state) > > > static void > > > GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons) > > > { > > > +

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-16 Thread Matthias van de Meent
On Tue, 15 Jun 2021 at 03:22, Andres Freund wrote: > > Hi, > > > @@ -4032,6 +4039,24 @@ GlobalVisTestShouldUpdate(GlobalVisState *state) > > static void > > GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons) > > { > > + /* assert non-decreasing nature of horizons */ > > Thinking more

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-14 Thread Andres Freund
Hi, > @@ -4032,6 +4039,24 @@ GlobalVisTestShouldUpdate(GlobalVisState *state) > static void > GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons) > { > + /* assert non-decreasing nature of horizons */ > + Assert(FullTransactionIdFollowsOrEquals( > +

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-14 Thread Andres Freund
Hi, On 2021-06-14 11:53:47 +0200, Matthias van de Meent wrote: > On Thu, 10 Jun 2021 at 19:43, Peter Geoghegan wrote: > > > > On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent > > wrote: > > > I see one exit for HEAPTUPLE_DEAD on a potentially recently committed > > > xvac (?), and we

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-14 Thread Matthias van de Meent
On Thu, 10 Jun 2021 at 19:43, Peter Geoghegan wrote: > > On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent > wrote: > > I see one exit for HEAPTUPLE_DEAD on a potentially recently committed > > xvac (?), and we might also check against recently committed > > transactions if xmin == xmax,

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Peter Geoghegan
On Thu, Jun 10, 2021 at 7:38 PM Andres Freund wrote: > Well, I'd like to add assertions ensuring the retry path is only entered > when correct - but I feel hesitant about doing so when I can't exercise > that path reliably in at least some of the situations. I originally tested the

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Andres Freund
Hi, On 2021-06-10 19:15:59 -0700, Peter Geoghegan wrote: > On Thu, Jun 10, 2021 at 7:00 PM Andres Freund wrote: > > I'm not convinced - right now we don't exercise this path in tests at > > all. More assertions won't change that - stuff that can be triggered in > > production-ish loads doesn't

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Peter Geoghegan
On Thu, Jun 10, 2021 at 7:00 PM Andres Freund wrote: > I'm not convinced - right now we don't exercise this path in tests at > all. More assertions won't change that - stuff that can be triggered in > production-ish loads doesn't help during development. I do think that > that makes it far too

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Tom Lane
Peter Geoghegan writes: > ISTM that it would be much more useful to focus on adding an assertion > (or maybe even a "can't happen" error) that fails when the DEAD/goto > path is reached with a tuple whose xmin wasn't aborted. If that was in > place then we would have caught the bug in >

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Andres Freund
Hi, On 2021-06-10 18:49:50 -0700, Peter Geoghegan wrote: > ISTM that it would be much more useful to focus on adding an assertion > (or maybe even a "can't happen" error) that fails when the DEAD/goto > path is reached with a tuple whose xmin wasn't aborted. If that was in > place then we would

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Peter Geoghegan
On Thu, Jun 10, 2021 at 5:58 PM Andres Freund wrote: > The problem with writing a test is likely to find a way to halfway > reliably schedule a transaction abort after pruning, but before the > tuple-removal loop? Does anybody see a trick to do so? I asked Alexander about using his pending stop

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Andres Freund
Hi, On 2021-06-08 19:18:18 -0500, Justin Pryzby wrote: > I reproduced the issue on a new/fresh cluster like this: > > ./postgres -D data -c autovacuum_naptime=1 -c > autovacuum_analyze_scale_factor=0.005 -c log_autovacuum_min_duration=-1 > psql -h /tmp postgres -c "CREATE TABLE t(i int); INSERT

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Alvaro Herrera
On 2021-Jun-10, Peter Geoghegan wrote: > On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent > wrote: > > I see one exit for HEAPTUPLE_DEAD on a potentially recently committed > > xvac (?), and we might also check against recently committed > > transactions if xmin == xmax, although

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Peter Geoghegan
On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent wrote: > I see one exit for HEAPTUPLE_DEAD on a potentially recently committed > xvac (?), and we might also check against recently committed > transactions if xmin == xmax, although apparently that is not > implemented right now. I don't

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Matthias van de Meent
On Thu, 10 Jun 2021 at 19:07, Peter Geoghegan wrote: > > On Thu, Jun 10, 2021 at 9:57 AM Matthias van de Meent > wrote: > > > By "matches what we expect", I meant "involves a just-aborted > > > transaction". We could defensively verify that the inserting > > > transaction concurrently aborted at

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Andres Freund
Hi, On 2021-06-10 17:49:05 +0200, Matthias van de Meent wrote: > Apart from this, I'm also quite certain that the goto-branch that > created this infinite loop should have been dead code: In a correctly > working system, the GlobalVis*Rels should always be at least as strict > as the

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Peter Geoghegan
On Thu, Jun 10, 2021 at 9:57 AM Matthias van de Meent wrote: > > By "matches what we expect", I meant "involves a just-aborted > > transaction". We could defensively verify that the inserting > > transaction concurrently aborted at the point of retrying/calling > > heap_page_prune() a second

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Matthias van de Meent
On Thu, 10 Jun 2021 at 18:03, Peter Geoghegan wrote: > > On Thu, Jun 10, 2021 at 8:49 AM Matthias van de Meent > wrote: > > Could you elaborate on what this "matches what we expect" entails? > > > > Apart from this, I'm also quite certain that the goto-branch that > > created this infinite loop

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Peter Geoghegan
On Thu, Jun 10, 2021 at 8:49 AM Matthias van de Meent wrote: > Could you elaborate on what this "matches what we expect" entails? > > Apart from this, I'm also quite certain that the goto-branch that > created this infinite loop should have been dead code: In a correctly > working system, the

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Matthias van de Meent
On Wed, 9 Jun 2021 at 22:45, Peter Geoghegan wrote: > > On Wed, Jun 9, 2021 at 11:45 AM Andres Freund wrote: > > Good find! > > +1 > > > > The attached patch fixes this inconsistency > > > > I think I prefer applying the fix and the larger changes separately. > > I wonder if it's worth making

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-10 Thread Matthias van de Meent
On Wed, 9 Jun 2021 at 20:45, Andres Freund wrote: > > Specifically, the issue is that it uses the innocuous looking > > else if (RelationIsAccessibleInLogicalDecoding(rel)) > return horizons.catalog_oldest_nonremovable; > > but that's not sufficient, because > > #define

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-09 Thread Peter Geoghegan
On Wed, Jun 9, 2021 at 11:45 AM Andres Freund wrote: > Good find! +1 > > The attached patch fixes this inconsistency > > I think I prefer applying the fix and the larger changes separately. I wonder if it's worth making the goto inside lazy_scan_prune verify that the heap tuple matches what we

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-09 Thread Andres Freund
Hi, Good find! On 2021-06-09 17:42:34 +0200, Matthias van de Meent wrote: > I believe that I've found the culprit: > GetOldestNonRemovableTransactionId(rel) does not use the exact same > conditions for returning OldestXmin as GlobalVisTestFor(rel) does. > This results in different minimal XIDs,

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-09 Thread Matthias van de Meent
On Wed, 9 Jun 2021 at 04:42, Michael Paquier wrote: > > On Tue, Jun 08, 2021 at 05:47:28PM -0700, Peter Geoghegan wrote: > > I don't have time to try this out myself today, but offhand I'm pretty > > confident that this is sufficient to reproduce the underlying bug > > itself. And if that's true

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Michael Paquier
On Tue, Jun 08, 2021 at 05:47:28PM -0700, Peter Geoghegan wrote: > I don't have time to try this out myself today, but offhand I'm pretty > confident that this is sufficient to reproduce the underlying bug > itself. And if that's true then I guess it can't have anything to do > with the

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Masahiko Sawada
On Wed, Jun 9, 2021 at 2:17 AM Andres Freund wrote: > > Hi, > > On 2021-06-08 14:27:14 +0200, Matthias van de Meent wrote: > > heap_prune_satisfies_vacuum considers 1 more transaction to be > > unvacuumable, and thus indeed won't vacuum that tuple that > > HeapTupleSatisfiesVacuum does want to be

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Justin Pryzby
On Tue, Jun 08, 2021 at 05:44:15PM -0700, Peter Geoghegan wrote: > On Tue, Jun 8, 2021 at 5:11 PM Tom Lane wrote: > > I wonder if this is a variant of the problem shown at > > > > https://www.postgresql.org/message-id/2591376.1621196582%40sss.pgh.pa.us > > > > where maybe_needed was visibly quite

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Peter Geoghegan
On Tue, Jun 8, 2021 at 5:18 PM Justin Pryzby wrote: > I reproduced the issue on a new/fresh cluster like this: > > ./postgres -D data -c autovacuum_naptime=1 -c > autovacuum_analyze_scale_factor=0.005 -c log_autovacuum_min_duration=-1 > psql -h /tmp postgres -c "CREATE TABLE t(i int); INSERT

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Peter Geoghegan
On Tue, Jun 8, 2021 at 5:11 PM Tom Lane wrote: > I wonder if this is a variant of the problem shown at > > https://www.postgresql.org/message-id/2591376.1621196582%40sss.pgh.pa.us > > where maybe_needed was visibly quite insane. This value is > less visibly insane, but it's still wrong. It

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Justin Pryzby
On Tue, Jun 08, 2021 at 02:38:37PM -0700, Peter Geoghegan wrote: > On Tue, Jun 8, 2021 at 2:23 PM Justin Pryzby wrote: > > I'm not sure what you're suggesting ? Maybe I should add some NOTICES > > there. > > Here is one approach that might work: Can you check if the assertion > added by the

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Tom Lane
Peter Geoghegan writes: > On Tue, Jun 8, 2021 at 5:27 AM Matthias van de Meent >>> (gdb) p GlobalVisCatalogRels >>> $57 = {definitely_needed = {value = 926025113}, maybe_needed = {value = >>> 926025112}} >> This maybe_needed is older than the OldestXmin, which indeed gives us >> this

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Peter Geoghegan
On Tue, Jun 8, 2021 at 5:27 AM Matthias van de Meent wrote: > > (gdb) p *vacrel > > $56 = {... OldestXmin = 926025113, ...} > > > > (gdb) p GlobalVisCatalogRels > > $57 = {definitely_needed = {value = 926025113}, maybe_needed = {value = > > 926025112}} > > This maybe_needed is older than the

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Peter Geoghegan
On Tue, Jun 8, 2021 at 4:03 AM Justin Pryzby wrote: > postgres=# SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, > t_ctid, t_infomask2, t_infomask, t_hoff, t_bits, t_oid FROM > heap_page_items(pg_read_binary_file('/tmp/dump_block.page')); > lp | lp_off | lp_flags | lp_len |

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Peter Geoghegan
On Tue, Jun 8, 2021 at 2:23 PM Justin Pryzby wrote: > I'm not sure what you're suggesting ? Maybe I should add some NOTICES there. Here is one approach that might work: Can you check if the assertion added by the attached patch fails very quickly with your test case? This does nothing more

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Justin Pryzby
On Tue, Jun 08, 2021 at 01:52:40PM -0700, Peter Geoghegan wrote: > On Tue, Jun 8, 2021 at 12:27 PM Justin Pryzby wrote: > > > They're running this: > > > | PGOPTIONS="--deadlock_timeout=333ms -cstatement-timeout=3600s" psql -c > > > "REINDEX INDEX CONCURRENTLY $i" > > > And if it times out, it

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Peter Geoghegan
On Tue, Jun 8, 2021 at 12:27 PM Justin Pryzby wrote: > > They're running this: > > | PGOPTIONS="--deadlock_timeout=333ms -cstatement-timeout=3600s" psql -c > > "REINDEX INDEX CONCURRENTLY $i" > > And if it times out, it then runs: $PSQL "DROP INDEX CONCURRENTLY $bad" > ... > > $ date -d

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Alvaro Herrera
On 2021-Jun-08, Justin Pryzby wrote: > They're all zero except for: > > $201 = 1 '\001' > $202 = 3 '\003' > $203 = 1 '\001' > > src/include/storage/proc.h-#define PROC_IS_AUTOVACUUM 0x01 > /* is it an autovac worker? */ > src/include/storage/proc.h-#define

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Justin Pryzby
On Tue, Jun 08, 2021 at 12:34:04PM -0500, Justin Pryzby wrote: > On Tue, Jun 08, 2021 at 12:06:02PM -0400, Alvaro Herrera wrote: > > On 2021-Jun-06, Justin Pryzby wrote: > > > > > However, I also found an autovacuum chewing 100% CPU, and it appears the > > > problem is actually because autovacuum

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Justin Pryzby
On Tue, Jun 08, 2021 at 11:40:31AM -0700, Peter Geoghegan wrote: > Reminds me of the other bug that you also reported about a year ago, > Justin - which was never fixed. The one with the duplicate tids from a cic > (see pg 14 open item). > > Is this essentially the same environment as the one

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Justin Pryzby
On Tue, Jun 08, 2021 at 02:01:51PM -0400, Alvaro Herrera wrote: > On 2021-Jun-08, Justin Pryzby wrote: > > > On Tue, Jun 08, 2021 at 12:06:02PM -0400, Alvaro Herrera wrote: > > > On 2021-Jun-06, Justin Pryzby wrote: > > > > > > > However, I also found an autovacuum chewing 100% CPU, and it

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Peter Geoghegan
Reminds me of the other bug that you also reported about a year ago, Justin - which was never fixed. The one with the duplicate tids from a cic (see pg 14 open item). Is this essentially the same environment as the one that led to your other bug report? Peter Geoghegan (Sent from my phone)

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Alvaro Herrera
On 2021-Jun-08, Justin Pryzby wrote: > On Tue, Jun 08, 2021 at 12:06:02PM -0400, Alvaro Herrera wrote: > > On 2021-Jun-06, Justin Pryzby wrote: > > > > > However, I also found an autovacuum chewing 100% CPU, and it appears the > > > problem is actually because autovacuum has locked a page of

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Justin Pryzby
On Tue, Jun 08, 2021 at 12:06:02PM -0400, Alvaro Herrera wrote: > On 2021-Jun-06, Justin Pryzby wrote: > > > However, I also found an autovacuum chewing 100% CPU, and it appears the > > problem is actually because autovacuum has locked a page of pg-statistic, > > and > > every other process then

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Andres Freund
Hi, On 2021-06-08 14:27:14 +0200, Matthias van de Meent wrote: > heap_prune_satisfies_vacuum considers 1 more transaction to be > unvacuumable, and thus indeed won't vacuum that tuple that > HeapTupleSatisfiesVacuum does want to be vacuumed. > > The new open question is now: Why is >
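The disagreement described in this message can be illustrated with a toy model. The sketch below is not PostgreSQL source: it reduces the two checks to plain integer comparisons, ignores XID wraparound and the horizon-refresh step, and caps the retry loop so that it terminates. The horizon values are the ones from the gdb output quoted elsewhere in this thread; the tuple xmax of 926025112 and all Python names are assumptions made for illustration.

```python
# Toy model only: simplified stand-ins for the two visibility checks,
# not PostgreSQL source. XID wraparound and horizon refreshes are ignored.
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlobalVisState:
    definitely_needed: int  # XIDs >= this are certainly still needed
    maybe_needed: int       # XIDs < this are certainly removable


def dead_per_oldest_xmin(xmax: int, oldest_xmin: int) -> bool:
    """Stand-in for the OldestXmin-based test (HeapTupleSatisfiesVacuum)."""
    return xmax < oldest_xmin


def dead_per_global_vis(xmax: int, vis: GlobalVisState) -> bool:
    """Stand-in for the GlobalVis-based test (heap_prune_satisfies_vacuum)."""
    return xmax < vis.maybe_needed


def scan_page(xmax: int, oldest_xmin: int, vis: GlobalVisState,
              max_retries: int = 5) -> Optional[int]:
    """Return the number of retries needed, or None if the cap was hit.

    None corresponds to the hang: pruning keeps the tuple, the second
    check calls it dead, and the page is retried without any change.
    """
    for attempt in range(max_retries):
        if dead_per_global_vis(xmax, vis):
            return attempt  # pruning removed the tuple
        if not dead_per_oldest_xmin(xmax, oldest_xmin):
            return attempt  # both checks agree the tuple stays
        # Disagreement: tuple survived pruning but is considered dead; retry.
    return None


# Horizons from the gdb output in this thread; xmax is hypothetical.
vis = GlobalVisState(definitely_needed=926025113, maybe_needed=926025112)
stuck = scan_page(xmax=926025112, oldest_xmin=926025113, vis=vis)
```

With these inputs the two checks disagree on every pass, so the model only stops because of the artificial retry cap. A tuple whose xmax lies below `maybe_needed`, or at or above `OldestXmin`, is handled on the first pass.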

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Alvaro Herrera
On 2021-Jun-06, Justin Pryzby wrote: > However, I also found an autovacuum chewing 100% CPU, and it appears the > problem is actually because autovacuum has locked a page of pg-statistic, and > every other process then gets stuck waiting in the planner. I checked a few > and found these: >

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Justin Pryzby
On Tue, Jun 08, 2021 at 02:27:14PM +0200, Matthias van de Meent wrote:
> Thanks for the information!

I created an apparently-complete core file by first doing this:
| echo 127 |sudo tee /proc/22591/coredump_filter

*and updated wiki:Developer_FAQ to work with huge pages

I'm planning to kill the

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Matthias van de Meent
On Tue, 8 Jun 2021 at 14:11, Justin Pryzby wrote: > > On Tue, Jun 08, 2021 at 01:54:41PM +0200, Matthias van de Meent wrote: > > On Tue, 8 Jun 2021 at 13:03, Justin Pryzby wrote: > > > > > > On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote: > > > > On Sun, Jun 6, 2021 at 9:35 AM

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Justin Pryzby
On Tue, Jun 08, 2021 at 01:54:41PM +0200, Matthias van de Meent wrote: > On Tue, 8 Jun 2021 at 13:03, Justin Pryzby wrote: > > > > On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote: > > > On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby wrote: > > > > I'll leave the instance running

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Matthias van de Meent
On Tue, 8 Jun 2021 at 13:03, Justin Pryzby wrote: > > On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote: > > On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby wrote: > > > I'll leave the instance running for a little bit before restarting (or > > > kill-9) > > > in case someone

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Justin Pryzby
On Sun, Jun 06, 2021 at 01:59:10PM -0400, Tom Lane wrote: > Matthias van de Meent writes: > > On Sun, 6 Jun 2021 at 18:35, Justin Pryzby wrote: > >> However, I also found an autovacuum chewing 100% CPU, and it appears the > >> problem is actually because autovacuum has locked a page of

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-08 Thread Justin Pryzby
On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote: > On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby wrote: > > I'll leave the instance running for a little bit before restarting (or > > kill-9) > > in case someone requests more info. > > How about dumping the page image out, and

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-06 Thread Justin Pryzby
On Sun, Jun 06, 2021 at 07:26:22PM +0200, Matthias van de Meent wrote: > I think it would be helpful for further debugging if we would have the > state of the all tuples on that page (well, the tuple headers with > their transactionids and their line pointers), as that would help with >

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-06 Thread Peter Geoghegan
On Sun, Jun 6, 2021 at 11:43 AM Justin Pryzby wrote: > Sorry, but I already killed the process to try to follow Matthias' suggestion. > I have a core file from "gcore" but it looks like it's incomplete and the > address is now "out of bounds"... Based on what you said about ending up back in

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-06 Thread Justin Pryzby
On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote: > On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby wrote: > > I'll leave the instance running for a little bit before restarting (or > > kill-9) > > in case someone requests more info. > > How about dumping the page image out, and

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-06 Thread Andres Freund
Hi, On Sun, Jun 6, 2021, at 10:59, Tom Lane wrote: > Matthias van de Meent writes: > > On Sun, 6 Jun 2021 at 18:35, Justin Pryzby wrote: > >> However, I also found an autovacuum chewing 100% CPU, and it appears the > >> problem is actually because autovacuum has locked a page of pg-statistic,

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-06 Thread Peter Geoghegan
On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby wrote: > I'll leave the instance running for a little bit before restarting (or kill-9) > in case someone requests more info. How about dumping the page image out, and sharing it with the list? This procedure should work fine from gdb:

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-06 Thread Tom Lane
Matthias van de Meent writes: > On Sun, 6 Jun 2021 at 18:35, Justin Pryzby wrote: >> However, I also found an autovacuum chewing 100% CPU, and it appears the >> problem is actually because autovacuum has locked a page of pg-statistic, and >> every other process then gets stuck waiting in the

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-06 Thread Matthias van de Meent
On Sun, 6 Jun 2021 at 18:35, Justin Pryzby wrote: > > An internal instance was rejecting connections with "too many clients". > I found a bunch of processes waiting on a futex and I was going to upgrade the > kernel (3.10.0-514) and dismiss the issue. > > However, I also found an autovacuum

pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-06 Thread Justin Pryzby
An internal instance was rejecting connections with "too many clients". I found a bunch of processes waiting on a futex and I was going to upgrade the kernel (3.10.0-514) and dismiss the issue. However, I also found an autovacuum chewing 100% CPU, and it appears the problem is actually because