Hi,
On 2021-06-21 05:29:19 -0700, Andres Freund wrote:
> On 2021-06-16 12:12:23 -0700, Andres Freund wrote:
> > Could you share your testcase? I've been working on a series of patches
> > to address this (I'll share in a bit), and I've run quite a few tests,
> > and didn't hit any infinite loops.
On Wed, Jun 16, 2021 at 1:21 PM Peter Geoghegan wrote:
> Oh yeah, I think that I get it now. Tell me if this sounds right to you:
>
> It's not so much that HeapTupleSatisfiesVacuum() "disagrees" with
> heap_prune_satisfies_vacuum() in a way that actually matters to
> VACUUM. While there does seem
Hi,
On 2021-06-16 12:12:23 -0700, Andres Freund wrote:
> Could you share your testcase? I've been working on a series of patches
> to address this (I'll share in a bit), and I've run quite a few tests,
> and didn't hit any infinite loops.
Sorry for not yet doing that. Unfortunately I have an
On Wed, Jun 16, 2021 at 12:22 PM Andres Freund wrote:
> I think it's more complicated than that - "before" isn't a guarantee when the
> horizon can go backwards. Consider the case where a hot_standby_feedback=on
> replica without a slot connects - that can result in the xid horizon going
>
On Wed, 16 Jun 2021 at 21:22, Andres Freund wrote:
>
> Hi,
>
> On 2021-06-16 09:46:07 -0700, Peter Geoghegan wrote:
> > On Wed, Jun 16, 2021 at 9:03 AM Peter Geoghegan wrote:
> > > On Wed, Jun 16, 2021 at 3:59 AM Matthias van de Meent
> > > > So the implicit assumption in heap_page_prune that
>
On Wed, 16 Jun 2021 at 21:12, Andres Freund wrote:
>
> Hi,
>
> On 2021-06-16 12:59:33 +0200, Matthias van de Meent wrote:
> > PFA my adapted patch that fixes this new-ish issue, and does not
> > include the (incorrect) assertions in GlobalVisUpdateApply. I've
> > tested this against the
Hi,
On 2021-06-16 09:46:07 -0700, Peter Geoghegan wrote:
> On Wed, Jun 16, 2021 at 9:03 AM Peter Geoghegan wrote:
> > On Wed, Jun 16, 2021 at 3:59 AM Matthias van de Meent
> > > So the implicit assumption in heap_page_prune that
> > > HeapTupleSatisfiesVacuum(OldestXmin) is always consistent
Hi,
On 2021-06-16 12:59:33 +0200, Matthias van de Meent wrote:
> PFA my adapted patch that fixes this new-ish issue, and does not
> include the (incorrect) assertions in GlobalVisUpdateApply. I've
> tested this against the reproducing case, both with and without the
> fix in
On Wed, Jun 16, 2021 at 9:03 AM Peter Geoghegan wrote:
> On Wed, Jun 16, 2021 at 3:59 AM Matthias van de Meent
> > So the implicit assumption in heap_page_prune that
> > HeapTupleSatisfiesVacuum(OldestXmin) is always consistent with
> > heap_prune_satisfies_vacuum(vacrel) has never been true. In
On Wed, Jun 16, 2021 at 3:59 AM Matthias van de Meent
wrote:
> On Tue, 15 Jun 2021 at 03:22, Andres Freund wrote:
> > > @@ -4032,6 +4039,24 @@ GlobalVisTestShouldUpdate(GlobalVisState *state)
> > > static void
> > > GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
> > > {
> > > +
On Tue, 15 Jun 2021 at 03:22, Andres Freund wrote:
>
> Hi,
>
> > @@ -4032,6 +4039,24 @@ GlobalVisTestShouldUpdate(GlobalVisState *state)
> > static void
> > GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
> > {
> > + /* assert non-decreasing nature of horizons */
>
> Thinking more
Hi,
> @@ -4032,6 +4039,24 @@ GlobalVisTestShouldUpdate(GlobalVisState *state)
> static void
> GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
> {
> + /* assert non-decreasing nature of horizons */
> + Assert(FullTransactionIdFollowsOrEquals(
> +
Hi,
On 2021-06-14 11:53:47 +0200, Matthias van de Meent wrote:
> On Thu, 10 Jun 2021 at 19:43, Peter Geoghegan wrote:
> >
> > On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent
> > wrote:
> > > I see one exit for HEAPTUPLE_DEAD on a potentially recently committed
> > > xvac (?), and we
On Thu, 10 Jun 2021 at 19:43, Peter Geoghegan wrote:
>
> On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent
> wrote:
> > I see one exit for HEAPTUPLE_DEAD on a potentially recently committed
> > xvac (?), and we might also check against recently committed
> > transactions if xmin == xmax,
On Thu, Jun 10, 2021 at 7:38 PM Andres Freund wrote:
> Well, I'd like to add assertions ensuring the retry path is only entered
> when correct - but I feel hesitant about doing so when I can't exercise
> that path reliably in at least some of the situations.
I originally tested the
Hi,
On 2021-06-10 19:15:59 -0700, Peter Geoghegan wrote:
> On Thu, Jun 10, 2021 at 7:00 PM Andres Freund wrote:
> > I'm not convinced - right now we don't exercise this path in tests at
> > all. More assertions won't change that - stuff that can be triggered in
> > production-ish loads doesn't
On Thu, Jun 10, 2021 at 7:00 PM Andres Freund wrote:
> I'm not convinced - right now we don't exercise this path in tests at
> all. More assertions won't change that - stuff that can be triggered in
> production-ish loads doesn't help during development. I do think that
> that makes it far too
Peter Geoghegan writes:
> ISTM that it would be much more useful to focus on adding an assertion
> (or maybe even a "can't happen" error) that fails when the DEAD/goto
> path is reached with a tuple whose xmin wasn't aborted. If that was in
> place then we would have caught the bug in
>
Hi,
On 2021-06-10 18:49:50 -0700, Peter Geoghegan wrote:
> ISTM that it would be much more useful to focus on adding an assertion
> (or maybe even a "can't happen" error) that fails when the DEAD/goto
> path is reached with a tuple whose xmin wasn't aborted. If that was in
> place then we would
On Thu, Jun 10, 2021 at 5:58 PM Andres Freund wrote:
> The problem with writing a test is likely to find a way to halfway
> reliably schedule a transaction abort after pruning, but before the
> tuple-removal loop? Does anybody see a trick to do so?
I asked Alexander about using his pending stop
Hi,
On 2021-06-08 19:18:18 -0500, Justin Pryzby wrote:
> I reproduced the issue on a new/fresh cluster like this:
>
> ./postgres -D data -c autovacuum_naptime=1 -c
> autovacuum_analyze_scale_factor=0.005 -c log_autovacuum_min_duration=-1
> psql -h /tmp postgres -c "CREATE TABLE t(i int); INSERT
On 2021-Jun-10, Peter Geoghegan wrote:
> On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent
> wrote:
> > I see one exit for HEAPTUPLE_DEAD on a potentially recently committed
> > xvac (?), and we might also check against recently committed
> > transactions if xmin == xmax, although
On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent
wrote:
> I see one exit for HEAPTUPLE_DEAD on a potentially recently committed
> xvac (?), and we might also check against recently committed
> transactions if xmin == xmax, although apparently that is not
> implemented right now.
I don't
On Thu, 10 Jun 2021 at 19:07, Peter Geoghegan wrote:
>
> On Thu, Jun 10, 2021 at 9:57 AM Matthias van de Meent
> wrote:
> > > By "matches what we expect", I meant "involves a just-aborted
> > > transaction". We could defensively verify that the inserting
> > > transaction concurrently aborted at
Hi,
On 2021-06-10 17:49:05 +0200, Matthias van de Meent wrote:
> Apart from this, I'm also quite certain that the goto-branch that
> created this infinite loop should have been dead code: In a correctly
> working system, the GlobalVis*Rels should always be at least as strict
> as the
On Thu, Jun 10, 2021 at 9:57 AM Matthias van de Meent
wrote:
> > By "matches what we expect", I meant "involves a just-aborted
> > transaction". We could defensively verify that the inserting
> > transaction concurrently aborted at the point of retrying/calling
> > heap_page_prune() a second
On Thu, 10 Jun 2021 at 18:03, Peter Geoghegan wrote:
>
> On Thu, Jun 10, 2021 at 8:49 AM Matthias van de Meent
> wrote:
> > Could you elaborate on what this "matches what we expect" entails?
> >
> > Apart from this, I'm also quite certain that the goto-branch that
> > created this infinite loop
On Thu, Jun 10, 2021 at 8:49 AM Matthias van de Meent
wrote:
> Could you elaborate on what this "matches what we expect" entails?
>
> Apart from this, I'm also quite certain that the goto-branch that
> created this infinite loop should have been dead code: In a correctly
> working system, the
On Wed, 9 Jun 2021 at 22:45, Peter Geoghegan wrote:
>
> On Wed, Jun 9, 2021 at 11:45 AM Andres Freund wrote:
> > Good find!
>
> +1
>
> > > The attached patch fixes this inconsistency
> >
> > I think I prefer applying the fix and the larger changes separately.
>
> I wonder if it's worth making
On Wed, 9 Jun 2021 at 20:45, Andres Freund wrote:
>
> Specifically, the issue is that it uses the innocuous looking
>
> else if (RelationIsAccessibleInLogicalDecoding(rel))
> return horizons.catalog_oldest_nonremovable;
>
> but that's not sufficient, because
>
> #define
On Wed, Jun 9, 2021 at 11:45 AM Andres Freund wrote:
> Good find!
+1
> > The attached patch fixes this inconsistency
>
> I think I prefer applying the fix and the larger changes separately.
I wonder if it's worth making the goto inside lazy_scan_prune verify
that the heap tuple matches what we
Hi,
Good find!
On 2021-06-09 17:42:34 +0200, Matthias van de Meent wrote:
> I believe that I've found the culprit:
> GetOldestNonRemovableTransactionId(rel) does not use the exact same
> conditions for returning OldestXmin as GlobalVisTestFor(rel) does.
> This results in different minimal XIDs,
On Wed, 9 Jun 2021 at 04:42, Michael Paquier wrote:
>
> On Tue, Jun 08, 2021 at 05:47:28PM -0700, Peter Geoghegan wrote:
> > I don't have time to try this out myself today, but offhand I'm pretty
> > confident that this is sufficient to reproduce the underlying bug
> > itself. And if that's true
On Tue, Jun 08, 2021 at 05:47:28PM -0700, Peter Geoghegan wrote:
> I don't have time to try this out myself today, but offhand I'm pretty
> confident that this is sufficient to reproduce the underlying bug
> itself. And if that's true then I guess it can't have anything to do
> with the
On Wed, Jun 9, 2021 at 2:17 AM Andres Freund wrote:
>
> Hi,
>
> On 2021-06-08 14:27:14 +0200, Matthias van de Meent wrote:
> > heap_prune_satisfies_vacuum considers 1 more transaction to be
> > unvacuumable, and thus indeed won't vacuum that tuple that
> > HeapTupleSatisfiesVacuum does want to be
On Tue, Jun 08, 2021 at 05:44:15PM -0700, Peter Geoghegan wrote:
> On Tue, Jun 8, 2021 at 5:11 PM Tom Lane wrote:
> > I wonder if this is a variant of the problem shown at
> >
> > https://www.postgresql.org/message-id/2591376.1621196582%40sss.pgh.pa.us
> >
> > where maybe_needed was visibly quite
On Tue, Jun 8, 2021 at 5:18 PM Justin Pryzby wrote:
> I reproduced the issue on a new/fresh cluster like this:
>
> ./postgres -D data -c autovacuum_naptime=1 -c
> autovacuum_analyze_scale_factor=0.005 -c log_autovacuum_min_duration=-1
> psql -h /tmp postgres -c "CREATE TABLE t(i int); INSERT
On Tue, Jun 8, 2021 at 5:11 PM Tom Lane wrote:
> I wonder if this is a variant of the problem shown at
>
> https://www.postgresql.org/message-id/2591376.1621196582%40sss.pgh.pa.us
>
> where maybe_needed was visibly quite insane. This value is
> less visibly insane, but it's still wrong. It
On Tue, Jun 08, 2021 at 02:38:37PM -0700, Peter Geoghegan wrote:
> On Tue, Jun 8, 2021 at 2:23 PM Justin Pryzby wrote:
> > I'm not sure what you're suggesting? Maybe I should add some NOTICES
> > there.
>
> Here is one approach that might work: Can you check if the assertion
> added by the
Peter Geoghegan writes:
> On Tue, Jun 8, 2021 at 5:27 AM Matthias van de Meent
>>> (gdb) p GlobalVisCatalogRels
>>> $57 = {definitely_needed = {value = 926025113}, maybe_needed = {value =
>>> 926025112}}
>> This maybe_needed is older than the OldestXmin, which indeed gives us
>> this
On Tue, Jun 8, 2021 at 5:27 AM Matthias van de Meent
wrote:
> > (gdb) p *vacrel
> > $56 = {... OldestXmin = 926025113, ...}
> >
> > (gdb) p GlobalVisCatalogRels
> > $57 = {definitely_needed = {value = 926025113}, maybe_needed = {value =
> > 926025112}}
>
> This maybe_needed is older than the
On Tue, Jun 8, 2021 at 4:03 AM Justin Pryzby wrote:
> postgres=# SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3,
> t_ctid, t_infomask2, t_infomask, t_hoff, t_bits, t_oid FROM
> heap_page_items(pg_read_binary_file('/tmp/dump_block.page'));
> lp | lp_off | lp_flags | lp_len |
On Tue, Jun 8, 2021 at 2:23 PM Justin Pryzby wrote:
> I'm not sure what you're suggesting? Maybe I should add some NOTICES there.
Here is one approach that might work: Can you check if the assertion
added by the attached patch fails very quickly with your test case?
This does nothing more
On Tue, Jun 08, 2021 at 01:52:40PM -0700, Peter Geoghegan wrote:
> On Tue, Jun 8, 2021 at 12:27 PM Justin Pryzby wrote:
> > > They're running this:
> > > | PGOPTIONS="--deadlock_timeout=333ms -cstatement-timeout=3600s" psql -c
> > > "REINDEX INDEX CONCURRENTLY $i"
> > > And if it times out, it
On Tue, Jun 8, 2021 at 12:27 PM Justin Pryzby wrote:
> > They're running this:
> > | PGOPTIONS="--deadlock_timeout=333ms -cstatement-timeout=3600s" psql -c
> > "REINDEX INDEX CONCURRENTLY $i"
> > And if it times out, it then runs: $PSQL "DROP INDEX CONCURRENTLY $bad"
> ...
> > $ date -d
On 2021-Jun-08, Justin Pryzby wrote:
> They're all zero except for:
>
> $201 = 1 '\001'
> $202 = 3 '\003'
> $203 = 1 '\001'
>
> src/include/storage/proc.h-#define PROC_IS_AUTOVACUUM 0x01 /* is it an autovac worker? */
> src/include/storage/proc.h-#define
On Tue, Jun 08, 2021 at 12:34:04PM -0500, Justin Pryzby wrote:
> On Tue, Jun 08, 2021 at 12:06:02PM -0400, Alvaro Herrera wrote:
> > On 2021-Jun-06, Justin Pryzby wrote:
> >
> > > However, I also found an autovacuum chewing 100% CPU, and it appears the
> > > problem is actually because autovacuum
On Tue, Jun 08, 2021 at 11:40:31AM -0700, Peter Geoghegan wrote:
> Reminds me of the other bug that you also reported about a year ago,
> Justin - which was never fixed. The one with the duplicate tids from a cic
> (see pg 14 open item).
>
> Is this essentially the same environment as the one
On Tue, Jun 08, 2021 at 02:01:51PM -0400, Alvaro Herrera wrote:
> On 2021-Jun-08, Justin Pryzby wrote:
>
> > On Tue, Jun 08, 2021 at 12:06:02PM -0400, Alvaro Herrera wrote:
> > > On 2021-Jun-06, Justin Pryzby wrote:
> > >
> > > > However, I also found an autovacuum chewing 100% CPU, and it
Reminds me of the other bug that you also reported about a year ago,
Justin - which was never fixed. The one with the duplicate tids from a cic
(see pg 14 open item).
Is this essentially the same environment as the one that led to your other
bug report?
Peter Geoghegan
(Sent from my phone)
On 2021-Jun-08, Justin Pryzby wrote:
> On Tue, Jun 08, 2021 at 12:06:02PM -0400, Alvaro Herrera wrote:
> > On 2021-Jun-06, Justin Pryzby wrote:
> >
> > > However, I also found an autovacuum chewing 100% CPU, and it appears the
> > > problem is actually because autovacuum has locked a page of
On Tue, Jun 08, 2021 at 12:06:02PM -0400, Alvaro Herrera wrote:
> On 2021-Jun-06, Justin Pryzby wrote:
>
> > However, I also found an autovacuum chewing 100% CPU, and it appears the
> > problem is actually because autovacuum has locked a page of pg_statistic,
> > and
> > every other process then
Hi,
On 2021-06-08 14:27:14 +0200, Matthias van de Meent wrote:
> heap_prune_satisfies_vacuum considers 1 more transaction to be
> unvacuumable, and thus indeed won't vacuum that tuple that
> HeapTupleSatisfiesVacuum does want to be vacuumed.
>
> The new open question is now: Why is
>
On 2021-Jun-06, Justin Pryzby wrote:
> However, I also found an autovacuum chewing 100% CPU, and it appears the
> problem is actually because autovacuum has locked a page of pg_statistic, and
> every other process then gets stuck waiting in the planner. I checked a few
> and found these:
>
On Tue, Jun 08, 2021 at 02:27:14PM +0200, Matthias van de Meent wrote:
> Thanks for the information!
I created an apparently-complete core file by first doing this:
| echo 127 |sudo tee /proc/22591/coredump_filter
*and updated wiki:Developer_FAQ to work with huge pages
I'm planning to kill the
On Tue, 8 Jun 2021 at 14:11, Justin Pryzby wrote:
>
> On Tue, Jun 08, 2021 at 01:54:41PM +0200, Matthias van de Meent wrote:
> > On Tue, 8 Jun 2021 at 13:03, Justin Pryzby wrote:
> > >
> > > On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote:
> > > > On Sun, Jun 6, 2021 at 9:35 AM
On Tue, Jun 08, 2021 at 01:54:41PM +0200, Matthias van de Meent wrote:
> On Tue, 8 Jun 2021 at 13:03, Justin Pryzby wrote:
> >
> > On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote:
> > > On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby wrote:
> > > > I'll leave the instance running
On Tue, 8 Jun 2021 at 13:03, Justin Pryzby wrote:
>
> On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote:
> > On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby wrote:
> > > I'll leave the instance running for a little bit before restarting (or
> > > kill-9)
> > > in case someone
On Sun, Jun 06, 2021 at 01:59:10PM -0400, Tom Lane wrote:
> Matthias van de Meent writes:
> > On Sun, 6 Jun 2021 at 18:35, Justin Pryzby wrote:
> >> However, I also found an autovacuum chewing 100% CPU, and it appears the
> >> problem is actually because autovacuum has locked a page of
On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote:
> On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby wrote:
> > I'll leave the instance running for a little bit before restarting (or
> > kill-9)
> > in case someone requests more info.
>
> How about dumping the page image out, and
On Sun, Jun 06, 2021 at 07:26:22PM +0200, Matthias van de Meent wrote:
> I think it would be helpful for further debugging if we would have the
> state of the all tuples on that page (well, the tuple headers with
> their transactionids and their line pointers), as that would help with
>
On Sun, Jun 6, 2021 at 11:43 AM Justin Pryzby wrote:
> Sorry, but I already killed the process to try to follow Matthias' suggestion.
> I have a core file from "gcore" but it looks like it's incomplete and the
> address is now "out of bounds"...
Based on what you said about ending up back in
On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote:
> On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby wrote:
> > I'll leave the instance running for a little bit before restarting (or
> > kill-9)
> > in case someone requests more info.
>
> How about dumping the page image out, and
Hi,
On Sun, Jun 6, 2021, at 10:59, Tom Lane wrote:
> Matthias van de Meent writes:
> > On Sun, 6 Jun 2021 at 18:35, Justin Pryzby wrote:
> >> However, I also found an autovacuum chewing 100% CPU, and it appears the
> >> problem is actually because autovacuum has locked a page of pg_statistic,
On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby wrote:
> I'll leave the instance running for a little bit before restarting (or kill-9)
> in case someone requests more info.
How about dumping the page image out, and sharing it with the list?
This procedure should work fine from gdb:
Matthias van de Meent writes:
> On Sun, 6 Jun 2021 at 18:35, Justin Pryzby wrote:
>> However, I also found an autovacuum chewing 100% CPU, and it appears the
>> problem is actually because autovacuum has locked a page of pg_statistic, and
>> every other process then gets stuck waiting in the
On Sun, 6 Jun 2021 at 18:35, Justin Pryzby wrote:
>
> An internal instance was rejecting connections with "too many clients".
> I found a bunch of processes waiting on a futex and I was going to upgrade the
> kernel (3.10.0-514) and dismiss the issue.
>
> However, I also found an autovacuum
An internal instance was rejecting connections with "too many clients".
I found a bunch of processes waiting on a futex and I was going to upgrade the
kernel (3.10.0-514) and dismiss the issue.
However, I also found an autovacuum chewing 100% CPU, and it appears the
problem is actually because