Re: Fwd: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-17 Thread Robert Haas
On Thu, Dec 17, 2015 at 1:20 PM, Andres Freund  wrote:
> On 2015-12-17 13:08:15 -0500, Robert Haas wrote:
>> On Thu, Dec 17, 2015 at 12:14 PM, Andres Freund  wrote:
>> > On 2015-12-17 09:04:25 -0800, Jeff Janes wrote:
>> >> > But I'm somewhat confused what this has to do with Andres's report.
>> >>
>> >> Doesn't it explain the exact situation he is in, where the oldest
>> >> database is 200 million, but the cluster as a whole is 2 billion?
>> >
>> > There were no crashes, so no, I don't think so.
>>
>> Backing up a step, do we think that the fact that this was running in
>> a shell rather than a screen is relevant somehow?  Or did something
>> happen to this particular cluster totally unrelated to that?
>
> I reran the whole thing on a separate, but very similar, VM. Just
> checked. Same thing happened. This time I have log files and
> everything. No time to investigate right now, but it's reproducible if
> you accept running tests for a week or so.

I don't think I'm going to speculate further until you have time to
investigate more.  It seems clear that autovacuum is going wrong
somehow, but it's extremely unclear why.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Fwd: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-17 Thread Andres Freund
On 2015-12-17 13:08:15 -0500, Robert Haas wrote:
> On Thu, Dec 17, 2015 at 12:14 PM, Andres Freund  wrote:
> > On 2015-12-17 09:04:25 -0800, Jeff Janes wrote:
> >> > But I'm somewhat confused what this has to do with Andres's report.
> >>
> >> Doesn't it explain the exact situation he is in, where the oldest
> >> database is 200 million, but the cluster as a whole is 2 billion?
> >
> > There were no crashes, so no, I don't think so.
> 
> Backing up a step, do we think that the fact that this was running in
> a shell rather than a screen is relevant somehow?  Or did something
> happen to this particular cluster totally unrelated to that?

I reran the whole thing on a separate, but very similar, VM. Just
checked. Same thing happened. This time I have log files and
everything. No time to investigate right now, but it's reproducible if
you accept running tests for a week or so.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Fwd: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-17 Thread Robert Haas
On Thu, Dec 17, 2015 at 12:14 PM, Andres Freund  wrote:
> On 2015-12-17 09:04:25 -0800, Jeff Janes wrote:
>> > But I'm somewhat confused what this has to do with Andres's report.
>>
>> Doesn't it explain the exact situation he is in, where the oldest
>> database is 200 million, but the cluster as a whole is 2 billion?
>
> There were no crashes, so no, I don't think so.

Backing up a step, do we think that the fact that this was running in
a shell rather than a screen is relevant somehow?  Or did something
happen to this particular cluster totally unrelated to that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Fwd: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-17 Thread Andres Freund
On 2015-12-17 09:04:25 -0800, Jeff Janes wrote:
> > But I'm somewhat confused what this has to do with Andres's report.
> 
> Doesn't it explain the exact situation he is in, where the oldest
> database is 200 million, but the cluster as a whole is 2 billion?

There were no crashes, so no, I don't think so.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Fwd: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-17 Thread Jeff Janes
Sorry, accidentally failed to include the list originally, here it is
for the list:

On Dec 16, 2015 9:52 AM, "Robert Haas"  wrote:
>
> On Fri, Dec 11, 2015 at 1:08 PM, Jeff Janes  wrote:
> > Since changes to datfrozenxid are WAL logged at the time they occur,
> > but the supposedly-synchronous change to ShmemVariableCache is not WAL
> > logged until the next checkpoint, a well timed crash can leave you in
> > the state where the system is in a tizzy about wraparound but each
> > database says "Nope, not me".
>
> ShmemVariableCache is an in-memory data structure, so it's going to
> get blown away and rebuilt on a crash.  But I guess it gets rebuild
> from the contents of the most recent checkpoint record, so that
> doesn't actually help.  However, I wonder if it would be safe to for
> the autovacuum launcher to calculate an updated value and call
> SetTransactionIdLimit() to update ShmemVariableCache.

I was wondering if that should happen either at the end of crash
recovery (but I suppose you can't poll pg_database yet at that
point?), or immediately before throwing the "database is not accepting
commands to avoid wraparound data loss" error.

At which point would it make sense for the launcher do it?  I guess
just after it was started up under PMSIGNAL_START_AUTOVAC_LAUNCHER
conditions?

> But I'm somewhat confused what this has to do with Andres's report.

Doesn't it explain the exact situation he is in, where the oldest
database is 200 million, but the cluster as a whole is 2 billion?

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers