Re: [GENERAL] startup process stuck in recovery

2017-10-11 Thread Tom Lane
Simon Riggs  writes:
> On 11 October 2017 at 08:09, Christophe Pettus  wrote:
>> While it's certainly true that this was an extreme case, it was a real-life 
>> production situation.  The concern here is that in the actual production 
>> situation, the only symptom was that the startup process just stopped.  
>> There were no log messages or any other indication of what was going wrong.

> Which indicates it was making progress, just slowly.

> Tom says "This is pretty easy to diagnose though
> because it spews "out of shared memory" WARNING messages to the
> postmaster log at an astonishing rate"

> These don't seem to match.

Yeah.  I'm still suspicious that Christophe saw some other misbehavior
than the one I found.  We know his server was dealing with < 10K locks,
which doesn't seem like enough to cause any obvious problem from a mere
O(N^2) behavior.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-11 Thread Simon Riggs
On 11 October 2017 at 08:09, Christophe Pettus  wrote:
>
>> On Oct 10, 2017, at 23:54, Simon Riggs  wrote:
>>
>> The use case described seems incredibly
>> unreal and certainly amenable to being rewritten.
>
> While it's certainly true that this was an extreme case, it was a real-life 
> production situation.  The concern here is that in the actual production 
> situation, the only symptom was that the startup process just stopped.  There 
> were no log messages or any other indication of what was going wrong.

Which indicates it was making progress, just slowly.

Tom says "This is pretty easy to diagnose though
because it spews "out of shared memory" WARNING messages to the
postmaster log at an astonishing rate"

These don't seem to match.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-11 Thread Christophe Pettus

> On Oct 10, 2017, at 23:54, Simon Riggs  wrote:
> 
> The use case described seems incredibly
> unreal and certainly amenable to being rewritten.

While it's certainly true that this was an extreme case, it was a real-life 
production situation.  The concern here is that in the actual production 
situation, the only symptom was that the startup process just stopped.  There 
were no log messages or any other indication of what was going wrong.
--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-11 Thread Simon Riggs
On 10 October 2017 at 21:23, Tom Lane  wrote:

> What I see is that, given this particular test case, the backend
> process on the master never holds more than a few locks at a time.
> Each time we abort a subtransaction, the AE lock it was holding
> on the temp table it created gets dropped.  However ... on the
> standby server, pre v10, the replay process attempts to take all
> 12000 of those AE locks at once.  This is not a great plan.

Standby doesn't take locks "at once", they are added just as they
arrive. The locks are held by topxid, so not released at subxid abort,
by design, so they are held concurrently.

> v10 and HEAD avoid the problem because the standby server doesn't
> take locks (any at all, AFAICS).  I suppose this must be a
> consequence of commit 9b013dc238c, though I'm not sure exactly how.

Locks are still taken, but in 9b013dc238c we just avoid trying to
release locks when transactions don't have any.

> Anyway, it's pretty scary that it's so easy to run the replay process
> out of shared memory pre-v10.  I wonder if we should consider
> backpatching that fix.  Any situation where the replay process takes
> more locks concurrently than were ever held on the master is surely
> very bad news.

v10 improves on this specific point because we perform lock release at
subxid abort.

Various cases have been reported over time and this has been improving
steadily in each release.

It isn't "easy" to run the replay process out of memory because
clearly that doesn't happen much, but yes there are some pessimal use
cases that don't work well. The use case described seems incredibly
unreal and certainly amenable to being rewritten.

Backpatching some of those fixes is quite risky, IMHO.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-10 Thread Tom Lane
Christophe Pettus  writes:
> I was able to reproduce this on 9.5.9 with the following:

Hmm ... so I still can't reproduce the specific symptoms Christophe
reports.

What I see is that, given this particular test case, the backend
process on the master never holds more than a few locks at a time.
Each time we abort a subtransaction, the AE lock it was holding
on the temp table it created gets dropped.  However ... on the
standby server, pre v10, the replay process attempts to take all
12000 of those AE locks at once.  This is not a great plan.

On 9.5, for me, as soon as we're out of shared memory
ResolveRecoveryConflictWithLock will go into an infinite loop.
And AFAICS it *is* infinite; it doesn't look to me like it's
making any progress.  This is pretty easy to diagnose though
because it spews "out of shared memory" WARNING messages to the
postmaster log at an astonishing rate.

9.6 hits the OOM condition as well, but manages to get out of it
somehow.  I'm not very clear how, and the log trace doesn't look
like it's real clean: after a bunch of these

WARNING:  out of shared memory
CONTEXT:  xlog redo at 0/C1098AC0 for Standby/LOCK: xid 134024 db 423347 rel 
524106 
WARNING:  out of shared memory
CONTEXT:  xlog redo at 0/C10A97E0 for Standby/LOCK: xid 134024 db 423347 rel 
524151 
WARNING:  out of shared memory
CONTEXT:  xlog redo at 0/C10B36B0 for Standby/LOCK: xid 134024 db 423347 rel 
524181 
WARNING:  out of shared memory
CONTEXT:  xlog redo at 0/C10BD780 for Standby/LOCK: xid 134024 db 423347 rel 
524211 

you get a bunch of these

WARNING:  you don't own a lock of type AccessExclusiveLock
CONTEXT:  xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 
15:05:56.615721-04
LOG:  RecoveryLockList contains entry for lock no longer recorded by lock 
manager: xid 134024 database 423347 relation 526185
CONTEXT:  xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 
15:05:56.615721-04
WARNING:  you don't own a lock of type AccessExclusiveLock
CONTEXT:  xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 
15:05:56.615721-04
LOG:  RecoveryLockList contains entry for lock no longer recorded by lock 
manager: xid 134024 database 423347 relation 526188
CONTEXT:  xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 
15:05:56.615721-04
WARNING:  you don't own a lock of type AccessExclusiveLock
CONTEXT:  xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 
15:05:56.615721-04
LOG:  RecoveryLockList contains entry for lock no longer recorded by lock 
manager: xid 134024 database 423347 relation 526191
CONTEXT:  xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 
15:05:56.615721-04

The important point though is that "a bunch" is a finite number,
whereas 9.5 seems to be just stuck.  I'm not sure how Christophe's
server managed to continue to make progress.

It looks like the 9.6-era patch 37c54863c must have been responsible
for that behavioral change.  There's no indication in the commit message
or the comments that anyone had specifically considered the OOM
scenario, so I think it's just accidental that it's better.

v10 and HEAD avoid the problem because the standby server doesn't
take locks (any at all, AFAICS).  I suppose this must be a
consequence of commit 9b013dc238c, though I'm not sure exactly how.

Anyway, it's pretty scary that it's so easy to run the replay process
out of shared memory pre-v10.  I wonder if we should consider
backpatching that fix.  Any situation where the replay process takes
more locks concurrently than were ever held on the master is surely
very bad news.

A marginally lesser concern is that the replay process does need to have
robust behavior in the face of locktable OOM.  AFAICS whatever it is doing
now is just accidental, and I'm not sure it's correct.  "Doesn't get into
an infinite loop" is not a sufficiently high bar.

And I'm still wondering exactly what Christophe actually saw ...

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-10 Thread Christophe Pettus

> On Oct 10, 2017, at 08:05, Tom Lane  wrote:
> 
> You're right, I was testing on HEAD, so that patch might've obscured
> the problem.  But the code looks like it could still be O(N^2) in
> some cases.  Will look again later.

I was able to reproduce this on 9.5.9 with the following:

DO $$
DECLARE
   i int := 1;
BEGIN
   FOR i IN 1..12000 LOOP
  BEGIN
  PERFORM f();
  i := i / 0;
  EXCEPTION
 WHEN division_by_zero THEN
  END;
   END LOOP;
END;
$$ language plpgsql;

where f() is:

CREATE OR REPLACE FUNCTION f() RETURNS VOID AS $$
BEGIN
  CREATE TEMPORARY TABLE test_table ON COMMIT DROP AS SELECT i FROM 
generate_series(1, 100) i;
END:
$$ language plpgsql;

A couple of observations:

-- In this version, I couldn't do a select * from pg_locks() on the secondary 
without getting an out-of-shared-memory error.
-- If I increased max_locks_per_transaction to 15000, the problem didn't occur, 
even if I bumped up the number of iterations in the first to 2.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-10 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> Hmm, I tried to reproduce this and could not.  I experimented with
>> various permutations of this:

> This problem is probably related to commit 9b013dc238c, which AFAICS is
> only in pg10, not 9.5.

You're right, I was testing on HEAD, so that patch might've obscured
the problem.  But the code looks like it could still be O(N^2) in
some cases.  Will look again later.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-10 Thread Alvaro Herrera
Tom Lane wrote:
> Christophe Pettus  writes:
> > The problem indeed appear to be a very large number of subtransactions, 
> > each one creating a temp table, inside a single transaction.  It's made 
> > worse by one of those transactions finally getting replayed on the 
> > secondary, only to have another one come in right behind it...
> 
> Hmm, I tried to reproduce this and could not.  I experimented with
> various permutations of this:

This problem is probably related to commit 9b013dc238c, which AFAICS is
only in pg10, not 9.5.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Christophe Pettus

> On Oct 9, 2017, at 17:30, Tom Lane  wrote:
> 
> What am I missing to reproduce the problem?

Not sure.  The actual client behavior here is a bit cryptic (not our code, 
incompletely logs).  They might be creating a savepoint before each temp table 
creation, without a release, but I would expect that to exhaust something else 
first.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Christophe Pettus

> On Oct 9, 2017, at 18:21, Peter Geoghegan  wrote:
> What's the hot_standy_feedback setting? How about
> max_standby_archive_delay/max_standby_streaming_delay?

On, 5m, 5m.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Peter Geoghegan
On Mon, Oct 9, 2017 at 12:08 PM, Christophe Pettus  wrote:
> Suggestions on further diagnosis?

What's the hot_standy_feedback setting? How about
max_standby_archive_delay/max_standby_streaming_delay?


-- 
Peter Geoghegan


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Tom Lane
Peter Geoghegan  writes:
> Just a guess, but do you disable autovacuum on your dev machine? (I know I 
> do.)

Nope.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Peter Geoghegan
On Mon, Oct 9, 2017 at 5:30 PM, Tom Lane  wrote:
> and did not see any untoward behavior, at least not till I got to enough
> temp tables to overrun the master's shared lock table, and even then it
> cleaned up fine.  At no point was the standby process consuming anywhere
> near as much CPU as the master's backend.
>
> What am I missing to reproduce the problem?

Just a guess, but do you disable autovacuum on your dev machine? (I know I do.)

It's possible that this is relevant:
https://postgr.es/m/CAB-EU3RawZx8-OzMfvswFf6z+Y7GOZf03TZ=bez+pbqx+a4...@mail.gmail.com

-- 
Peter Geoghegan


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Tom Lane
Christophe Pettus  writes:
> The problem indeed appear to be a very large number of subtransactions, each 
> one creating a temp table, inside a single transaction.  It's made worse by 
> one of those transactions finally getting replayed on the secondary, only to 
> have another one come in right behind it...

Hmm, I tried to reproduce this and could not.  I experimented with
various permutations of this:

create or replace function goslow(n int) returns void language plpgsql as
$$
begin
for i in 1..n loop
  begin
execute 'create temp table myfoo' || i || '(f1 int)';
perform i/0;
  exception when division_by_zero then null;
  end;
end loop;
perform n/0;
end;
$$;

and did not see any untoward behavior, at least not till I got to enough
temp tables to overrun the master's shared lock table, and even then it
cleaned up fine.  At no point was the standby process consuming anywhere
near as much CPU as the master's backend.

What am I missing to reproduce the problem?

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Merlin Moncure
On Mon, Oct 9, 2017 at 6:12 PM, Christophe Pettus  wrote:
>
>> On Oct 9, 2017, at 14:29, Tom Lane  wrote:
>> Hmm.  Creating or dropping a temp table does take AccessExclusiveLock,
>> just as it does for a non-temp table.  In principle we'd not have to
>> transmit those locks to standbys, but I doubt that the WAL code has
>> enough knowledge to filter them out.  So a lot of temp tables and
>> a lot of separate subtransactions could be a nasty combination.
>
> The problem indeed appear to be a very large number of subtransactions, each 
> one creating a temp table, inside a single transaction.  It's made worse by 
> one of those transactions finally getting replayed on the secondary, only to 
> have another one come in right behind it...

FYI, this is pathological coding; it can cause other nastiness such as
exhausting shared memory.  Consider refactoring the code to using a
permanent table (if access is concurrent, you can do tricks like using
'pid' in the key to subdivide the data).

merlin


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Christophe Pettus

> On Oct 9, 2017, at 14:29, Tom Lane  wrote:
> Hmm.  Creating or dropping a temp table does take AccessExclusiveLock,
> just as it does for a non-temp table.  In principle we'd not have to
> transmit those locks to standbys, but I doubt that the WAL code has
> enough knowledge to filter them out.  So a lot of temp tables and
> a lot of separate subtransactions could be a nasty combination.

The problem indeed appear to be a very large number of subtransactions, each 
one creating a temp table, inside a single transaction.  It's made worse by one 
of those transactions finally getting replayed on the secondary, only to have 
another one come in right behind it...
--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Tom Lane
Christophe Pettus  writes:
>> On Oct 9, 2017, at 13:26, Tom Lane  wrote:
>> My bet is that the source server did something that's provoking O(N^2)
>> behavior in the standby server's lock management.  It's hard to say
>> exactly what, but I'm wondering about something like a plpgsql function
>> taking an AccessExclusiveLock inside a loop that repeatedly traps an
>> exception.  Can you correlate where the standby is stuck with what
>> was happening on the source?

> Interestingly, the OIDs for the relations on which the locks on the secondary 
> are held aren't present in pg_class, and they're clustered together.  Could a 
> large number of temporary table creations that are being undone by an abort 
> cause this?

Hmm.  Creating or dropping a temp table does take AccessExclusiveLock,
just as it does for a non-temp table.  In principle we'd not have to
transmit those locks to standbys, but I doubt that the WAL code has
enough knowledge to filter them out.  So a lot of temp tables and
a lot of separate subtransactions could be a nasty combination.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Christophe Pettus

> On Oct 9, 2017, at 13:26, Tom Lane  wrote:
> My bet is that the source server did something that's provoking O(N^2)
> behavior in the standby server's lock management.  It's hard to say
> exactly what, but I'm wondering about something like a plpgsql function
> taking an AccessExclusiveLock inside a loop that repeatedly traps an
> exception.  Can you correlate where the standby is stuck with what
> was happening on the source?

Interestingly, the OIDs for the relations on which the locks on the secondary 
are held aren't present in pg_class, and they're clustered together.  Could a 
large number of temporary table creations that are being undone by an abort 
cause this?

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Christophe Pettus

> On Oct 9, 2017, at 13:26, Tom Lane  wrote:
> 
> Oh, that's really interesting.  So it's not *just* releasing locks but
> also acquiring them, which says that it is making progress of some sort.

It seems to have leveled out now, and is still grinding away.

> Can you correlate where the standby is stuck with what
> was happening on the source?

There are definitely some mysterious functions being called, and I'll dig into 
those.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Tom Lane
Christophe Pettus  writes:
>> On Oct 9, 2017, at 13:01, Tom Lane  wrote:
>> Is that number changing at all?

> Increasing:
>  AccessExclusiveLock |  8810

Oh, that's really interesting.  So it's not *just* releasing locks but
also acquiring them, which says that it is making progress of some sort.

My bet is that the source server did something that's provoking O(N^2)
behavior in the standby server's lock management.  It's hard to say
exactly what, but I'm wondering about something like a plpgsql function
taking an AccessExclusiveLock inside a loop that repeatedly traps an
exception.  Can you correlate where the standby is stuck with what
was happening on the source?

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Christophe Pettus

> On Oct 9, 2017, at 13:01, Tom Lane  wrote:
> Hmm.  Is it possible that the process is replaying the abort of a
> transaction with a lot of subtransactions? 

That's possible, although we're now talking about an hours-long delay at this 
point.

> Is that number changing at all?

Increasing:

# select mode, count(*) from pg_locks where pid=5882 group by mode;
mode | count 
-+---
 ExclusiveLock   | 1
 AccessExclusiveLock |  8810
(2 rows)

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Tom Lane
Christophe Pettus  writes:
> The other observation is that the startup process is holding a *lot* of locks:

Hmm.  Is it possible that the process is replaying the abort of a
transaction with a lot of subtransactions?  It seems like maybe
you could be getting into an O(N^2) situation, though for N in
the low thousands it's hard to see that loop taking so long that
you'd think things were stuck.

> # select mode, count(*) from pg_locks where pid=5882 group by mode;
>  AccessExclusiveLock |  7133

Is that number changing at all?

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Christophe Pettus
On Oct 9, 2017, at 12:18, Christophe Pettus  wrote:
> 
> #0  0x558812f4f1da in ?? ()
> #1  0x558812f4f8cb in StandbyReleaseLockTree ()
> #2  0x558812d718ee in ?? ()
> #3  0x558812d75520 in xact_redo ()
> #4  0x558812d7f713 in StartupXLOG ()
> #5  0x558812f0e262 in StartupProcessMain ()
> #6  0x558812d8d4ea in AuxiliaryProcessMain ()
> #7  0x558812f0b2e9 in ?? ()
> #8  0x558812f0dae7 in PostmasterMain ()
> #9  0x558812d0c402 in main ()

The other observation is that the startup process is holding a *lot* of locks:

(pid 5882 is the startup process)

# select mode, count(*) from pg_locks where pid=5882 group by mode;
mode | count 
-+---
 ExclusiveLock   | 1
 AccessExclusiveLock |  7133
(2 rows)

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Christophe Pettus

> On Oct 9, 2017, at 12:10, Tom Lane  wrote:
> 
> Attach to startup process with gdb, and get a stack trace?

#0  0x558812f4f1da in ?? ()
#1  0x558812f4f8cb in StandbyReleaseLockTree ()
#2  0x558812d718ee in ?? ()
#3  0x558812d75520 in xact_redo ()
#4  0x558812d7f713 in StartupXLOG ()
#5  0x558812f0e262 in StartupProcessMain ()
#6  0x558812d8d4ea in AuxiliaryProcessMain ()
#7  0x558812f0b2e9 in ?? ()
#8  0x558812f0dae7 in PostmasterMain ()
#9  0x558812d0c402 in main ()


--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] startup process stuck in recovery

2017-10-09 Thread Tom Lane
Christophe Pettus  writes:
> We're dealing with a 9.5.5 database with the symptom that, after a certain 
> amount of time after restart, the startup process reaches a certain WAL 
> segment, and stops.  The startup process runs at 100% CPU, with no output 
> from strace.  There are no queries running on the secondary, so it's not a 
> delay for queries.  The WAL receiver continues to operate normally, and we 
> can see sent_location / write_location / flush_location continue to move 
> ahead in parallel, with replay_location stuck in that WAL segment.

> Suggestions on further diagnosis?

Attach to startup process with gdb, and get a stack trace?

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general