Re: [HACKERS] logical replication deranged sender

2017-05-10 Thread Michael Paquier
On Wed, May 10, 2017 at 3:06 AM, Petr Jelinek
 wrote:
> Okay, then it's the same issue Masahiko Sawada reported in nearby
> thread, or at least has same cause.

This report is here:
https://www.postgresql.org/message-id/CAD21AoBYpyqTSw+=ES+xXtRGMPKh=pkiqjnxzknnuae0pst...@mail.gmail.com
It may be useful to keep all discussions on the other thread as that's
the first report.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication deranged sender

2017-05-09 Thread Petr Jelinek
On 09/05/17 19:13, Jeff Janes wrote:
> On Tue, May 9, 2017 at 9:18 AM, Petr Jelinek
> > wrote:
> 
> On 08/05/17 13:47, Petr Jelinek wrote:
> > On 08/05/17 01:17, Jeff Janes wrote:
> >> After dropping a subscription, it says it succeeded and that it dropped
> >> the slot on the publisher.
> >>
> >> But the publisher still has the slot, and a full-tilt process described
> >> by ps as
> >>
> >> postgres: wal sender process jjanes [local] idle in transaction
> >>
> >> Strace shows that this process is doing nothing but opening, reading,
> >> lseek, and closing from pg_wal, and calling sbrk.  It never sends 
> anything.
> >>
> >> This is not how it should work, correct?
> >>
> >
> > No, and I don't see how this happens though, we only report success if
> > the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I
> > don't see anything in source that would explain this. I will need to
> > reproduce it first to see what's happening (wasn't able to do that yet,
> > but it might just need more time since you say it does no happen 
> always).
> >
> 
> Hm I wonder are there any workers left on subscriber when this happens?
> 
> 
> Yes.  using ps, I get this:
> 
> postgres: bgworker: logical replication worker for subscription 16408
> sync 16391
> postgres: bgworker: logical replication worker for subscription 16408
> sync 16388
> 
> They seem to be permanently blocked on a socket to read from the publisher.
> 
> On the publisher side, I think it is very slowly assembling a snapshot. 
> It seems to be adding one xid at a time, and then re-sorting the entire
> list.  Over and over.
> 

Okay, then it's the same issue Masahiko Sawada reported in nearby
thread, or at least has same cause.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication deranged sender

2017-05-09 Thread Jeff Janes
On Tue, May 9, 2017 at 9:18 AM, Petr Jelinek 
wrote:

> On 08/05/17 13:47, Petr Jelinek wrote:
> > On 08/05/17 01:17, Jeff Janes wrote:
> >> After dropping a subscription, it says it succeeded and that it dropped
> >> the slot on the publisher.
> >>
> >> But the publisher still has the slot, and a full-tilt process described
> >> by ps as
> >>
> >> postgres: wal sender process jjanes [local] idle in transaction
> >>
> >> Strace shows that this process is doing nothing but opening, reading,
> >> lseek, and closing from pg_wal, and calling sbrk.  It never sends
> anything.
> >>
> >> This is not how it should work, correct?
> >>
> >
> > No, and I don't see how this happens though, we only report success if
> > the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I
> > don't see anything in source that would explain this. I will need to
> > reproduce it first to see what's happening (wasn't able to do that yet,
> > but it might just need more time since you say it does no happen always).
> >
>
> Hm I wonder are there any workers left on subscriber when this happens?
>

Yes.  using ps, I get this:

postgres: bgworker: logical replication worker for subscription 16408 sync
16391
postgres: bgworker: logical replication worker for subscription 16408 sync
16388

They seem to be permanently blocked on a socket to read from the publisher.

On the publisher side, I think it is very slowly assembling a snapshot.  It
seems to be adding one xid at a time, and then re-sorting the entire list.
Over and over.

Cheers,

Jeff


Re: [HACKERS] logical replication deranged sender

2017-05-09 Thread Petr Jelinek
On 08/05/17 13:47, Petr Jelinek wrote:
> On 08/05/17 01:17, Jeff Janes wrote:
>> After dropping a subscription, it says it succeeded and that it dropped
>> the slot on the publisher.
>>
>> But the publisher still has the slot, and a full-tilt process described
>> by ps as 
>>
>> postgres: wal sender process jjanes [local] idle in transaction
>>
>> Strace shows that this process is doing nothing but opening, reading,
>> lseek, and closing from pg_wal, and calling sbrk.  It never sends anything.
>>
>> This is not how it should work, correct?
>>
> 
> No, and I don't see how this happens though, we only report success if
> the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I
> don't see anything in source that would explain this. I will need to
> reproduce it first to see what's happening (wasn't able to do that yet,
> but it might just need more time since you say it does no happen always).
> 

Hm I wonder are there any workers left on subscriber when this happens?

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication deranged sender

2017-05-08 Thread Petr Jelinek
On 08/05/17 01:17, Jeff Janes wrote:
> After dropping a subscription, it says it succeeded and that it dropped
> the slot on the publisher.
> 
> But the publisher still has the slot, and a full-tilt process described
> by ps as 
> 
> postgres: wal sender process jjanes [local] idle in transaction
> 
> Strace shows that this process is doing nothing but opening, reading,
> lseek, and closing from pg_wal, and calling sbrk.  It never sends anything.
> 
> This is not how it should work, correct?
> 

No, and I don't see how this happens though, we only report success if
the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I
don't see anything in source that would explain this. I will need to
reproduce it first to see what's happening (wasn't able to do that yet,
but it might just need more time since you say it does no happen always).

> 
> The subscribing server's logs shows:
> 
> 21270  2017-05-07 16:04:16.517 PDT LOG:  process 21270 still waiting for
> AccessShareLock on relation 6100 of database 0 after 1000.051 ms
> 21270  2017-05-07 16:04:16.517 PDT DETAIL:  Process holding the lock:
> 25493. Wait queue: 21270.
> 21270  2017-05-07 16:04:16.520 PDT LOG:  process 21270 acquired
> AccessShareLock on relation 6100 of database 0 after 1003.608 ms
> 25493 DROP SUBSCRIPTION 2017-05-07 16:04:16.520 PDT LOG:  duration:
> 1005.176 ms CPU: user: 0.00 s, system: 0.00 s, elapsed: 1.00 s  
> statement: drop subscription foobar ;
> 

Yeah that's normal because DropSubscription holds exclusive lock on
pg_subscription. I am guessing 21270 is the logical replication launcher?

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] logical replication deranged sender

2017-05-07 Thread Jeff Janes
After dropping a subscription, it says it succeeded and that it dropped the
slot on the publisher.

But the publisher still has the slot, and a full-tilt process described by
ps as

postgres: wal sender process jjanes [local] idle in transaction

Strace shows that this process is doing nothing but opening, reading,
lseek, and closing from pg_wal, and calling sbrk.  It never sends anything.

This is not how it should work, correct?

I'm running pgbench -c4 -j8 on the publisher, publishing all pgbench tables.

It doesn't seem to happen every time I create and then drop a subscription,
but it happens pretty often.

The subscribing server's logs shows:

21270  2017-05-07 16:04:16.517 PDT LOG:  process 21270 still waiting for
AccessShareLock on relation 6100 of database 0 after 1000.051 ms
21270  2017-05-07 16:04:16.517 PDT DETAIL:  Process holding the lock:
25493. Wait queue: 21270.
21270  2017-05-07 16:04:16.520 PDT LOG:  process 21270 acquired
AccessShareLock on relation 6100 of database 0 after 1003.608 ms
25493 DROP SUBSCRIPTION 2017-05-07 16:04:16.520 PDT LOG:  duration:
1005.176 ms CPU: user: 0.00 s, system: 0.00 s, elapsed: 1.00 s   statement:
drop subscription foobar ;

The publishing server's logs doesn't show anything relevant.

I'm using 86713de,  I have not tried this in earlier commits.

Cheers,

Jeff