Re: [HACKERS] Logical decoding and walsender timeouts

2016-10-31 Thread Vladimir Gordiychuk
>>> When sending a big message, WalSndWriteData() notices that it's
>>> approaching timeout and tries to send a keepalive request, but the
>>> request just gets buffered behind the remaining output plugin data and
>>> isn't seen by the client until the client has received the rest of the
>>> pending data.
>>
>> Only for individual messages, not the entire transaction though.

>Right.  I initially thought it was the whole tx, but I was mistaken as
>I'd failed to notice that WalSndWriteData() queues a keepalive
>request.

This problem can be resolved by having the client send keepalives
periodically, at an interval shorter than the timeout configured on the
server. For example, if the server is configured with
wal_sender_timeout=60, the client should send a keepalive message every
60/3 = 20 seconds. In that case the server will not send a keepalive
with the reply-required flag set, and will not disconnect the client
either, because it checks for incoming messages while decoding a huge
transaction. I faced a similar problem in pgjdbc and resolved it as
described above.
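
For illustration, here is a minimal client-side sketch in C using libpq
(not the actual pgjdbc fix; the message layout is the Standby Status
Update 'r' message from the streaming replication protocol, while the
helper names and error handling are mine):

#include <stdint.h>
#include <sys/time.h>
#include <libpq-fe.h>

/* PostgreSQL epoch (2000-01-01) relative to the Unix epoch, in seconds */
#define PG_EPOCH_DIFF 946684800LL

/* hypothetical helper: store a uint64 big-endian */
static void
put_u64_be(char *buf, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        buf[i] = (char) (v >> (56 - 8 * i));
}

/*
 * Send a Standby Status Update so the walsender sees client traffic
 * even while it is still draining a huge transaction at us. Call this
 * from the receive loop every wal_sender_timeout/3 seconds.
 */
static int
send_client_keepalive(PGconn *conn, uint64_t recv_lsn)
{
    char        msg[34];
    struct timeval tv;
    uint64_t    now_us;

    gettimeofday(&tv, NULL);
    now_us = ((uint64_t) tv.tv_sec - PG_EPOCH_DIFF) * 1000000 + tv.tv_usec;

    msg[0] = 'r';                    /* Standby Status Update */
    put_u64_be(&msg[1], recv_lsn);   /* written up to this LSN */
    put_u64_be(&msg[9], recv_lsn);   /* flushed (sketch: same) */
    put_u64_be(&msg[17], 0);         /* applied: not tracked here */
    put_u64_be(&msg[25], now_us);    /* client clock */
    msg[33] = 0;                     /* don't ask the server to reply */

    if (PQputCopyData(conn, msg, sizeof(msg)) != 1 || PQflush(conn) != 0)
        return -1;
    return 0;
}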

2016-10-31 16:28 GMT+03:00 Craig Ringer :

> On 31 October 2016 at 16:52, Andres Freund  wrote:
> > Hi,
> >
> > On 2016-10-31 16:34:38 +0800, Craig Ringer wrote:
> >> TL;DR: Logical decoding clients need to generate their own keepalives
> >> and not rely on the server requesting them to prevent timeouts. Or
> >> admins should raise the wal_sender_timeout by a LOT when using logical
> >> decoding on DBs with any big rows.
> >
> > Unconvinced.
>
> Yeah. I've seen enough issues in the wild where we keep timing out and
> restarting over and over until we increase wal_sender_timeout to know
> there's _something_ going on. I am less sure I'm right about what it
> is or how to solve it.
>
> >> When sending a big message, WalSndWriteData() notices that it's
> >> approaching timeout and tries to send a keepalive request, but the
> >> request just gets buffered behind the remaining output plugin data and
> >> isn't seen by the client until the client has received the rest of the
> >> pending data.
> >
> > Only for individual messages, not the entire transaction though.
>
> Right.  I initially thought it was the whole tx, but I was mistaken as
> I'd failed to notice that WalSndWriteData() queues a keepalive
> request.
>
> > Are
> > you sure the problem at hand is that we're sending a keepalive, but it's
> > too late?
>
> No, I'm not sure. I'm trying to identify the cause of an issue I've
> seen in the wild, but never under conditions where it's been possible
> to sit around and debug in a leisurely manner.
>
> I'm trying to set up a TAP test to demonstrate that this happens, but
> I don't think it's going to work without some kind of network
> bandwidth limitation simulation or simulated latency. A local unix
> socket is just too fast for Pg's row size limits.
>
> > It might very well be that the actual issue is that we're
> > never sending keepalives, because the network is fast enough / the tcp
> > window is large enough.  IIRC we only send a keepalive if we're blocked
> > on network IO?
>
> Mm, that's a good point. That might better explain the issues I've
> seen in the wild, since I never found strong evidence that individual
> big rows were involved, but hadn't been able to come up with anything
> else yet.
>
> >> So: We could ask output plugins to deal with this for us, by chunking
> >> up their data in small pieces and calling OutputPluginPrepareWrite()
> >> and OutputPluginWrite() more than once per output plugin callback if
> >> they expect to send a big message. But this pushes the complexity of
> >> splitting up and handling big rows, and big Datums, onto each plugin.
> >> It's awkward to do well and hard to avoid splitting things up
> >> unnecessarily.
> >
> > There's decent reason for doing that independently though, namely that
> > it's a lot more efficient from a memory management POV.
>
> Definitely. Though you're always going to be tossing around ridiculous
> chunks of memory when dealing with big external compressed toasted
> data, unless there are ways to access that progressively that I'm
> unaware of. Hopefully there are.
>
> I'd quite like to extend the bdr/pglogical/logicalrep protocol so that
> in-core logical rep, in some later version, can write a field as 'to
> follow', like we currently mark unchanged toasted datums separately.
> Then send it chunked, after the main row, in follow-up messages. That
> way we keep processing keepalives, we don't allocate preposterous
> amounts of memory, etc.
>
> > I don't think the "unrequested keepalive" approach really solves the
> > problem on a fundamental enough level.
>
> Fair. It feels a bit like flailing in the dark, too.
>
> >> (A separate issue is that we can also time out when doing logical
> >> _replication_ if the downstream side blocks on a lock, since it's not
> >> safe to send on a socket from a signal handler ... )
> >
> > That's strictly speaking not true. write() / sendmsg() are signal-safe
> > functions.  There are good reasons not to do that however, namely that
> > the non-signal-handler code might be busy writing data itself.

Re: [HACKERS] Logical decoding and walsender timeouts

2016-10-31 Thread Craig Ringer
On 31 October 2016 at 16:52, Andres Freund  wrote:
> Hi,
>
> On 2016-10-31 16:34:38 +0800, Craig Ringer wrote:
>> TL;DR: Logical decoding clients need to generate their own keepalives
>> and not rely on the server requesting them to prevent timeouts. Or
>> admins should raise the wal_sender_timeout by a LOT when using logical
>> decoding on DBs with any big rows.
>
> Unconvinced.

Yeah. I've seen enough issues in the wild where we keep timing out and
restarting over and over until we increase wal_sender_timeout to know
there's _something_ going on. I am less sure I'm right about what it
is or how to solve it.

>> When sending a big message, WalSndWriteData() notices that it's
>> approaching timeout and tries to send a keepalive request, but the
>> request just gets buffered behind the remaining output plugin data and
>> isn't seen by the client until the client has received the rest of the
>> pending data.
>
> Only for individual messages, not the entire transaction though.

Right.  I initially thought it was the whole tx, but I was mistaken as
I'd failed to notice that WalSndWriteData() queues a keepalive
request.

> Are
> you sure the problem at hand is that we're sending a keepalive, but it's
> too late?

No, I'm not sure. I'm trying to identify the cause of an issue I've
seen in the wild, but never under conditions where it's been possible
to sit around and debug in a leisurely manner.

I'm trying to set up a TAP test to demonstrate that this happens, but
I don't think it's going to work without some kind of network
bandwidth limitation simulation or simulated latency. A local unix
socket is just too fast for Pg's row size limits.

> It might very well be that the actual issue is that we're
> never sending keepalives, because the network is fast enough / the tcp
> window is large enough.  IIRC we only send a keepalive if we're blocked
> on network IO?

Mm, that's a good point. That might better explain the issues I've
seen in the wild, since I never found strong evidence that individual
big rows were involved, but hadn't been able to come up with anything
else yet.

>> So: We could ask output plugins to deal with this for us, by chunking
>> up their data in small pieces and calling OutputPluginPrepareWrite()
>> and OutputPluginWrite() more than once per output plugin callback if
>> they expect to send a big message. But this pushes the complexity of
>> splitting up and handling big rows, and big Datums, onto each plugin.
>> It's awkward to do well and hard to avoid splitting things up
>> unnecessarily.
>
> There's decent reason for doing that independently though, namely that
> it's a lot more efficient from a memory management POV.

Definitely. Though you're always going to be tossing around ridiculous
chunks of memory when dealing with big external compressed toasted
data, unless there are ways to access that progressively that I'm
unaware of. Hopefully there are.

I'd quite like to extend the bdr/pglogical/logicalrep protocol so that
in-core logical rep, in some later version, can write a field as 'to
follow', like we currently mark unchanged toasted datums separately.
Then send it chunked, after the main row, in follow-up messages. That
way we keep processing keepalives, we don't allocate preposterous
amounts of memory, etc.
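
To make that concrete, a purely hypothetical framing sketch in C
(nothing like this exists in any released protocol; the message types
and fields are invented for illustration):

#include <stdint.h>

/*
 * The row message flags an oversized field as "to follow" instead of
 * inlining it; the datum then arrives in bounded chunks, leaving gaps
 * in which keepalives and feedback can flow.
 */
typedef enum FollowupMsgType
{
    MSG_ROW         = 'R',   /* row, with big fields marked "to follow" */
    MSG_DATUM_CHUNK = 'C',   /* one bounded chunk of a flagged field */
    MSG_DATUM_END   = 'E'    /* flagged field complete */
} FollowupMsgType;

typedef struct DatumChunkHeader
{
    char     msgtype;    /* MSG_DATUM_CHUNK */
    uint32_t row_seq;    /* which pending row the chunk belongs to */
    uint16_t field_no;   /* which column within that row */
    uint32_t chunk_len;  /* payload bytes following this header */
} DatumChunkHeader;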

> I don't think the "unrequested keepalive" approach really solves the
> problem on a fundamental enough level.

Fair. It feels a bit like flailing in the dark, too.

>> (A separate issue is that we can also time out when doing logical
>> _replication_ if the downstream side blocks on a lock, since it's not
>> safe to send on a socket from a signal handler ... )
>
> That's strictly speaking not true. write() / sendmsg() are signal-safe
> functions.  There are good reasons not to do that however, namely that
> the non-signal-handler code might be busy writing data itself.

Huh, ok. And since in pglogical/bdr and as far as I can tell in core
logical rep we don't send anything on the socket while we're calling
in to heap access, the executor, etc, that'd actually be an option. We
could possibly safeguard it with a volatile "socket busy" flag since
we don't do much sending anyway. But I'd need to do my reading on
signal handler safety etc. Still, good to know it's not completely
absurd to do this if the issue comes up.
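
For the record, a minimal single-process sketch of that safeguard
(illustrative only: a real keepalive is a full protocol message, not
one byte, and partial writes would need handling):

#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t socket_busy = 0;
static int repl_socket = -1;   /* set once the connection is up */

/*
 * write() is async-signal-safe, so calling it here is legal; whether
 * it's wise depends on the mainline not being mid-send, hence the flag.
 */
static void
keepalive_sighandler(int signo)
{
    static const char placeholder = 'k';    /* stand-in payload */

    if (!socket_busy && repl_socket >= 0)
        (void) write(repl_socket, &placeholder, 1);
}

static ssize_t
mainline_send(const void *buf, size_t len)
{
    ssize_t rc;

    socket_busy = 1;    /* handler will skip us while we're writing */
    rc = write(repl_socket, buf, len);
    socket_busy = 0;
    return rc;
}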

Thanks very much for the input. I saw your post with proposed changes.
Once I can get the issue simulated reliably I'll test the patch and
see if it solves it, but it looks like it's sensible to just apply it
anyway TBH.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Logical decoding and walsender timeouts

2016-10-31 Thread Andres Freund
Hi,

On 2016-10-31 16:34:38 +0800, Craig Ringer wrote:
> TL;DR: Logical decoding clients need to generate their own keepalives
> and not rely on the server requesting them to prevent timeouts. Or
> admins should raise the wal_sender_timeout by a LOT when using logical
> decoding on DBs with any big rows.

Unconvinced.


> When sending a big message, WalSndWriteData() notices that it's
> approaching timeout and tries to send a keepalive request, but the
> request just gets buffered behind the remaining output plugin data and
> isn't seen by the client until the client has received the rest of the
> pending data.

Only for individual messages, not the entire transaction though.  Are
you sure the problem at hand is that we're sending a keepalive, but it's
too late? It might very well be that the actual issue is that we're
never sending keepalives, because the network is fast enough / the tcp
window is large enough.  IIRC we only send a keepalive if we're blocked
on network IO?


> So: We could ask output plugins to deal with this for us, by chunking
> up their data in small pieces and calling OutputPluginPrepareWrite()
> and OutputPluginWrite() more than once per output plugin callback if
> they expect to send a big message. But this pushes the complexity of
> splitting up and handling big rows, and big Datums, onto each plugin.
> It's awkward to do well and hard to avoid splitting things up
> unnecessarily.

There's decent reason for doing that independently though, namely that
it's a lot more efficient from a memory management POV.


I don't think the "unrequested keepalive" approach really solves the
problem on a fundamental enough level.


> (A separate issue is that we can also time out when doing logical
> _replication_ if the downstream side blocks on a lock, since it's not
> safe to send on a socket from a signal handler ... )

That's strictly speaking not true. write() / sendmsg() are signal-safe
functions.  There are good reasons not to do that however, namely that
the non-signal-handler code might be busy writing data itself.

Greetings,

Andres Freund




[HACKERS] Logical decoding and walsender timeouts

2016-10-31 Thread Craig Ringer
Hi all

I've been investigating some intermittent timeout issues I've seen
with BDR users in the wild and I think I've identified an issue with
how logical decoding sends data and handles feedback.

TL;DR: Logical decoding clients need to generate their own keepalives
and not rely on the server requesting them to prevent timeouts. Or
admins should raise the wal_sender_timeout by a LOT when using logical
decoding on DBs with any big rows.


An output plugin accumulates data in a StringInfo which is then sent
by WalSndWriteData(), which does nonblocking sends, checks for
interrupts, processes client replies, and requests keepalives if
approaching walsender timeout.
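
Roughly, the flow is (a heavily simplified sketch, not the actual
walsender.c source; approaching_timeout() and the wait call are
stand-ins for the real checks):

/* assumes PostgreSQL backend headers; illustrative only */
static void
sketch_WalSndWriteData(StringInfo msgbuf)
{
    /* queue the whole output plugin payload as one CopyData message */
    pq_putmessage_noblock('d', msgbuf->data, msgbuf->len);

    for (;;)
    {
        if (pq_flush_if_writable() != 0)
            proc_exit(0);           /* connection went away */

        if (!pq_is_send_pending())
            break;                  /* everything sent */

        ProcessRepliesIfAny();      /* absorb any client feedback */

        /*
         * Near wal_sender_timeout? Ask the client to reply. But the
         * request is queued BEHIND the rest of the big message, so
         * the client can't see it until all earlier bytes arrive.
         */
        if (approaching_timeout())      /* stand-in */
            WalSndKeepalive(true);

        wait_for_socket_writable();     /* stand-in for the latch wait */
    }
}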

A problem arises when the data buffered for the client is so big that it
takes more than wal_sender_timeout to transmit the whole chunk and get
a reply from the client on the link. A slow link, a big row, or some
badly timed packet loss can easily result in timeout.

Unlike in physical replication, a timeout on a logical decoding
session can be expensive:

* We have to restart logical decoding at restart_lsn, which can be a
LONG time ago if there are any long running tx's, and decode up to the
last confirmed xact;

* We then have to resend the whole xact we were working on when we
timed out, which might be quite big.

so we should do our best to avoid timeouts in logical decoding
sessions. Physical replication chunks data into small pieces and can
restart cheaply, but logical decoding will send a 100MB+ row as a
single message and have to re-read hundreds of GB of WAL and resend
gigabytes of data on the wire to get back where it started after a
timeout.

When sending a big message, WalSndWriteData() notices that it's
approaching timeout and tries to send a keepalive request, but the
request just gets buffered behind the remaining output plugin data and
isn't seen by the client until the client has received the rest of the
pending data. The keepalive requests are useful so long as no single
message takes more than wal_sender_timeout to send, since each reply
defers the timeout and gives us enough time to process the next
message. But they won't prevent a timeout within a single big message.

How to fix this?

We can't inject keepalive requests in the middle of a CopyData message.

After the cancel-message discussion I know better than to suggest
using out-of-band tcp data and SIGURG.

So: We could ask output plugins to deal with this for us, by chunking
up their data in small pieces and calling OutputPluginPrepareWrite()
and OutputPluginWrite() more than once per output plugin callback if
they expect to send a big message. But this pushes the complexity of
splitting up and handling big rows, and big Datums, onto each plugin.
It's awkward to do well and hard to avoid splitting things up
unnecessarily.

We could extend the logical decoding protocol slightly, so that the
CopyData messages that carry output plugin payloads are limited to
some fixed, reasonable size (1MB? 10MB?) where the CopyData message
overhead is insignificant and most rows won't need to be reassembled
from multiple CopyData messages by the downstream. This gives us gaps
in which we can inject keepalives / status updates during big messages
so we get the client replies and process them. This won't fix existing
releases though, and makes clients reassemble the chunks. It also
means clients can't know the full size of a message based on the
CopyData length anymore so it'll force clients that rely on the
CopyData message length to add their own length fields in the payload.
CopyData doesn't have anything we can use as a "continuation message
follows" flag.
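
Sketched in C (the helper names are stand-ins for walsender internals,
and the leading continuation byte is exactly the framing the current
CopyData message lacks):

#define MAX_CHUNK (1024 * 1024)     /* 1MB, per the suggestion above */

static void
send_payload_chunked(const char *data, size_t len)
{
    while (len > 0)
    {
        size_t  n = len > MAX_CHUNK ? MAX_CHUNK : len;
        char    more = (len > n);   /* 1 = continuation follows */

        send_copydata_with_flag(more, data, n);     /* hypothetical */
        data += n;
        len -= n;

        /* the gap this buys us: keepalives and feedback flow here */
        process_replies_and_maybe_keepalive();      /* hypothetical */
    }
}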

We could permit the walsender to reset the keepalive timer whenever it
confirms that it's sent a reasonable chunk of data such that it's
unlikely to just be lurking in the kernel's socket send buffer or
various routers in transit. We don't care that the client received up
to any particular point, only that it's alive and making progress. We
know the client must be receiving and acking it since we're using TCP.
The problem is knowing how much sent data is "enough" to conclude that
the client must have received some of it. The socket send buffer is easy to check and
bounded, but TCP window scaling means there can be a huge amount of
un-acked data in transit. AFAIK we have no visibility into the network
stack's view of what's acked, at least portably. Linux has TCP_INFO
but it's hard to use and nonportable. So we can't assume that even
successfully sending 1GB of data means the client got data 1GB-ago
unless we clamp the congestion window or do other
possibly-performance-affecting things.
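
For what it's worth, the Linux-only probe looks something like this
(shown only to make the idea concrete; as noted it's nonportable and
awkward to build a policy on):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Returns 1 if segments are still un-acked, 0 if not, -1 if unknown. */
static int
peer_has_unacked_data(int sockfd)
{
    struct tcp_info info;
    socklen_t   len = sizeof(info);

    if (getsockopt(sockfd, IPPROTO_TCP, TCP_INFO, &info, &len) != 0)
        return -1;

    /* tcpi_unacked: segments sent but not yet acknowledged */
    return info.tcpi_unacked > 0;
}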

Or we could expect logical decoding downstream clients to send their
own proactive feedback when receiving and processing large messages.
The walsender is already capable of noticing and processing such
replies, just not requesting them when it should. Again this pushes
complexity onto each client, but less so than requiring output plugins
to chunk up their messages. Also, since each