Slow plan choice with prepared query

2022-02-23 Thread Mark Saward

Hi,


I've experienced a situation where the planner seems to make a very poor 
choice with a prepared query after the first five executions.  Looking 
at the documentation, I think this happens because it switches from a 
custom plan to a generic one, and doesn't make a good choice for the 
generic one.


Postgres version: running in docker, reports to be 'Debian 14.1-1.pgdg110+1'

If I force it to use a custom plan via 'set local plan_cache_mode = 
force_custom_plan', then I don't notice any slowdown.  Without it, the 
6th and subsequent calls can take anywhere from 1 to 15 seconds each, as 
opposed to about 10ms.
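
The five-then-generic behavior described above comes from PostgreSQL's plan
cache (plancache.c).  This is only a loose Python sketch of the documented
heuristic, not the real code: the first five executions get custom plans, after
which the cached generic plan is reused unless its estimated cost exceeds the
average custom-plan cost.  The function name and cost arguments here are
illustrative.

```python
def choose_custom_plan(num_custom_plans: int,
                       avg_custom_cost: float,
                       generic_cost: float) -> bool:
    """Loose sketch of the plan-cache heuristic: True means replan with
    a custom plan, False means reuse the cached generic plan."""
    if num_custom_plans < 5:
        return True  # always custom-plan the first five executions
    # After five executions, reuse the generic plan unless its estimate
    # is more expensive than the average custom plan seen so far.
    return generic_cost > avg_custom_cost

# The reported slowdown is the case where the generic plan's *estimate*
# looks cheap, but its actual runtime is poor:
assert choose_custom_plan(3, avg_custom_cost=10.0, generic_cost=1000.0)
assert not choose_custom_plan(5, avg_custom_cost=10.0, generic_cost=10.0)
```

The cost comparison explains why force_custom_plan helps: the generic plan's
estimate can look competitive even when, for these particular NULL parameters,
the custom plan would have been far cheaper to run.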


Since I have a workaround, I don't necessarily need assistance, but I'm 
posting this here in case it's of value. Here's a test case that reliably 
reproduces the issue for me:




create table test (
  test_id serial primary key,
  data text
);

insert into test (data) (
  select data
  from (select generate_series(1,1) AS id, md5(random()::text) AS data) x
);


prepare foo_test(text, text, int, text, bool) as
SELECT * FROM (
  SELECT
    *,
    count(*) OVER () > $3 AS has_more,
    row_number() OVER ()
  FROM (
    WITH counted AS (
      SELECT count(*) AS total
      FROM (
        select test_id::text, data
        from test
        where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
          and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')
      ) base
    ), cursor_row AS (
      SELECT base.test_id
      FROM (
        select test_id::text, data
        from test
        where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
          and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')
      ) base
      WHERE base.test_id = $4
    )
    SELECT counted.*, base.*
    FROM (
      select test_id::text, data
      from test
      where (cast($1 as text) is null or lower(data) like '%' || lower($1) || '%')
        and (cast($2 as text) is null or lower(test_id::text) like '%' || lower($2) || '%')
    ) base
    LEFT JOIN cursor_row ON true
    LEFT JOIN counted ON true
    WHERE (($4 IS NULL OR cast($5 as bool) IS NULL)
           OR ((base.test_id) > (cursor_row.test_id)))
    ORDER BY base.test_id ASC
    LIMIT $3 + 1
  ) xy
  LIMIT $3
) z ORDER BY row_number ASC;

\timing

execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);
execute foo_test(null, null, 5, 500, true);

-- This one should be slower:
execute foo_test(null, null, 5, 500, true);





Re: An I/O error occurred while sending to the backend (PG 13.4)

2022-02-23 Thread Justin Pryzby
You originally mailed about an error on the client, and now you've found
corresponding server logs, which suggests a genuine network issue.

Are the postgres clients and server on the same subnet ?  If not, what are the
intermediate routers ?  Is there any NAT happening ?  Do those devices have any
interesting logs that correspond with the server/client connection failures ?

Have you tried enabling TCP keepalives ?  This might help to convince a NAT
device not to forget about your connection.

https://www.postgresql.org/docs/current/runtime-config-connection.html
tcp_keepalives_idle=9
tcp_keepalives_interval=9
tcp_keepalives_count=0
tcp_user_timeout=0 -- You apparently have this set, but it cannot work on Windows, so it just generates noise.

On linux, you can check the keepalive counters in "netstat -not" to be sure
that it's enabled.  A similar switch hopefully exists for windows.
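
The tcp_keepalives_* settings above are server-side; the same kernel options
can also be enabled on the client's socket.  Here is a minimal Python sketch,
using the Linux option names (the values are illustrative, mirroring the
settings suggested above).  A JDBC client like the one in this thread would
instead set the driver's tcpKeepAlive connection parameter.

```python
import socket

def enable_keepalive(sock, idle=9, interval=9, count=5):
    """Turn on TCP keepalives on a client socket.  The fine-grained
    knobs are platform-specific (Linux names shown), so guard each one."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):    # seconds idle before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):     # failed probes before giving up
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock

s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
assert s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```

The point of the periodic probes is exactly what the paragraph above says:
traffic on an otherwise-idle connection keeps a NAT device's mapping alive.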

-- 
Justin




Re: An I/O error occurred while sending to the backend (PG 13.4)

2022-02-23 Thread Justin Pryzby
On Thu, Feb 24, 2022 at 12:47:42AM +, l...@laurent-hasson.com wrote:
> On Sat, Dec 04, 2021 at 05:32:10PM +, l...@laurent-hasson.com wrote:
> >  I have a data warehouse with a fairly complex ETL process that has
> >  been running for years now across PG 9.6, 11.2 and now 13.4 for the
> >  past couple of months. I have been getting the error "An I/O error
> >  occurred while sending to the backend" quite often under load in 13.4
> >  which I never used to get on 11.2. I have applied some tricks, particularly
> >  with the socketTimeout JDBC configuration.

>  It'd be nice to see a network trace for this too.  Using tcpdump or
>  wireshark.  Preferably from the client side.
> 
> Hello Justin,
> 
> It has been ages! The issue has been happening a bit more often recently, as 
> much as once every 10 days or so. As a reminder, the setup is Postgres 13.4 
> on Windows Server with 16 cores and 64GB memory. The scenario where this 
> occurs is an ETL tool called Pentaho Kettle (V7) connecting to the DB for 
> DataWarehouse workloads. The tool is Java-based and connects via JDBC using 
> postgresql-42.2.5.jar. There are no particular settings besides the 
> socketTimeout setting mentioned above.
> 
> Some steps in the workload are lots of quick transactions (dimension tables, 
> for example), but some fact-table calculations, especially large pivots, can 
> make queries run for 40 minutes to over an hour (there are a few of those).
> 
> I caught these in the logs at the time of a failure, but I'm unsure what to 
> make of them:
> 
> 2022-02-21 02:10:43.605 EST [1368] LOG:  unexpected EOF on client connection with an open transaction
> 2022-02-21 02:10:43.605 EST [3304] LOG:  could not receive data from client: An existing connection was forcibly closed by the remote host.
> 2022-02-21 02:10:43.605 EST [3304] LOG:  unexpected EOF on client connection with an open transaction
> 2022-02-21 02:31:38.808 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported

I suggest enabling CSV logging, which has many more columns of data.
Some of them might provide an insight - I'm not sure.
log_destination=csvlog (in addition to whatever else you have set).
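
In postgresql.conf that suggestion might look like the fragment below
(illustrative values; per the docs, logging_collector must be on for csvlog
output to be written to files):

```ini
# postgresql.conf -- add csvlog alongside whatever is already configured
logging_collector = on
log_destination = 'stderr,csvlog'
```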

And the aforementioned network trace.  You could set a capture filter on TCP
SYN/FIN/RST so it's not absurdly large.  From my notes, it might look like this:
(tcp[tcpflags]&(tcp-rst|tcp-syn|tcp-fin)!=0)

-- 
Justin




RE: An I/O error occurred while sending to the backend (PG 13.4)

2022-02-23 Thread l...@laurent-hasson.com



   >  -Original Message-
   >  From: l...@laurent-hasson.com 
   >  Sent: Saturday, December 4, 2021 14:18
   >  To: Justin Pryzby 
   >  Cc: pgsql-performa...@postgresql.org
   >  Subject: RE: An I/O error occurred while sending to the backend (PG 13.4)
   >  
   >  
   > >  -Original Message-
   > >  From: Justin Pryzby 
   > >  Sent: Saturday, December 4, 2021 12:59
   > >  To: l...@laurent-hasson.com
   > >  Cc: pgsql-performa...@postgresql.org
   > >  Subject: Re: An I/O error occurred while sending to the backend (PG
   > >  13.4)
   > >
   > >  On Sat, Dec 04, 2021 at 05:32:10PM +, l...@laurent-hasson.com
   > >  wrote:
   > >  > I have a data warehouse with a fairly complex ETL process that has
   > >  been running for years now across PG 9.6, 11.2 and now 13.4 for the
   > >  past couple of months. I have been getting the error "An I/O error
   > >  occurred while sending to the backend" quite often under load in 13.4
   > >  which I never used to get on 11.2. I have applied some tricks, particularly
   > >  with the socketTimeout JDBC configuration.
   > >  >
   > >  > So my first question is whether anyone has any idea why this is
   > >  happening? My hardware and general PG configuration have not
   > >  changed between 11.2 and 13.4 and I NEVER experienced this on 11.2
   >  in
   > >  about 2y of production.
   > >  >
   > >  > Second, I have one stored procedure that takes a very long time to run
   > >  (40mn more or less), so obviously, I'd need to set socketTimeout to
   > >  something like 1h in order to call it and not timeout. That doesn't seem
   > >  reasonable?
   > >
   > >  Is the DB server local or remote (TCP/IP) to the client?
   > >
   > >  Could you collect the corresponding postgres query logs when this
   > >  happens ?
   > >
   > >  It'd be nice to see a network trace for this too.  Using tcpdump or
   > >  wireshark.
   > >  Preferably from the client side.
   > >
   > >  FWIW, I suspect the JDBC socketTimeout is a bad workaround.
   > >
   > >  --
   > >  Justin
   >  
   >  It's a remote server, but all on a local network. Network performance is I
   >  am sure not the issue. Also, the system is on Windows Server. What are you
   >  expecting to see out of a tcpdump? I'll try to get PG logs on the failing query.
   >  
   >  Thank you,
   >  Laurent.
   >  
   >  
   >  
   >  

Hello Justin,

It has been ages! The issue has been happening a bit more often recently, as 
much as once every 10 days or so. As a reminder, the setup is Postgres 13.4 on 
Windows Server with 16 cores and 64GB memory. The scenario where this occurs is 
an ETL tool called Pentaho Kettle (V7) connecting to the DB for DataWarehouse 
workloads. The tool is Java-based and connects via JDBC using 
postgresql-42.2.5.jar. There are no particular settings besides the 
socketTimeout setting mentioned above.

Some steps in the workload are lots of quick transactions (dimension tables, 
for example), but some fact-table calculations, especially large pivots, can 
make queries run for 40 minutes to over an hour (there are a few of those).

I caught these in the logs at the time of a failure, but I'm unsure what to 
make of them:


2022-02-21 02:08:16.214 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:29.347 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.371 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.463 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.596 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.687 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.786 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.873 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.976 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.050 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.131 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.240 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.906 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.988 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:33.068 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:34.850 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:10:43.596 EST [836] LOG:  could not receive data from client: An existing connection was forcibly closed by the remote host.


2022-02-21 02:10:43.598 EST [8616] LOG:  could not receive data from client: An existing connection was forcibly c