Updating large tables without dead tuples

2018-02-23 Thread l...@laurent-hasson.com
Hello

I work with a large and wide table (about 300 million rows, about 50 columns), 
and from time to time, we get business requirements to make some modifications. 
But sometimes, it's just a plain mistake. This happened to us a few weeks 
ago when someone made a mistake and we had to update a single column of a 
large and wide table. Literally, the source data screwed up a zip code and we 
had to patch it on our end.

Anyways... The query we ran was:
update T set source_id = substr(sourceId, 2, 10);
Took about 10h and created 100's of millions of dead tuples, causing another 
couple of hours of vacuum.

This was done during a maintenance window, and that table is read-only except 
when we ETL data to it on a weekly basis, and so I was just wondering why I 
should pay the "bloat" penalty for this type of transaction. Is there a trick 
that could be used here?

More generally, I suspect that the MVCC architecture is so deep that something 
like LOCK TABLE, which would guarantee that there won't be contention, 
couldn't be used as a hint to avoid creating dead tuples? That would be 
quite a performance improvement for this type of work though.


Thank you,
Laurent.


RE: Updating large tables without dead tuples

2018-02-24 Thread l...@laurent-hasson.com
> -Original Message-
> From: Stephen Frost [mailto:sfr...@snowman.net]
> Sent: Friday, February 23, 2018 19:10
> To: l...@laurent-hasson.com
> Cc: pgsql-performance@lists.postgresql.org
> Subject: Re: Updating large tables without dead tuples
> 
> Greetings,
> 
> * l...@laurent-hasson.com (l...@laurent-hasson.com) wrote:
> > This was done during a maintenance window, and that table is read-only
> > except when we ETL data to it on a weekly basis, and so I was just
> > wondering why I should pay the "bloat" penalty for this type of
> > transaction. Is there a trick that could be used here?
> 
> Yes, create a new table and INSERT the data into that table, then swap the new
> table into place as the old table.  Another option, if you don't mind the
> exclusive lock taken on the table, is to dump the data to another table, then
> TRUNCATE the current one and then INSERT into it.
> 
> There are other options too, involving triggers and such to allow updates and
> other changes to be captured during this process, avoiding the need to lock
> the table, but that gets a bit complicated.
> 
> > More generally, I suspect that the MVCC architecture is so deep that
> > something like LOCK TABLE, which would guarantee that there won't be
> > contention, couldn't be used as a hint to avoid creating dead tuples? That
> > would be quite a performance improvement for this type of work though.
> 
> I'm afraid it wouldn't be quite that simple, particularly once you think
> about what happens when you issue a rollback...
> 
> Thanks!
> 
> Stephen
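
A minimal sketch of the two options Stephen describes, assuming a hypothetical 
two-column table T(id, source_id) standing in for the real 50-column table:

    -- Option 1: build a corrected copy, then swap it into place.
    CREATE TABLE t_new (LIKE T INCLUDING ALL);
    INSERT INTO t_new (id, source_id)
    SELECT id, substr(source_id, 2, 10) FROM T;
    BEGIN;
    ALTER TABLE T RENAME TO t_old;
    ALTER TABLE t_new RENAME TO T;
    COMMIT;

    -- Option 2: dump corrected rows aside, truncate, re-insert
    -- (takes an exclusive lock; note TRUNCATE fails if other tables
    -- reference T with foreign keys).
    CREATE TEMP TABLE t_fixed AS
    SELECT id, substr(source_id, 2, 10) AS source_id FROM T;
    TRUNCATE T;
    INSERT INTO T SELECT * FROM t_fixed;

Either way the rows are written fresh, so no dead tuples are left behind.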

[Laurent Hasson] 
This table has several other tables with foreign keys into it... So any 
physical replacement of the table wouldn't work, I believe. I'd have to 
drop the foreign keys across the other tables, do this work, and then 
re-create the foreign keys. The overall time in aggregate may not be much 
shorter than the current implementation.

This table represents hospital visits, off of which hangs a lot of other 
information. The updated column in that Visits table is not part of the key.

As for the rollback, I didn't think about it because in our case, short of a 
db/hardware failure, this operation wouldn't fail... But the risk is there and 
I understand the engine must be prepared for anything and fulfill the ACID 
principles.

With respect to that, I read in many places that an UPDATE is effectively a 
DELETE + INSERT. Does that mean in the rollback logs, there are 2 entries for 
each row updated as a result?

Thank you,
Laurent.



RE: Updating large tables without dead tuples

2018-03-10 Thread l...@laurent-hasson.com


> -Original Message-
> From: Vik Fearing [mailto:vik.fear...@2ndquadrant.com]
> Sent: Friday, March 02, 2018 20:56
> To: l...@laurent-hasson.com; pgsql-performance@lists.postgresql.org
> Cc: Stephen Frost 
> Subject: Re: Updating large tables without dead tuples
> 
> On 02/24/2018 12:27 AM, l...@laurent-hasson.com wrote:
> > Hello
> >
> >
> >
> > I work with a large and wide table (about 300 million rows, about 50
> > columns), and from time to time, we get business requirements to make
> > some modifications. But sometimes, it's just some plain mistake. This
> > has happened to us a few weeks ago where someone made a mistake and we
> > had to update a single column of a large and wide table. Literally,
> > the source data screwed up a zip code and we had to patch on our end.
> >
> >
> >
> > Anyways. Query ran was:
> >
> >     update T set source_id = substr(sourceId, 2, 10);
> >
> > Took about 10h and created 100's of millions of dead tuples, causing
> > another couple of hours of vacuum.
> >
> >
> >
> > This was done during a maintenance window, and that table is read-only
> > except when we ETL data to it on a weekly basis, and so I was just
> > wondering why I should pay the "bloat" penalty for this type of
> > transaction. Is there a trick that could be used here?
> Yes, there is a trick I like to use here, as long as you don't mind locking
> the table (even against reads).
> 
> I'll assume T.source_id is of type text.  If it's not, use whatever the
> actual type is.
> 
> ALTER TABLE T
> ALTER COLUMN source_id TYPE text USING substr(sourceId, 2, 10);
> 
> I copied what you had verbatim; I earnestly hope you don't have two columns,
> source_id and sourceId, in your table.
> 
> This will rewrite the entire table just the same as a VACUUM FULL after your
> UPDATE would.
> 
> Don't forget to VACUUM ANALYZE this table after the operation.  Even though
> there will be no dead rows, you still need to VACUUM it to generate the
> visibility map and you need to ANALYZE it for statistics on your "new" column.
> 
> Foreign keys remain intact with this solution and you don't have double wal
> logging like for an UPDATE.
> --
> Vik Fearing  +33 6 46 75 15 36
> http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
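
Vik's full sequence as a sketch, assuming the column really is source_id and 
is already of type text (the USING expression is what forces the rewrite):

    ALTER TABLE T
        ALTER COLUMN source_id TYPE text USING substr(source_id, 2, 10);
    VACUUM ANALYZE T;   -- rebuild the visibility map and column statistics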

[Laurent Hasson] 
Yes, sorry... only a single column source_id. I understand your idea... Is that 
because a TEXT field (vs a varchar) would be considered TOAST and be treated 
differently?

Thanks,
Laurent.



RE: An I/O error occurred while sending to the backend (PG 13.4)

2022-03-01 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Justin Pryzby 
   >  Sent: Monday, February 28, 2022 17:05
   >  To: l...@laurent-hasson.com
   >  Cc: pgsql-performa...@postgresql.org
   >  Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
   >  
   >  On Mon, Feb 28, 2022 at 09:43:09PM +, l...@laurent-hasson.com
   >  wrote:
   >  >On Wed, Feb 23, 2022 at 07:04:15PM -0600, Justin Pryzby wrote:
   >  >>  > And the aforementioned network trace.  You could set a capture
   >  >>  > filter on TCP SYN|RST so it's not absurdly large.  From my notes,
   >  >>  > it might look like this:
   >  >>  > (tcp[tcpflags]&(tcp-rst|tcp-syn|tcp-fin)!=0)
   >  >>
   >  >>  I'd also add '|| icmp'.  My hunch is that you'll see some ICMP (not
   >  >>  "ping") being sent by an intermediate gateway, resulting in the
   >  >>  connection being reset.
   >  >
   >  > I am so sorry but I do not understand what you are asking me to do. I
   >  > am unfamiliar with these commands. Is this a postgres configuration
   >  > file? Is this something I just do once or something I leave on to
   >  > hopefully catch it when the issue occurs? Is this something to do on
   >  > the DB machine or the ETL machine? FYI:
   >  
   >  It's no problem.
   >  
   >  I suggest that you run wireshark with a capture filter to try to show
   >  *why* the connections are failing.  I think the capture filter might
   >  look like:
   >  
   >  (icmp || (tcp[tcpflags] & (tcp-rst|tcp-syn|tcp-fin)!=0)) && host
   >  10.64.17.211
   >  
   >  With the "host" filtering for the IP address of the *remote* machine.
   >  
   >  You could run that on whichever machine is more convenient and leave it
   >  running for however long it takes for that error to happen.  You'll be
   >  able to save a .pcap file for inspection.  I suppose it'll show either a
   >  TCP RST or an ICMP.  Whichever side sent that is where the problem is.
   >  I still suspect the issue isn't in postgres.
   >  
   >  >   - My ETL machine is on 10.64.17.211
   >  >   - My DB machine is on 10.64.17.210
   >  >   - Both on Windows Server 2012 R2, x64
   >  
   >  These network details make my theory unlikely.
   >  
   >  They're on the same subnet with no intermediate gateways, and
   >  communicate directly via a hub/switch/crossover cable.  If that's true,
   >  then both will have each other's hardware address in ARP after pinging
   >  from one to the other.
   >  
   >  --
   >  Justin

Yes, the machines ARE on the same subnet. They actually even are on the same 
physical rack as per what I have been told. When I run a tracert, I get this:

Tracing route to PRODDB.xxx.int [10.64.17.210] over a maximum of 30 hops:
  1     1 ms    <1 ms    <1 ms  PRODDB.xxx.int [10.64.17.210]
Trace complete.

Now, there is an additional component I think... Storage is on an array and I 
am not getting a clear answer as to where it is 😊 Is it possible that something 
is happening at the storage layer? Could that be reported as a network issue vs 
a storage issue for Postgres?

Also, both machines are actually VMs. I forgot to mention that and not sure if 
that's relevant.

Thank you,
Laurent.




RE: An I/O error occurred while sending to the backend (PG 13.4)

2022-03-03 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Justin Pryzby 
   >  Sent: Tuesday, March 1, 2022 14:27
   >  To: l...@laurent-hasson.com
   >  Cc: pgsql-performa...@postgresql.org
   >  Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
   >  
   >  On Tue, Mar 01, 2022 at 04:28:31PM +, l...@laurent-hasson.com
   >  wrote:
   >  > Now, there is an additional component I think... Storage is on an
   >  > array and I am not getting a clear answer as to where it is 😊 Is it
   >  > possible that something is happening at the storage layer? Could that
   >  > be reported as a network issue vs a storage issue for Postgres?
   >  
   >  No.  If there were an error with storage, it'd be reported as a local
   >  error, and the query would fail, rather than failing with client-server
   >  communication.
   >  
   >  > Also, both machines are actually VMs. I forgot to mention that and not
   >  > sure if that's relevant.
   >  
   >  Are they running on the same hypervisor ?  Is that hyperv ?
   >  Lacking other good hypotheses, that does seem relevant.
   >  
   >  --
   >  Justin

The issue happened again last night. I did implement your recommendations but 
they didn't seem to prevent it:

tcp_keepalives_idle=9   # TCP_KEEPIDLE, in seconds;
# 0 selects the system default
tcp_keepalives_interval=9   # TCP_KEEPINTVL, in seconds;
# 0 selects the system default
tcp_keepalives_count=0  # TCP_KEEPCNT;
# 0 selects the system default
#tcp_user_timeout = 0   # TCP_USER_TIMEOUT, in milliseconds;
# 0 selects the system default
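
As a side note, one way to confirm the settings actually took effect on the 
running server (a sketch; pg_settings also reports where each value came from):

    SELECT name, setting, source
    FROM pg_settings
    WHERE name LIKE 'tcp_keepalives%' OR name = 'tcp_user_timeout';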

On the client application, the exceptions are:

2022/03/03 01:04:56 - Upsert2.0 - ERROR (version 7.1.0.0-12, build 1 from 
2017-05-16 17.18.02 by buildguy) : Unexpected error
2022/03/03 01:04:56 - Upsert2.0 - ERROR (version 7.1.0.0-12, build 1 from 
2017-05-16 17.18.02 by buildguy) : 
org.pentaho.di.core.exception.KettleStepException: 
2022/03/03 01:04:56 - Upsert2.0 - Error in step, asking everyone to stop 
because of:
2022/03/03 01:04:56 - Upsert2.0 - 
2022/03/03 01:04:56 - Upsert2.0 - Error inserting/updating row
2022/03/03 01:04:56 - Upsert2.0 - An I/O error occurred while sending to the 
backend.
2022/03/03 01:04:56 - Upsert2.0 - 
2022/03/03 01:04:56 - Upsert2.0 - 
2022/03/03 01:04:56 - Upsert2.0 -at 
org.pentaho.di.trans.steps.insertupdate.InsertUpdate.processRow(InsertUpdate.java:313)
2022/03/03 01:04:56 - Upsert2.0 -at 
org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
2022/03/03 01:04:56 - Upsert2.0 -at java.lang.Thread.run(Thread.java:745)
2022/03/03 01:04:56 - Upsert2.0 - Caused by: 
org.pentaho.di.core.exception.KettleDatabaseException: 
2022/03/03 01:04:56 - Upsert2.0 - Error inserting/updating row
2022/03/03 01:04:56 - Upsert2.0 - An I/O error occurred while sending to the 
backend.
2022/03/03 01:04:56 - Upsert2.0 - 
2022/03/03 01:04:56 - Upsert2.0 -at 
org.pentaho.di.core.database.Database.insertRow(Database.java:1321)
2022/03/03 01:04:56 - Upsert2.0 -at 
org.pentaho.di.core.database.Database.insertRow(Database.java:1245)
2022/03/03 01:04:56 - Upsert2.0 -at 
org.pentaho.di.core.database.Database.insertRow(Database.java:1233)
2022/03/03 01:04:56 - Upsert2.0 -at 
org.pentaho.di.trans.steps.insertupdate.InsertUpdate.lookupValues(InsertUpdate.java:163)
2022/03/03 01:04:56 - Upsert2.0 -at 
org.pentaho.di.trans.steps.insertupdate.InsertUpdate.processRow(InsertUpdate.java:299)
2022/03/03 01:04:56 - Upsert2.0 -... 2 more
2022/03/03 01:04:56 - Upsert2.0 - Caused by: org.postgresql.util.PSQLException: 
An I/O error occurred while sending to the backend.
2022/03/03 01:04:56 - Upsert2.0 -at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:382)
2022/03/03 01:04:56 - Upsert2.0 -at 
org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490)
2022/03/03 01:04:56 - Upsert2.0 -at 
org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408)
2022/03/03 01:04:56 - Upsert2.0 -at 
org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:166)
2022/03/03 01:04:56 - Upsert2.0 -at 
org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:134)
2022/03/03 01:04:56 - Upsert2.0 -at 
org.pentaho.di.core.database.Database.insertRow(Database.java:1288)
2022/03/03 01:04:56 - Upsert2.0 -... 6 more
2022/03/03 01:04:56 - Upsert2.0 - Caused by: java.net.SocketException: 
Connection reset
2022/03/03 01:04:56 - Upsert2.0 -at 
java.net.SocketInputStream.read(SocketInputStream.java:209)
2022/03/03 01:04:56 - Upsert2.0 -at 
java.net.SocketInputStream.read(SocketInputStream.java:141)
2022/03/03 01:04:56 - Upsert2.0 -at 
org.postgresql.core.Visib

RE: An I/O error occurred while sending to the backend (PG 13.4)

2022-03-03 Thread l...@laurent-hasson.com
I am also starting to feel that the issue being on the database’s side is less 
and less likely. There is something happening in between, or possibly on the 
client.

Ranier, the only reason I was focusing on this at the PG level is that this 
issue started to show up several months ago shortly after I updated to PG13 
from PG11. Had run PG11 for 2 years without ever seeing that issue at all. The 
ETL itself hasn’t changed either, except for upgrading the JDBC driver… But I 
did revert to an older JDBC driver and the issue still occurred eventually.

Of course, other things could have changed in the client’s IT infrastructure 
that I am not aware of, so I am pushing that angle as well more aggressively 
now. I am also pushing for WireShark to monitor the network more closely. Stay 
tuned!

Thank you so much, all, for your support, but at this time I think the ball is 
in my camp and I need to work out some plan.

Thank you,
Laurent.


From: Ranier Vilela 
Sent: Thursday, March 3, 2022 11:57
To: Justin Pryzby 
Cc: l...@laurent-hasson.com; pgsql-performa...@postgresql.org
Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)


On Thu, Mar 3, 2022 at 13:46, Justin Pryzby <pry...@telsasoft.com> wrote:
On Thu, Mar 03, 2022 at 01:33:08PM -0300, Ranier Vilela wrote:
> Sorry, but this is much more on the client side.

The client is reporting the problem, as is the server.
 Have you read the server log?
" 2022-03-03 01:04:40 EST [21228] LOG:  could not receive data from client: An 
existing connection was forcibly closed by the remote host.
2022-03-03 01:04:40 EST [21228] LOG:  unexpected EOF on client connection with 
an open transaction"


> Following the logs, it is understood that the client is dropping the
> connection.

The logs show that the client's connection *was* dropped.
And on the server, the same.
No, the server log shows that the client dropped the connection.


> So most likely the error could be from Pentaho or JDBC.
>
> https://www.geeksforgeeks.org/java-net-socketexception-in-java-with-examples/
> " This *SocketException* occurs on the server-side when the client closed
> the socket connection before the response could be returned over the
> socket."
>
> I suggest moving this thread to the Pentaho or JDBC support.

We don't know the source of the problem.
Yeah, but it is much more likely to be on the client.

  I still doubt it's in postgres,
Everything indicates not.

but I
don't think it's helpful to blame the client, just because the client reported
the problem.  If the server were to disconnect abruptly, I'd expect the client
to report that, too.

Laurent would just have to start the conversation over (and probably collect
the same diagnostic information anyway).  The client projects could blame
postgres with as much rationale as there is for us to blame the client.

Please don't add confusion here.
I just suggested; this is not an order.

regards,
Ranier Vilela


RE: An I/O error occurred while sending to the backend (PG 13.4)

2022-04-13 Thread l...@laurent-hasson.com
From: Ranier Vilela ranier...@gmail.com
>Sent: Thursday, March 03, 2022 13:22
>
>
>You are welcome.
>
>regards,
>Ranier Vilela



Hello all,

After a lot of back and forth, someone in IT informed us that the database VM 
is under a backup schedule using Veeam. Apparently, during the backup window, 
Veeam creates a snapshot and that takes the VM offline for a couple of minutes… 
And of course, they scheduled this right at the busiest time of the day for 
this machine, which is during our nightly ETL. Their backup doesn’t perform 
very well either, which explains why the failure seemed to randomly happen at 
various points during our ETL (which takes about 2h30mn).

They moved the schedule out and the issue has not happened again over the past 
3 weeks. This looks like it was the root cause and would explain (I think) how 
the database and the client simultaneously reported a connection timeout.

Thank you so much for all your help in trying to figure this out and exonerate 
Postgres.

Thank you,
Laurent.




Zero throughput on a query on a very large table.

2019-01-24 Thread l...@laurent-hasson.com
Hello,



We have been stuck for the past week on a query that simply won’t “execute”. We 
have a table with 1.2B rows that took around 14h to load, but a simple select 
takes forever and after 10h, no records are coming through still.



Environment:

 - Table tmp_outpatient_rev with 41 VARCHAR columns (desy_sort_key, 
claim_no, clm_line_num, clm_thru_dt, nch_clm_type_cd, rev_cntr, rev_cntr_dt, …)

 - 1.2B rows (Billion with a ‘B’)

 - A single Unique Index on columns desy_sort_key, claim_no, clm_line_num

 - select pg_size_pretty(pg_relation_size('tmp_outpatient_rev')) --> 215GB

 - Database Server: 64GB, 8 cores/16 threads, HDDs 10K

 - Linux

 - PG 11.1



Query:

 select * from tmp_outpatient_rev order by desy_sort_key, claim_no



Plan:

Gather Merge  (cost=61001461.16..216401602.29 rows=1242732290 width=250)

  Output: desy_sort_key, claim_no, clm_line_num, clm_thru_dt, 
nch_clm_type_cd, rev_cntr, rev_cntr_dt, …

  Workers Planned: 10

  ->  Sort  (cost=61000460.97..61311144.04 rows=124273229 width=250)

Output: desy_sort_key, claim_no, clm_line_num, clm_thru_dt, 
nch_clm_type_cd, rev_cntr, rev_cntr_dt, …

Sort Key: tmp_outpatient_rev.desy_sort_key, 
tmp_outpatient_rev.claim_no

->  Parallel Seq Scan on public.tmp_outpatient_rev  
(cost=0.00..29425910.29 rows=124273229 width=250)

  Output: desy_sort_key, claim_no, clm_line_num, clm_thru_dt, 
nch_clm_type_cd, rev_cntr, rev_cntr_dt, …



Method of access:

- Using Pentaho Kettle (an ETL tool written in Java and using JDBC), we 
simply issue the query and expect records to start streaming in ASAP.

   - Issue was replicated with really basic JDBC code in a Java test program.

- The database doesn't have much other data and the table was loaded from a 
CSV data source with LOAD over something like 14h (average throughput of about 
25K rows/s)

- Settings:

  alter database "CMS_TMP" set seq_page_cost=1;

  alter database "CMS_TMP" set random_page_cost=4;

  alter database "CMS_TMP" set enable_seqscan=true;

  JDBC connection string with no extra params.

  Database has been generally configured properly.



Problem:

- The plan shows a full table scan followed by a sort, and then a gather 
merge. With 1.2B rows, that's crazy to try to sort that 😊

- After 10h, the query is still "silent" and no record is streaming in. IO 
is very high (80-90% disk throughput utilization) on the machine (the sort…).

- I have tried to hack the planner to force an index scan (which would 
avoid the sort/gather steps and should start streaming data right away), in 
particular, enable_seqscan=false or seq_page_cost=2. This had ZERO impact on 
the plan to my surprise.

   - I changed the “order by” to include all 3 columns from the index, or 
created a non-unique index with only the first 2 columns, all to no effect 
whatsoever either.

- The table was written over almost 14h at about 25K row/s and it seems to 
me I should be able to read the data back at least as fast.



Why is a simple index scan not used? Why are all our efforts to try to force 
the use of the index failing?



Any help is very much appreciated as we are really hitting a wall here with 
that table.



Thank you so much.


Laurent Hasson






RE: Zero throughput on a query on a very large table.

2019-01-25 Thread l...@laurent-hasson.com
> -Original Message-
> From: Andreas Kretschmer 
> Sent: Friday, January 25, 2019 00:55
> To: pgsql-performance@lists.postgresql.org
> Subject: Re: Zero throughput on a query on a very large table.
> 
> 
> 
> Am 25.01.19 um 06:20 schrieb l...@laurent-hasson.com:
> >
> > Hello,
> >
> > We have been stuck for the past week on a query that simply won’t
> > “execute”. We have a table with 1.2B rows that took around 14h to
> > load, but a simple select takes forever and after 10h, no records are
> > coming through still.
> >
> > Environment:
> >
> >  - Table tmp_outpatient_rev with 41 VARCHAR columns
> > (desy_sort_key, claim_no, clm_line_num, clm_thru_dt, nch_clm_type_cd,
> > rev_cntr, rev_cntr_dt, …)
> >
> >  - 1.2B rows (Billion with a ‘B’)
> >
> >  - A single Unique Index on columns desy_sort_key, claim_no,
> > clm_line_num
> >
> >  - select pg_size_pretty(pg_relation_size('tmp_outpatient_rev'))
> > --> 215GB
> >
> >  - Database Server: 64GB, 8 cores/16 threads, HDDs 10K
> >
> >  - Linux
> >
> >  - PG 11.1
> >
> > Query:
> >
> >  select * from tmp_outpatient_rev order by desy_sort_key, claim_no
> >
> > Plan:
> >
> >     Gather Merge (cost=61001461.16..216401602.29 rows=1242732290
> > width=250)
> >
> >   Output: desy_sort_key, claim_no, clm_line_num, clm_thru_dt,
> > nch_clm_type_cd, rev_cntr, rev_cntr_dt, …
> >
> >   Workers Planned: 10
> >
> >   ->  Sort (cost=61000460.97..61311144.04 rows=124273229
> > width=250)
> >
> >     Output: desy_sort_key, claim_no, clm_line_num,
> > clm_thru_dt, nch_clm_type_cd, rev_cntr, rev_cntr_dt, …
> >
> >     Sort Key: tmp_outpatient_rev.desy_sort_key,
> > tmp_outpatient_rev.claim_no
> >
> >     ->  Parallel Seq Scan on public.tmp_outpatient_rev
> > (cost=0.00..29425910.29 rows=124273229 width=250)
> >
> >   Output: desy_sort_key, claim_no, clm_line_num,
> > clm_thru_dt, nch_clm_type_cd, rev_cntr, rev_cntr_dt, …
> >
> > Method of access:
> >
> >     - Using Pentaho Kettle (an ETL tool written in Java and using
> > JDBC), we simply issue the query and expect records to start streaming
> > in ASAP.
> >
> >    - Issue was replicated with really basic JDBC code in a Java test
> > program.
> >
> >     - The database doesn't have much other data and the table was
> > loaded from a CSV data source with LOAD over something like 14h
> > (average throughput of about 25K rows/s)
> >
> >     - Settings:
> >
> >   alter database "CMS_TMP" set seq_page_cost=1;
> >
> >   alter database "CMS_TMP" set random_page_cost=4;
> >
> >   alter database "CMS_TMP" set enable_seqscan=true;
> >
> >   JDBC connection string with no extra params.
> >
> >   Database has been generally configured properly.
> >
> > Problem:
> >
> >     - The plan shows a full table scan followed by a sort, and then a
> > gather merge. With 1.2B rows, that's crazy to try to sort that 😊
> >
> >     - After 10h, the query is still "silent" and no record is
> > streaming in. IO is very high (80-90% disk throughput utilization) on
> > the machine (the sort…).
> >
> >     - I have tried to hack the planner to force an index scan (which
> > would avoid the sort/gather steps and should start streaming data
> > right away), in particular, enable_seqscan=false or seq_page_cost=2.
> > This had ZERO impact on the plan to my surprise.
> >
> >    - I changed the “order by” to include all 3 columns from the index,
> > or created a non-unique index with only the first 2 columns, all to no
> > effect whatsoever either.
> >
> >     - The table was written over almost 14h at about 25K row/s and it
> > seems to me I should be able to read the data back at least as fast.
> >
> > Why is a simple index scan not used? Why are all our efforts to try to
> > force the use of the index failing?
> >
> >
> 
> the query isn't that simple: there is no WHERE condition, so PG has to read
> the whole table, and the index is useless. Would it be enough to select only
> the columns covered by the index?
> (Run a VACUUM on the table after loading the data; that can enable an
> index-only scan in this case.)
> 
> 
> 
> 
> Regards, Andreas
> 
> --
> 2ndQuadrant - The PostgreSQL Support Company.
> www.2ndQuadrant.com
> 
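
A sketch of what Andreas suggests, using the table and index columns from the 
original post:

    VACUUM ANALYZE tmp_outpatient_rev;    -- builds the visibility map

    SELECT desy_sort_key, claim_no, clm_line_num
    FROM tmp_outpatient_rev
    ORDER BY desy_sort_key, claim_no;     -- may now use an index-only scan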

Well, even without a WHERE clause, a straight select with an order by matching 
an index... The index scan may perform slightly more slowly overall, but it 
streams data more rapidly... I guess what I am pointing out is that in ETL 
scenarios, enabling better continuous throughput would be better than total 
overall query performance?

Thank you,
Laurent.


 



Re: Zero throughput on a query on a very large table.

2019-01-25 Thread l...@laurent-hasson.com
Sorry, the web outlook client may be "prepending" this message instead of 
appending, as is the custom on this mailing list.


The indices are defined as:

CREATE INDEX i_outprev_ptclaim
ON public.tmp_outpatient_rev USING btree
(desy_sort_key COLLATE pg_catalog."default", claim_no COLLATE 
pg_catalog."default")
TABLESPACE pg_default;

CREATE UNIQUE INDEX ui_outprev_ptclaimline
ON public.tmp_outpatient_rev USING btree
(desy_sort_key COLLATE pg_catalog."default", claim_no COLLATE 
pg_catalog."default", clm_line_num COLLATE pg_catalog."default")
TABLESPACE pg_default;


I am using PGAdmin4 and the client times out, so I don't have the exact timing, 
but each one of those indices completed in under 5h (started at lunch time and 
was done before the end of the afternoon). So when I ran the query and it 
didn't move for about 10h, I figured it might "never end" :).


I'll try changing the random page cost and see. The work_mem param is set to 
128MB... So maybe that's something too? I'll try.


Additionally, do note that we have a second table, similar in structure, with 
180M rows, select pg_size_pretty(pg_relation_size('tmp_inpatient_rev')) --> 
18GB (so it's 10x smaller), but we get 40K rows/s read throughput on that with 
a similar query and index, and the plan does choose an index scan and returns 
the first thousands of rows almost immediately (a few secs).


____
From: Tom Lane 
Sent: Friday, January 25, 2019 1:24:45 AM
To: l...@laurent-hasson.com
Cc: pgsql-performa...@postgresql.org
Subject: Re: Zero throughput on a query on a very large table.

"l...@laurent-hasson.com"  writes:
> Query:
>  select * from tmp_outpatient_rev order by desy_sort_key, claim_no

> Plan:
> [ seqscan-and-sort ... parallelized, but still seqscan-and-sort ]

> - I have tried to hack the planner to force an index scan (which would 
> avoid the sort/gather steps and should start streaming data right away), in 
> particular, enable_seqscan=false or seq_page_cost=2. This had ZERO impact on 
> the plan to my surprise.

If you can't get an indexscan plan despite setting enable_seqscan=false,
that typically means that the planner thinks the index's sort order
does not match what the query is asking for.  I wonder whether you
created the index with nondefault collation, or asc/desc ordering,
or something like that.  There's not enough detail here to diagnose
that.

It should also be noted that what enable_seqscan=false actually does
is to add a cost penalty of 1e10 to seqscan plans.  It's possible
that your table is so large and badly ordered that the estimated
cost differential between seqscan and indexscan is more than 1e10,
so that the planner goes for the seqscan anyway.  You could probably
overcome that by aggressively decreasing random_page_cost (and by
"aggressive" I don't mean 2, I mean 0.2, or maybe 0.02, whatever
it takes).  However, if that's what's happening, I'm worried that
getting what you asked for may not really be the outcome you wanted.
Just because you start to see some data streaming to your app right
away doesn't mean the process is going to complete in less time than
it would if you waited for the sort to happen.

You didn't mention what you have work_mem set to, but a small value
of that would handicap the sort-based plan a lot.  I wonder whether
jacking up work_mem to help the sorts run faster won't end up being
the better idea in the end.

regards, tom lane

PS: On the third hand, you mention having created new indexes on this
table with apparently not a lot of pain, which is a tad surprising
if you don't have the patience to wait for a sort to finish.  How
long did those index builds take?


Re: Zero throughput on a query on a very large table.

2019-01-25 Thread l...@laurent-hasson.com

Since the PGAdmin4 client timed out when creating the index, you piqued my 
interest here and I was wondering if the index creation itself had failed... 
but:

\d tmp_outpatient_rev

Indexes:
"ui_outprev_ptclaimline" UNIQUE, btree (desy_sort_key, claim_no, 
clm_line_num)
"i_outprev_ptclaim" btree (desy_sort_key, claim_no)

So it looks like the indices are fine. I am pursuing some of the other 
recommendations you suggested before.


Thank you,

Laurent.


From: David Rowley 
Sent: Friday, January 25, 2019 1:55:31 AM
To: Tom Lane
Cc: l...@laurent-hasson.com; pgsql-performa...@postgresql.org
Subject: Re: Zero throughput on a query on a very large table.

On Fri, 25 Jan 2019 at 19:24, Tom Lane  wrote:
> PS: On the third hand, you mention having created new indexes on this
> table with apparently not a lot of pain, which is a tad surprising
> if you don't have the patience to wait for a sort to finish.  How
> long did those index builds take?

It would certainly be good to look at psql's \d tmp_outpatient_rev
output to ensure that the index is not marked as INVALID.


--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Zero throughput on a query on a very large table.

2019-01-25 Thread l...@laurent-hasson.com
Agreed on the 2 indices. I only added the second non-unique index to test the 
hypothesis that I was doing an order-by on col1, col2 when the original unique 
index was on col1, col2, col3...


Also, the original statement I implemented did not have all of that. This is 
the normalized SQL that Postgres now gives when looking at the indices. 
Collation for the DB is "en_US.UTF-8" and that's used for the defaults, I 
suspect?


From: Tom Lane 
Sent: Friday, January 25, 2019 1:10:55 PM
To: l...@laurent-hasson.com
Cc: pgsql-performa...@postgresql.org
Subject: Re: Zero throughput on a query on a very large table.

"l...@laurent-hasson.com"  writes:
> The indices are defined as:

> CREATE INDEX i_outprev_ptclaim
> ON public.tmp_outpatient_rev USING btree
> (desy_sort_key COLLATE pg_catalog."default", claim_no COLLATE 
> pg_catalog."default")
> TABLESPACE pg_default;

> CREATE UNIQUE INDEX ui_outprev_ptclaimline
> ON public.tmp_outpatient_rev USING btree
> (desy_sort_key COLLATE pg_catalog."default", claim_no COLLATE 
> pg_catalog."default", clm_line_num COLLATE pg_catalog."default")
> TABLESPACE pg_default;

I'm a bit suspicious of those explicit COLLATE clauses; seems like maybe
they could be accounting for not matching to the query-requested order.
Perhaps they're different from the collations specified on the underlying
table columns?

Also, it seems unlikely that it's worth the maintenance work to keep
both of these indexes, though that's not related to your immediate
problem.

regards, tom lane


Re: Zero throughput on a query on a very large table.

2019-01-25 Thread l...@laurent-hasson.com
Sorry :) I see those when I look at the "SQL" tab in PGAdmin after selecting 
the index in the schema browser. But you are right that \d doesn't show that.


From: Tom Lane 
Sent: Friday, January 25, 2019 1:34:01 PM
To: l...@laurent-hasson.com
Cc: pgsql-performa...@postgresql.org
Subject: Re: Zero throughput on a query on a very large table.

"l...@laurent-hasson.com"  writes:
> Also, the original statement i implemented did not have all of that. This is 
> the normalized SQL that Postgres now gives when looking at the indices.

[ squint... ]  What do you mean exactly by "Postgres gives that"?
I don't see any redundant COLLATE clauses in e.g. psql \d.

regards, tom lane


Re: Zero throughput on a query on a very large table.

2019-01-25 Thread l...@laurent-hasson.com
OK... I think we may have cracked this.


First, do you think that 128MB work_mem is OK? We have a 64GB machine and 
expect fewer than 100 connections. This is really an ETL workload 
environment at this time.


Second, here is what i found and what messed us up.

select current_setting('random_page_cost'); --> 4

alter database "CMS_TMP" set random_page_cost=0.0001;
select current_setting('random_page_cost'); --> 4 

I also tried:
select current_setting('random_page_cost'); --> 4
select set_config('random_page_cost', '0.01', true);
select current_setting('random_page_cost'); --> 4 


Is there something that is happening that is causing those settings to not 
stick? I then tried:


select current_setting('random_page_cost'); --> 4
select set_config('random_page_cost', '0.01', false); -- false now, 
i.e., global
select current_setting('random_page_cost'); --> 0.01 


So I think we just spent 4 days on that issue. I then did

select set_config('enable_seqscan', 'off', false);
And the plan is now using an index scan, and we are getting 12K rows/s in 
throughput immediately!!! 😊

So I guess my final question is that I really want to affect only that one 
query, and I seem unable to change the settings used by the planner just for 
that one transaction. I have to change them globally, which I would prefer 
not to do. Any help here?


Thanks,

Laurent.




Re: Zero throughput on a query on a very large table.

2019-01-25 Thread l...@laurent-hasson.com
Just a correction from my previous message regarding the throughput we get.


On that one table with 1.2B row, the plan through the index scan delivers 
actually 50K rows/s in read speed to the application, almost immediately. It 
would go through the entire table in under 7h vs the other approach which still 
didn't deliver any data after 10h.


We do additional joins and logic and our final throughput is about 12K/s (what 
I quoted previously), but this is a case where clearly the index_scan plan 
delivers vastly better performance than the table_seq_scan+sort plan.


Any insight here?


Thank you,

Laurent.




Re: Zero throughput on a query on a very large table.

2019-01-25 Thread l...@laurent-hasson.com
Correct, but in the Java code, it's multiple statements in a single 
transaction, so it should stick. Not sure if something else stupid is going on.


Good to know about the ALTER DATABASE effect. I didn't realize that.


Thanks a billion.


Laurent.


From: Tom Lane 
Sent: Friday, January 25, 2019 3:04:37 PM
To: l...@laurent-hasson.com
Cc: pgsql-performa...@postgresql.org
Subject: Re: Zero throughput on a query on a very large table.

"l...@laurent-hasson.com"  writes:
> Second, here is what i found and what messed us up.

> select current_setting('random_page_cost'); --> 4
> alter database "CMS_TMP" set random_page_cost=0.0001;
> select current_setting('random_page_cost'); --> 4 

ALTER DATABASE only affects subsequently-started sessions.

> I also tried:
> select current_setting('random_page_cost'); --> 4
> select set_config('random_page_cost', '0.01', true);
> select current_setting('random_page_cost'); --> 4 

That "true" means "local to the current transaction", which is
just the one statement if you don't have a BEGIN.

regards, tom lane
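
Putting Tom's two points together, a sketch of scoping the planner overrides 
to a single query, so nothing has to change globally:

    BEGIN;
    SET LOCAL enable_seqscan = off;      -- lasts only until COMMIT/ROLLBACK
    SET LOCAL random_page_cost = 0.01;
    SELECT * FROM tmp_outpatient_rev ORDER BY desy_sort_key, claim_no;
    COMMIT;

set_config('enable_seqscan', 'off', true) behaves the same way when it runs 
inside an explicitly opened transaction.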


Big performance slowdown from 11.2 to 13.3

2021-07-21 Thread l...@laurent-hasson.com

Hello,

We have a data warehouse working on Postgres V11.2. We have a query that is 
pretty beefy that has been taking under 5mn to run consistently every day for 
about 2 years as part of a data warehouse ETL process. It's a pivot over 55 
values on a table with some 15M rows. The total table size is over 2GB 
(table+indices+other).


CREATE TABLE assessmenticcqa_raw
(
iccqar_iccassmt_fk integer NOT NULL, -- foreign key to assessment
iccqar_ques_code character varying(255) COLLATE pg_catalog."default" NOT 
NULL, -- question code
iccqar_ans_val character varying(255) COLLATE pg_catalog."default" NOT 
NULL,   -- answer value
"lastUpdated" timestamp with time zone NOT NULL DEFAULT now(),
CONSTRAINT fk_assessmenticcqa_raw_assessment FOREIGN KEY 
(iccqar_iccassmt_fk)
REFERENCES assessmenticc_fact (iccassmt_pk) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE RESTRICT
)

TABLESPACE pg_default;

CREATE UNIQUE INDEX assessmenticcqa_raw_idx_iccqar_assmt_ques
ON assessmenticcqa_raw USING btree
(iccqar_iccassmt_fk ASC NULLS LAST, iccqar_ques_code COLLATE 
pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;

CREATE INDEX assessmenticcqa_raw_idx_iccqar_lastupdated
ON assessmenticcqa_raw USING btree
("lastUpdated" ASC NULLS LAST)
TABLESPACE pg_default;


The query that does the pivot is:


WITH t AS (
 SELECT assessmenticcqa_raw.iccqar_iccassmt_fk AS iccqa_iccassmt_fk,
assessmenticcqa_raw.iccqar_ques_code,
max(assessmenticcqa_raw.iccqar_ans_val::text) AS iccqar_ans_val
   FROM assessmenticcqa_raw
  WHERE assessmenticcqa_raw.iccqar_ques_code::text = ANY 
(ARRAY['DEBRIDEMENT DATE'::character varying::text
  , 
'DEBRIDEMENT THIS VISIT'::character varying::text
  , 
'DEBRIDEMENT TYPE'::character varying::text
  , 'DEPTH 
(CM)'::character varying::text
  , 'DEPTH 
DESCRIPTION'::character varying::text
  , ... 55 
total columns to pivot
 ])
  GROUP BY assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
)
SELECT t.iccqa_iccassmt_fk,
max(t.iccqar_ans_val) AS iccqar_ans_val,
tilda.todate(max(t.iccqar_ans_val) FILTER (WHERE t.iccqar_ques_code::text = 
'DEBRIDEMENT DATE'::text)::character varying, NULL::date) AS 
"iccqa_DEBRIDEMENT_DATE",
max(t.iccqar_ans_val) FILTER (WHERE t.iccqar_ques_code::text = 'DEBRIDEMENT 
THIS VISIT'::text) AS "iccqa_DEBRIDEMENT_THIS_VISIT",
max(t.iccqar_ans_val) FILTER (WHERE t.iccqar_ques_code::text = 'DEBRIDEMENT 
TYPE'::text) AS "iccqa_DEBRIDEMENT_TYPE",
tilda.tofloat(max(t.iccqar_ans_val) FILTER (WHERE t.iccqar_ques_code::text 
= 'DEPTH (CM)'::text)::character varying, NULL::real) AS "iccqa_DEPTH_CM",
max(t.iccqar_ans_val) FILTER (WHERE t.iccqar_ques_code::text = 'DEPTH 
DESCRIPTION'::text) AS "iccqa_DEPTH_DESCRIPTION",
... 55 total columns being pivotted
   FROM t
  GROUP BY t.iccqa_iccassmt_fk;



This query has been working flawlessly without so much as a glitch every day 
for the last 2 years or so with of course an increasing amount of data every 
day (the data grows at about 15-20 thousand records per day). I know the query 
is not incremental but at under 5mn, it's simple and works well and can handle 
inconsistent updates on the data source we use which is pretty dirty.

The problem I am facing is that we are trying to move to Postgres V13.3 and 
this query (and several others like it) is now taking 10x longer (3,000 seconds 
vs 300 seconds) which makes it completely unacceptable. I created a V13 
instance following standard practices with pg_upgrade. I have V11 and V13 
working side by side on the exact same hardware: the VM is an 8-core (16 
threads) 64GB windows server 2012 R2 machine with SSD storage. I have vacuumed 
both V11 and V13 databases full freeze analyze. The V13 is an exact backup of 
the V11 database content-wise. The postgres.conf is the same too and hasn't 
been touched in years:


  "effective_cache_size": "52GB",
  "from_collapse_limit": "24",
  "jit": "off",
  "jit_above_cost": "2e+08",
  "jit_inline_above_cost": "5e+08",
  "jit_optimize_above_cost": "5e+08",
  "join_collapse_limit": "24",
  "max_parallel_workers": "20",
  "max_parallel_workers_per_gather": "8",
  "random_page_cost": "1.1",
  "temp_buffers": "4GB",
  "work_mem": "384MB"


I have done all my testing with one database on while the other was off 
(shutting down the DB) to make sure there wasn't any weird interaction. I 
have read some articles about major changes between

RE: Big performance slowdown from 11.2 to 13.3

2021-07-21 Thread l...@laurent-hasson.com


-Original Message-
From: Justin Pryzby  
Sent: Wednesday, July 21, 2021 15:15
To: l...@laurent-hasson.com
Cc: pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Wed, Jul 21, 2021 at 06:50:58PM +, l...@laurent-hasson.com wrote:
> The plans are pretty much identical too. I checked line by line and couldn't 
> see anything much different (note that I have a view over this query). Here 
> is the V13 version of the plan:

> I am out of my wits as to what is causing such a massive slowdown and how I 
> could fix it.
> 
> Any idea out there?

Could you send the "explain (analyze,buffers,settings) for query on the v11 and
v13 instances ?

Or a link to the execution plan pasted into explain.depesz.com.

https://wiki.postgresql.org/wiki/Slow_Query_Questions#EXPLAIN_.28ANALYZE.2C_BUFFERS.29.2C_not_just_EXPLAIN

It might be good to check using a copy of your data that there's no regression 
between 11.2 and 11.12.

--
Justin





My apologies... I thought this was what I had attached in my original email 
from PGADMIN. In any case, I reran from the command line and here are the two 
plans.


V11.2 explain (analyze,buffers,COSTS,TIMING)

HashAggregate  (cost=1758361.62..1758372.62 rows=200 width=1260) (actual 
time=80545.907..161176.867 rows=720950 loops=1)
  Group Key: t.iccqa_iccassmt_fk
  Buffers: shared hit=8 read=170093 written=23, temp written=82961
  CTE t
->  HashAggregate  (cost=1338668.50..1352428.93 rows=1376043 width=56) 
(actual time=23669.075..32038.977 rows=13821646 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
  Buffers: shared read=170084 written=23
  ->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1236517.01 
rows=13620198 width=38) (actual time=0.081..10525.487 rows=13821646 loops=1)
Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN 
PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK","WIDTH (CM)","WOUND PAIN LEVEL, WHERE 0 = \"NO PAIN\" AND 10 = 
\"WORST POSSIBLE PAIN\""}'::text[]))
Rows Removed by Filter: 169940
Buffers: shared read=170084 written=23
  ->  CTE Scan on t  (cost=0.00..27520.86 rows=1376043 width=552) (actual 
time=23669.081..39393.726 rows=13821646 loops=1)
Buffers: shared read=170084 written=23, temp written=82961
Planning Time: 6.160 ms
Execution Time: 161,942.304 ms


V13.3  explain (analyze,buffers,COSTS,TIMING,SETTINGS)
==
HashAggregate  (cost=1774568.21..1774579.21 rows=200 width=1260) (actual 
time=81053.437..1699800.741 rows=722853 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 5  Memory Usage: 284737kB  Disk Usage: 60kB
  Buffers: shared hit=20 read=169851, temp read=185258 written=305014
  ->  HashAggregate  

RE: Big performance slowdown from 11.2 to 13.3

2021-07-21 Thread l...@laurent-hasson.com

From: Michael Lewis  
Sent: Wednesday, July 21, 2021 18:12
To: l...@laurent-hasson.com
Cc: pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

I'm not seeing the value of the CTE. Why not access assessmenticcqa_raw 
directly in the main query and only do GROUP BY once? Do you have many values 
in iccqar_ques_code which are not used in this query?




Yes, there are close to 600 different values, and we are picking up only a 
small number. And by the way, this is a classic case where the query could be 
folded as a sub-select in a join clause, and I tried this as well, with the 
same results.

Thank you,
Laurent.



RE: Big performance slowdown from 11.2 to 13.3

2021-07-21 Thread l...@laurent-hasson.com


-Original Message-
From: Peter Geoghegan  
Sent: Wednesday, July 21, 2021 19:34
To: l...@laurent-hasson.com
Cc: Justin Pryzby ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Wed, Jul 21, 2021 at 4:19 PM l...@laurent-hasson.com 
 wrote:
> As you can see, the V13.3 execution is about 10x slower.
>
> It may be hard for me to create a whole copy of the database on 11.12 and 
> check that environment by itself. I'd want to do it on the same machine to 
> control variables, and I don't have much extra disk space at the moment.

I imagine that this has something to do with the fact that the hash aggregate 
spills to disk in Postgres 13.

You might try increasing hash_mem_multiplier from its default of 1.0, to 2.0 or 
even 4.0. That way you'd be able to use 2x or 4x more memory for executor nodes 
that are based on hashing (hash join and hash aggregate), without also 
affecting other kinds of nodes, which are typically much less sensitive to 
memory availability. This is very similar to increasing work_mem, except that 
it is better targeted.

It might even make sense to *decrease* work_mem and increase 
hash_mem_multiplier even further than 4.0. That approach is more aggressive, 
though, so I wouldn't use it until it actually proved necessary.

--
Peter Geoghegan
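
A sketch of the settings Peter describes, at the session level (the same 
values can also go in postgresql.conf):

    SET hash_mem_multiplier = 2.0;  -- hash join/aggregate nodes may use 2x work_mem

    -- The more aggressive variant: shrink work_mem, grow the multiplier,
    -- so only hash-based nodes get the large allowance.
    SET work_mem = '256MB';
    SET hash_mem_multiplier = 8.0;  -- hash nodes get ~2GB; other nodes stay at 256MB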



So how is this happening? I mean, it's the exact same query, looks like the 
same plan to me, it's the same data on the exact same VM etc... Why is that 
behavior so different?

As soon as I can, I'll check if perhaps the hash_mem_multiplier is somehow set 
differently between the two setups? That would be my first guess, but absent 
that, looks like a very different behavior across those 2 versions?

Thank you,
Laurent.


RE: Big performance slowdown from 11.2 to 13.3

2021-07-21 Thread l...@laurent-hasson.com



-Original Message-
From: Tom Lane  
Sent: Wednesday, July 21, 2021 19:36
To: l...@laurent-hasson.com
Cc: Justin Pryzby ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

"l...@laurent-hasson.com"  writes:
> My apologies... I thought this is what I had attached in my original email 
> from PGADMIN. In any case, I reran from the command line and here are the two 
> plans.

So the pain seems to be coming in with the upper hash aggregation, which is 
spilling to disk because work_mem of '384MB' is nowhere near enough.
The v11 explain doesn't show any batching there, which makes me suspect that it 
was using a larger value of work_mem.  (There could also be some edge effect 
that is making v13 use a bit more memory for the same number of tuples, which 
could lead it to spill when v11 had managed to scrape by without doing so.)

So the first thing I'd try is seeing if setting work_mem to 1GB or so improves 
matters.

The other thing that's notable is that v13 has collapsed out the CTE that used 
to sit between the two levels of hashagg.  Now I don't know of any reason that 
that wouldn't be a strict improvement, but if the work_mem theory doesn't pan 
out then that's something that'd deserve a closer look.  Does marking the WITH 
as WITH MATERIALIZED change anything about v13's performance?

regards, tom lane
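
The variant Tom asks about is the same query with the CTE pinned; a minimal 
sketch with two of the 55 pivot codes standing in for the full list (WITH ... 
AS MATERIALIZED is v12+ syntax and restores the optimization fence that v11 
applied implicitly):

    WITH t AS MATERIALIZED (
        SELECT iccqar_iccassmt_fk,
               iccqar_ques_code,
               max(iccqar_ans_val) AS iccqar_ans_val
        FROM assessmenticcqa_raw
        WHERE iccqar_ques_code IN ('DEBRIDEMENT DATE', 'DEBRIDEMENT TYPE')
        GROUP BY iccqar_iccassmt_fk, iccqar_ques_code
    )
    SELECT iccqar_iccassmt_fk,
           max(iccqar_ans_val) FILTER (WHERE iccqar_ques_code = 'DEBRIDEMENT DATE') AS "iccqa_DEBRIDEMENT_DATE",
           max(iccqar_ans_val) FILTER (WHERE iccqar_ques_code = 'DEBRIDEMENT TYPE') AS "iccqa_DEBRIDEMENT_TYPE"
    FROM t
    GROUP BY iccqar_iccassmt_fk;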




Hello Tom (and Peter)! Thanks for all this info. 

I created 3 versions of this query: CTE MATERIALIZED, CTE NOT MATERIALIZED, and 
no CTE (select directly in a sub join). Only very minor change in the final 
execution time (seconds).

I'll try the following later this evening:
- set work_mem to 1GB
- play with hash_mem_multiplier as per Peter's suggestions although he did 
suggest to try being more aggressive with it and lower work_mem... so I'll play 
with those 2 variables.

Thank you,
Laurent.







RE: Big performance slowdown from 11.2 to 13.3

2021-07-21 Thread l...@laurent-hasson.com



-Original Message-
From: Tom Lane  
Sent: Wednesday, July 21, 2021 19:43
To: l...@laurent-hasson.com
Cc: Peter Geoghegan ; Justin Pryzby ; 
pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

"l...@laurent-hasson.com"  writes:
> From: Peter Geoghegan  
>> I imagine that this has something to do with the fact that the hash 
>> aggregate spills to disk in Postgres 13.

> So how is this happening? I mean, it's the exact same query, looks like the 
> same plan to me, it's the same data on the exact same VM etc... Why is that 
> behavior so different?

What Peter's pointing out is that v11 never spilled hashagg hash tables to
disk period, no matter how big they got (possibly leading to out-of-memory
situations or swapping, but evidently you have enough RAM to have avoided
that sort of trouble).  I'd momentarily forgotten that, but I think he's
dead on about that explaining the difference.  As he says, messing with
hash_mem_multiplier would be a more targeted fix than increasing work_mem
across the board.

regards, tom lane


OK, got it! That sounds and smells good. Will try later tonight or tomorrow and 
report back.

Thank you!
Laurent.




RE: Big performance slowdown from 11.2 to 13.3

2021-07-21 Thread l...@laurent-hasson.com








Hello all,

Seems like no cigar ☹ See plan pasted below. I changed the conf as follows:
  - hash_mem_multiplier = '2'
  - work_mem = '1GB'

I tried a few other configurations, e.g., 512MB/4 and 256MB/8, with similar results.

Also, you mentioned previously that the hash was spilling to disk. How do you 
see that in the plans? What should I be looking for on my end, when playing 
with these parameters, to confirm the intended effect?

Thank you,
Laurent.


HashAggregate  (cost=1774568.21..1774579.21 rows=200 width=1260) (actual 
time=70844.078..1554843.323 rows=722853 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 1277985kB
  Buffers: shared hit=14 read=169854, temp read=15777 written=27588
  ->  HashAggregate  (cost=1360804.75..1374830.63 rows=1402588 width=56) 
(actual time=23370.026..33839.347 rows=13852618 loops=1)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126560kB
Buffers: shared read=169851, temp read=15777 written=27588
->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1256856.62 
rows=13859750 width=38) (actual time=0.072..10906.894 rows=13852618 loops=1)
  Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN 
PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK","WIDTH (CM)","WOUND PAIN LEVEL, WHERE 0 = \"NO PAIN\" AND 10 = 
\"WORST POSSIBLE PAIN\""}'::text[]))
  Rows Removed by Filter: 171680
  Buffers: shared read=169851
Settings: effective_cache_size = '52GB', from_collapse_limit = '24', 
hash_mem_multiplier = '2', jit = 'off', jit_above_cost = '2e+08', 
jit_inline_above_cost = '5e+08', jit_optimize_above_cost = '5e+08', 
join_collapse_limit = '24', max_parallel_workers = '20', 
max_parallel_workers_per_gather = '8', random_page_cost = '1.1', temp_buffers = 
'4GB', work_mem = '1GB'
Planning:
  Buffers: shared hit=186 read=37
Planning Time: 3.667 ms
Execution Time: 1555300.746 ms


RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com
OK. Will do another round of testing.


-Original Message-
From: David Rowley  
Sent: Thursday, July 22, 2021 00:44
To: l...@laurent-hasson.com
Cc: Tom Lane ; Peter Geoghegan ; Justin 
Pryzby ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, 22 Jul 2021 at 16:37, l...@laurent-hasson.com  
wrote:
> Seems like no cigar ☹ See plan pasted below. I changed the conf as follows:
>   - hash_mem_multiplier = '2'
>   - work_mem = '1GB'

> Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126560kB

You might want to keep going higher with hash_mem_multiplier until you see no 
"Disk Usage" there.  As mentioned, v11 didn't spill to disk and just used all 
the memory it pleased.  That was a bit dangerous as it could result in OOM, so 
it was fixed.

David


RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com



--

Hello all,

So, I went possibly nuclear, and still no cigar. Something's not right.
- hash_mem_multiplier = '10'
- work_mem = '1GB'

The results are
Batches: 5  Memory Usage: 2,449,457kB  Disk Usage: 105,936kB
Execution Time: 1,837,126.766 ms

It's still spilling to disk and seems to cap at about 2.5GB of memory usage in 
spite of the configuration. More importantly:
  - I don't understand how spilling a mere 100MB to disk (which seems low to me, 
and should be fast on our SSD) slows the query down by a factor of 10.
  - It also seems that memory consumption on 11 was more moderate: this process 
of ours runs several of these types of queries concurrently, and I don't think 
I ever saw the machine go over 40GB of memory usage.


HashAggregate  (cost=1774568.21..1774579.21 rows=200 width=1260) (actual 
time=84860.629..1836583.909 rows=722853 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 1277985kB
  Buffers: shared hit=46 read=169822, temp read=13144 written=23035
  ->  HashAggregate  (cost=1360804.75..1374830.63 rows=1402588 width=56) 
(actual time=27890.422..39975.074 rows=13852618 loops=1)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Batches: 5  Memory Usage: 2449457kB  Disk Usage: 105936kB
Buffers: shared hit=32 read=169819, temp read=13144 written=23035
->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1256856.62 
rows=13859750 width=38) (actual time=0.053..13623.310 rows=13852618 loops=1)
  Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN 
PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK",

RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com



-Original Message-
From: Tom Lane  
Sent: Thursday, July 22, 2021 11:57
To: l...@laurent-hasson.com
Cc: David Rowley ; Peter Geoghegan ; Justin 
Pryzby ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

I wrote:
> "l...@laurent-hasson.com"  writes:
>> It's still spilling to disk and seems to cap at 2.5GB of memory usage in 
>> spite of configuration.

> That is ... weird.

Oh: see get_hash_mem:

hash_mem = (double) work_mem * hash_mem_multiplier;

/*
 * guc.c enforces a MAX_KILOBYTES limitation on work_mem in order to
 * support the assumption that raw derived byte values can be stored in
 * 'long' variables.  The returned hash_mem value must also meet this
 * assumption.
 *
 * We clamp the final value rather than throw an error because it should
 * be possible to set work_mem and hash_mem_multiplier independently.
 */
if (hash_mem < MAX_KILOBYTES)
return (int) hash_mem;

return MAX_KILOBYTES;

So basically, we now have a hard restriction that hashaggs can't use more than 
INT_MAX kilobytes, or approximately 2.5GB, and this use case is getting eaten 
alive by that restriction.  Seems like we need to do something about that.

regards, tom lane




--

Hello!


Ah... int vs long then? I tried even higher (multiplier=16) and this definitely 
seems to be the case.

Is it fair then to deduce that the total memory usage would be 2,400,305kB + 
126,560kB? Is that roughly what V11 was consuming under the covers?
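
Sanity-checking the cap for myself (a quick sketch, assuming MAX_KILOBYTES is 
INT_MAX/1024 as in guc.h):

select (2147483647 / 1024) as max_kilobytes_kb,        -- 2097151 kB
       (2147483647 / 1024) / 1048576.0 as approx_gib;  -- ~2.0 GiB, before palloc slop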

Is it also expected that a spill-over of just 100MB (on top of 2.4GB of memory 
consumption) would cause the query to collapse like this? I am still not 
visualizing how that happens: 100MB just seems so small, and our SSD is fast.

Generating a dataset would take me a lot of time. This is a clinical database, 
so I cannot reuse the current table; I would have to mock the use case entirely 
and create a dummy dataset from scratch.



HashAggregate  (cost=1774568.21..1774579.21 rows=200 width=1260) (actual 
time=94618.303..1795311.542 rows=722853 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 1277985kB
  Buffers: shared hit=14 read=169854, temp read=15777 written=27588
  ->  HashAggregate  (cost=1360804.75..1374830.63 rows=1402588 width=56) 
(actual time=30753.022..45384.558 rows=13852618 loops=1)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126560kB
Buffers: shared read=169851, temp read=15777 written=27588
->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1256856.62 
rows=13859750 width=38) (actual time=0.110..14342.258 rows=13852618 loops=1)
  Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN 
PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UND

RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com

-Original Message-
From: Peter Geoghegan  
Sent: Thursday, July 22, 2021 12:14
To: Tom Lane 
Cc: Jeff Davis ; l...@laurent-hasson.com; David Rowley 
; Justin Pryzby ; 
pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 8:45 AM Tom Lane  wrote:
> That is ... weird.  Maybe you have found a bug in the spill-to-disk 
> logic; it's quite new after all.  Can you extract a self-contained 
> test case that behaves this way?

I wonder if this has something to do with the way that the input data is 
clustered. I recall noticing that that could significantly alter the behavior 
of HashAggs as of Postgres 13.

--
Peter Geoghegan



I could execute that test and re-cluster against the index. But I believe 
that's already done? Let me check.

Thank you,
Laurent.


RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com


-Original Message-
From: David Rowley  
Sent: Thursday, July 22, 2021 12:18
To: Peter Geoghegan 
Cc: Tom Lane ; Jeff Davis ; 
l...@laurent-hasson.com; Justin Pryzby ; 
pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Fri, 23 Jul 2021 at 04:14, Peter Geoghegan  wrote:
>
> On Thu, Jul 22, 2021 at 8:45 AM Tom Lane  wrote:
> > That is ... weird.  Maybe you have found a bug in the spill-to-disk 
> > logic; it's quite new after all.  Can you extract a self-contained 
> > test case that behaves this way?
>
> I wonder if this has something to do with the way that the input data 
> is clustered. I recall noticing that that could significantly alter 
> the behavior of HashAggs as of Postgres 13.

Isn't it more likely to be reaching the group limit rather than the memory 
limit?

if (input_groups * hashentrysize < hash_mem * 1024L) { if (num_partitions != 
NULL) *num_partitions = 0; *mem_limit = hash_mem * 1024L; *ngroups_limit = 
*mem_limit / hashentrysize; return; }

There are 55 aggregates on a varchar(255). I think hashentrysize is pretty big. 
If it was 255*55 then only 765591 groups fit in the 10GB of memory.
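
Back-of-envelope, e.g. (illustrative only, with hashentrysize guessed at 255*55 
bytes):

select 255 * 55 as est_entry_bytes,                                       -- 14025
       (10::bigint * 1024 * 1024 * 1024) / (255 * 55) as groups_in_10gb;  -- ~765591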

David



-

Hello,

So, FYI, the query I shared is actually a simpler use case of ours 😊 We have a 
similar pivot query over 600 columns that creates a large flat table for 
analysis on an even larger table. It takes about 15min to run on V11, with 
strong CPU usage and no particular memory spike that I can detect via Task 
Manager. We have been pushing PG hard to simplify the workflows of our analysts 
and data scientists downstream.

Thank you,
Laurent.



RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com



-Original Message-
From: Justin Pryzby  
Sent: Thursday, July 22, 2021 12:23
To: l...@laurent-hasson.com
Cc: Tom Lane ; David Rowley ; Peter 
Geoghegan ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 04:16:34PM +0000, l...@laurent-hasson.com wrote:
> Is it fair then to deduce that the total memory usage would be 2,400,305kB + 
> 126,560kB? Is this what under the covers V11 is consuming more or less?

It might be helpful to know how much RAM v11 is using.

Could you run the query with log_executor_stats=on; client_min_messages=debug;

The interesting part is this:
!   7808 kB max resident size

-- 
Justin


---

Hello Justin,

> log_executor_stats=on; client_min_messages=debug;

Would the results then come in EXPLAIN or would I need to pick something up 
from the logs?

Thank you,
Laurent.






RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com
I did try 2000MB work_mem and 16 multiplier 😊 It seems to plateau at 2GB no 
matter what. This is what the explain showed:

HashAggregate  (cost=1774568.21..1774579.21 rows=200 width=1260) (actual 
time=94618.303..1795311.542 rows=722853 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 1277985kB
  Buffers: shared hit=14 read=169854, temp read=15777 written=27588
  ->  HashAggregate  (cost=1360804.75..1374830.63 rows=1402588 width=56) 
(actual time=30753.022..45384.558 rows=13852618 loops=1)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126560kB
Buffers: shared read=169851, temp read=15777 written=27588
->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1256856.62 
rows=13859750 width=38) (actual time=0.110..14342.258 rows=13852618 loops=1)
  Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN 
PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK","WIDTH (CM)","WOUND PAIN LEVEL, WHERE 0 = \"NO PAIN\" AND 10 = 
\"WORST POSSIBLE PAIN\""}'::text[]))
  Rows Removed by Filter: 171680
  Buffers: shared read=169851
Settings: effective_cache_size = '52GB', from_collapse_limit = '24', 
hash_mem_multiplier = '16', jit = 'off', jit_above_cost = '2e+08', 
jit_inline_above_cost = '5e+08', jit_optimize_above_cost = '5e+08', 
join_collapse_limit = '24', max_parallel_workers = '20', 
max_parallel_workers_per_gather = '8', random_page_cost = '1.1', temp_buffers = 
'4GB', work_mem = '2000MB'
Planning:
  Buffers: shared hit=186 read=37
Planning Time: 55.709 ms
Execution Time: 1795921.717 ms





-Original Message-
From: Peter Geoghegan  
Sent: Thursday, July 22, 2021 13:05
To: Tom Lane 
Cc: David Rowley ; l...@laurent-hasson.com; Justin Pryzby 
; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 9:53 AM Peter Geoghegan  wrote:
> I suspect David's theory about hash_agg_set_limits()'s ngroup limit is 
> correct. It certainly seems like a good starting point.

I also suspect that if Laurent set work_mem and/or hash_mem_multiplier
*extremely* aggressively, then eventually the hash agg would be in-memory. And 
without actually using all that much memory.

I'm not suggesting that that is a sensible resolution to Laurent's complaint. 
I'm just pointing out that it's probably not fundamentally impossible to make 
the hash agg avoid spilling through tuning these GUCs. At least I see no 
evidence of that right now.

--
Peter Geoghegan


RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com



-Original Message-
From: Justin Pryzby  
Sent: Thursday, July 22, 2021 12:36
To: l...@laurent-hasson.com
Cc: Tom Lane ; David Rowley ; Peter 
Geoghegan ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 04:30:00PM +0000, l...@laurent-hasson.com wrote:
> Hello Justin,
> 
> > log_executor_stats=on; client_min_messages=debug;
> 
> Would the results then come in EXPLAIN or would I need to pick something up 
> from the logs?

If you're running with psql, and client_min_messages=debug, then it'll show up 
as a debug message to the client:

ts=# SET log_executor_stats=on; SET client_min_messages=debug; explain analyze 
SELECT 1; SET SET
LOG:  EXECUTOR STATISTICS
DETAIL:  ! system usage stats:
!   0.11 s user, 0.000209 s system, 0.000219 s elapsed
!   [0.040608 s user, 0.020304 s system total]
!   7808 kB max resident size
...

It can but doesn't have to use "explain" - that just avoids showing the query 
output, since it's not what's interesting here.

--
Justin


-

Justin,

I tried this but I am not seeing the max resident size output.

Pepper=# SET log_executor_stats=on; SET client_min_messages=debug; explain 
analyze SELECT 1;
SET
SET
LOG:  EXECUTOR STATISTICS
DETAIL:  ! system usage stats:
!   0.00 s user, 0.00 s system, 0.000109 s elapsed
!   [494.640625 s user, 19.171875 s system total]
 QUERY PLAN

 Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 
loops=1)
 Planning Time: 0.041 ms
 Execution Time: 0.012 ms
(3 rows)


For my query:


LOG:  EXECUTOR STATISTICS
DETAIL:  ! system usage stats:
!   169.625000 s user, 5.843750 s system, 175.490088 s elapsed
!   [494.640625 s user, 19.171875 s system total]
 HashAggregate  (cost=1764285.18..1764296.18 rows=200 width=1260) (actual 
time=86323.813..174737.442 rows=723659 loops=1)
   Group Key: t.iccqa_iccassmt_fk
   Buffers: shared hit=364 read=170293, temp written=83229
   CTE t
 ->  HashAggregate  (cost=1343178.39..1356985.17 rows=1380678 width=56) 
(actual time=22594.053..32519.573 rows=13865785 loops=1)
   Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
   Buffers: shared hit=364 read=170293
   ->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1240682.76 
rows=13666084 width=38) (actual time=0.170..10714.598 rows=13865785 loops=1)
         Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN 
PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK","WIDTH (CM)","WOUND PAIN LEVEL, WHERE 0 = \"NO PAIN\" AND 10 = 
\"WORST POSSIBLE PAIN\""}'::text[]))
 Rows Removed by Filter: 172390
 Buffers: shared hit=364 read=170293
   ->  CTE Scan on t  (cost=0.00..27613.56 rows=1380678 width=552) (actual 
time=22594.062..40248.874 rows=13865785 loops=1)
 Buffers: shared hit=364 read=170293, temp written=83229
 Planning Time: 0.728 ms
 Execution Time: 175482.904 ms
(15 rows)






RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com


-Original Message-
From: Tom Lane  
Sent: Thursday, July 22, 2021 12:42
To: Peter Geoghegan 
Cc: David Rowley ; l...@laurent-hasson.com; Justin Pryzby 
; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

Peter Geoghegan  writes:
> On Thu, Jul 22, 2021 at 9:21 AM Tom Lane  wrote:
>> Yeah, I should have said "2GB plus palloc slop".  It doesn't surprise 
>> me a bit that we seem to be eating another 20% on top of the nominal 
>> limit.

> MAX_KILOBYTES is the max_val for the work_mem GUC itself, and has been 
> for many years.

Right.  The point here is that before v13, hash aggregation was not subject to 
the work_mem limit, nor any related limit.  If you did an aggregation requiring 
more than 2GB-plus-slop, it would work just fine as long as your machine had 
enough RAM.  Now, the performance sucks and there is no knob you can turn to 
fix it.  That's unacceptable in my book.

regards, tom lane


---

Hello all,

As a user of PG, we have taken pride over the last few years in tuning the heck 
out of the system and getting great performance compared to alternatives like 
SQLServer. The customers we work with typically have data centers and are 
overwhelmingly Windows shops: we won the battle to deploy a complex operational 
system on PG vs SQLServer, but Linux vs Windows was still a bridge too far for 
many. I am surprised, though, that this limitation introduced after V11 hasn't 
caused issues elsewhere. Are we doing things that are that far out of the norm? 
Are we early in pushing V13 to full production? 😊 Doing analytics over pivoted 
tables with hundreds of columns is not uncommon in our world.

As for the three other requests from the team:

Clustering:
==
I re-clustered the table on the index that drives the pivot logic but I didn't 
see any change:

cluster verbose assessmenticcqa_raw using 
assessmenticcqa_raw_idx_iccqar_assmt_ques;

HashAggregate  (cost=1774465.36..1774476.36 rows=200 width=1260) (actual 
time=80848.591..1763443.586 rows=722853 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 1277985kB
  Buffers: shared hit=369 read=169577, temp read=15780 written=27584
  ->  HashAggregate  (cost=1360748.50..1374772.80 rows=1402430 width=56) 
(actual time=25475.554..38256.923 rows=13852618 loops=1)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126552kB
Buffers: shared hit=352 read=169577, temp read=15780 written=27584
->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1256812.09 
rows=13858188 width=38) (actual time=0.085..11914.135 rows=13852618 loops=1)
  Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN 
PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",&q

RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com


-Original Message-
From: Justin Pryzby  
Sent: Thursday, July 22, 2021 13:29
To: l...@laurent-hasson.com
Cc: Tom Lane ; David Rowley ; Peter 
Geoghegan ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 05:26:26PM +0000, l...@laurent-hasson.com wrote:
> I tried this but I am not seeing the max resident size output.

Oh.  Apparently, that's not supported under Windows...

#if defined(HAVE_GETRUSAGE)
appendStringInfo(&str,
 "!\t%ld kB max resident size\n",




Hello,

Damn... I know Windows is a lower priority, and this is yet another issue, but 
in Healthcare, Windows is so prevalent...

Thank you,
Laurent.




RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com


-Original Message-
From: Peter Geoghegan  
Sent: Thursday, July 22, 2021 13:36
To: l...@laurent-hasson.com
Cc: Justin Pryzby ; Tom Lane ; David 
Rowley ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 10:33 AM l...@laurent-hasson.com 
 wrote:
> Damn... I know Windows is a lower priority, and this is yet another issue, 
> but in Healthcare, Windows is so prevalent everywhere...

To be clear, I didn't actually say that. I said that it doesn't matter either 
way, as far as addressing this long standing "int64 vs long"
issue goes.

--
Peter Geoghegan


Yes, agreed, Peter... The "lower priority" issue was mentioned, but not in terms 
of the applicability of the fix overall. Personally, I would prefer going the 
size_t route vs int/long/int64 in C/C++. Of course, as a user, I'd love a patch 
on V13 and something cleaner in V14.

Thanks,
Laurent.




RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread l...@laurent-hasson.com
I am not sure I understand this parameter well enough, but it is at its default 
value of 1000 right now. I have read Robert's post 
(http://rhaas.blogspot.com/2018/06/using-forceparallelmode-correctly.html) and 
could play with those parameters, but I am unsure whether what you are 
describing will unlock this 2GB limit.


From: Vijaykumar Jain 
Sent: Thursday, July 22, 2021 16:32
To: l...@laurent-hasson.com
Cc: Justin Pryzby ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

Just asking, I may be completely wrong.

Is this query parallel safe? Can we force parallel workers, by setting a low 
parallel_setup_cost or otherwise, to make use of scatter/gather and Partial 
HashAggregate(s)? I am assuming that more workers doing things in parallel 
would require less disk spill per (partial) hash aggregate, plus the 
scatter/gather at the end.

I did some runs in my demo environment, not with the same query but with some 
GROUP BY aggregates over around 25M rows, and it showed reasonable results, not 
too far off. This was pg14 on Ubuntu.
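
For reference, the demo forced parallelism with settings along these lines 
(illustrative and aggressive, for experimentation only):

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set max_parallel_workers_per_gather = 4;
-- then EXPLAIN ANALYZE the GROUP BY query and look for Partial HashAggregate under Gather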






RE: Big performance slowdown from 11.2 to 13.3

2021-07-23 Thread l...@laurent-hasson.com


From: Vijaykumar Jain 
Sent: Friday, July 23, 2021 10:45
To: l...@laurent-hasson.com
Cc: Justin Pryzby ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Fri, 23 Jul 2021 at 03:06, l...@laurent-hasson.com wrote:
I am not sure I understand this parameter well enough, but it is at its default 
value of 1000 right now. I have read Robert's post 
(http://rhaas.blogspot.com/2018/06/using-forceparallelmode-correctly.html) and 
could play with those parameters, but I am unsure whether what you are 
describing will unlock this 2GB limit.


Yeah, maybe I was digressing, and this possibly cannot get around the Windows 
bottleneck.

Although the query is different, the steps were:
1) use the system defaults, work_mem = 4MB, parallel_setup_cost = 1000
-- runs the query in parallel; no disk spill, as work_mem is sufficient for my data

postgres=# explain analyze with cte as (select month_name, day_name, 
year_actual, max(date) date from dimensions.dates  group by year_actual, 
month_name, day_name) select max(date),year_actual  from cte group by 
year_actual;
 QUERY 
PLAN

 GroupAggregate  (cost=931227.78..932398.85 rows=200 width=8) (actual 
time=7850.214..7855.848 rows=51 loops=1)
   Group Key: dates.year_actual
   ->  Finalize GroupAggregate  (cost=931227.78..932333.85 rows=4200 width=28) 
(actual time=7850.075..7855.611 rows=4201 loops=1)
 Group Key: dates.year_actual, dates.month_name, dates.day_name
 ->  Gather Merge  (cost=931227.78..932207.85 rows=8400 width=28) 
(actual time=7850.069..7854.008 rows=11295 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=930227.76..930238.26 rows=4200 width=28) (actual 
time=7846.419..7846.551 rows=3765 loops=3)
 Sort Key: dates.year_actual, dates.month_name, 
dates.day_name
 Sort Method: quicksort  Memory: 391kB
 Worker 0:  Sort Method: quicksort  Memory: 392kB
 Worker 1:  Sort Method: quicksort  Memory: 389kB
 ->  Partial HashAggregate  (cost=929933.00..929975.00 
rows=4200 width=28) (actual time=7841.979..7842.531 rows=3765 loops=3)
   Group Key: dates.year_actual, dates.month_name, 
dates.day_name
   Batches: 1  Memory Usage: 721kB
   Worker 0:  Batches: 1  Memory Usage: 721kB
   Worker 1:  Batches: 1  Memory Usage: 721kB
   ->  Parallel Seq Scan on dates  
(cost=0.00..820355.00 rows=10957800 width=28) (actual time=3.347..4779.784 
rows=8766240 loops=3)
 Planning Time: 0.133 ms
 Execution Time: 7855.958 ms
(20 rows)

-- set  work_mem to a very low value, to spill to disk and compare the spill in 
parallel vs serial
postgres=# set work_mem TO 64; --
SET
postgres=# explain analyze with cte as (select month_name, day_name, 
year_actual, max(date) date from dimensions.dates  group by year_actual, 
month_name, day_name) select max(date),year_actual  from cte group by 
year_actual;
 QUERY 
PLAN

 GroupAggregate  (cost=2867778.00..2868949.07 rows=200 width=8) (actual 
time=18116.529..18156.972 rows=51 loops=1)
   Group Key: dates.year_actual
   ->  Finalize GroupAggregate  (cost=2867778.00..2868884.07 rows=4200 
width=28) (actual time=18116.421..18156.729 rows=4201 loops=1)
 Group Key: dates.year_actual, dates.month_name, dates.day_name
 ->  Gather Merge  (cost=2867778.00..2868758.07 rows=8400 width=28) 
(actual time=18116.412..18155.136 rows=11126 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=2866777.98..2866788.48 rows=4200 width=28) 
(actual time=17983.836..17984.981 rows=3709 loops=3)
 Sort Key: dates.year_actual, dates.month_name, 
dates.day_name
 Sort Method: external merge  Disk: 160kB
 Worker 0:  Sort Method: external merge  Disk: 168kB
 Worker 1:  Sort Method: external merge  Disk: 160kB
 ->  Partial HashAggregate  (cost=2566754.38..2866423.72 
rows=4200 width=28) (actual time=10957.390..17976.250 rows=3709 loops=3)
   Group Key: dates.year_actual, dates.month_name, 
dates.day_name
   Planned Partitions: 4  Batches: 21  Memory Usage: 
93kB  Disk Usage: 457480kB
   Worker 0:

RE: Big performance slowdown from 11.2 to 13.3

2021-07-25 Thread l...@laurent-hasson.com


-Original Message-
From: Tom Lane  
Sent: Sunday, July 25, 2021 14:08
To: l...@laurent-hasson.com
Cc: Peter Geoghegan ; David Rowley ; Justin 
Pryzby ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

"l...@laurent-hasson.com"  writes:
> As a user of PG, we have taken pride in the last few years in tuning the heck 
> out of the system and getting great performance compared to alternatives like 
> SQLServer. The customers we work with typically have data centers and are 
> overwhelmingly Windows shops: we won the battle to deploy a complex 
> operational system on PG vs SQLServer, but Linux vs Windows was still a 
> bridge too far for many. I am surprised that this limitation introduced after 
> V11 hasn't caused issues elsewhere though.

Maybe it has, but you're the first to report the problem, or at least the first 
to report it with enough detail to trace the cause.

I've pushed a fix that removes the artificial restriction on work_mem times 
hash_mem_multiplier; it will be in next month's 13.4 release.
You'll still need to increase hash_mem_multiplier to get satisfactory 
performance on your workload, but at least it'll be possible to do that.

regards, tom lane


-

Hello Tom,

So happy to help in whatever capacity I can 😊

This is fantastic news! Can't wait to try 13.4 asap and will report back.

Thank you,
Laurent.


RE: Big performance slowdown from 11.2 to 13.3

2021-07-27 Thread l...@laurent-hasson.com
Tom,

One question popped up in my head: hash_mem_multiplier is an upper bound, 
right? It doesn't reserve memory ahead of time, correct? So there is no reason 
for me to spend undue amounts of time fine-tuning this parameter. If I set 
work_mem to 512MB, then I can set hash_mem_multiplier to 8 and should be OK; 
this doesn't mean that every query will consume 4GB of memory.
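
In other words, something like this (upper bounds only, nothing reserved up 
front, if I understand correctly):

set work_mem = '512MB';
set hash_mem_multiplier = 8;  -- a hash aggregate may use at most 512MB * 8 = 4GB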

Thank you,
Laurent.






RE: Big performance slowdown from 11.2 to 13.3

2021-07-28 Thread l...@laurent-hasson.com


-Original Message-
From: Tom Lane  
Sent: Tuesday, July 27, 2021 23:31
To: l...@laurent-hasson.com
Cc: Peter Geoghegan ; David Rowley ; Justin 
Pryzby ; pgsql-performa...@postgresql.org
Subject: Re: Big performance slowdown from 11.2 to 13.3

I wrote:
> Yeah, I wouldn't sweat over the specific value.  The pre-v13 behavior 
> was effectively equivalent to hash_mem_multiplier = infinity, so if 
> you weren't having any OOM problems before, just crank it up.

Oh, wait, scratch that: the old executor's behavior is accurately described by 
that statement, but the planner's is not.  The planner will not pick a hashagg 
plan if it guesses that the hash table would exceed the configured limit 
(work_mem before, now work_mem times hash_mem_multiplier).  So raising 
hash_mem_multiplier to the moon might bias the v13 planner to pick hashagg 
plans in cases where earlier versions would not have.  This doesn't describe 
your immediate problem, but it might be a reason to not just set the value as 
high as you can.

BTW, this also suggests that the planner is underestimating the amount of 
memory needed for the hashagg, both before and after.
That might be something to investigate at some point.

regards, tom lane


This is all very useful to know, and all things I'll get to test once 13.4 is 
released. I'll report back when I am able to. For instance, I plan to compare 
plans along these lines (an illustrative sketch on our table):
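
set hash_mem_multiplier = 2;
explain select iccqar_iccassmt_fk, iccqar_ques_code, max(iccqar_ans_val)
  from assessmenticcqa_raw group by 1, 2;  -- watch whether HashAggregate is chosen
set hash_mem_multiplier = 16;
explain select iccqar_iccassmt_fk, iccqar_ques_code, max(iccqar_ans_val)
  from assessmenticcqa_raw group by 1, 2;  -- a higher ceiling may flip the plan choice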

Thank you,
Laurent.




RE: Big performance slowdown from 11.2 to 13.3

2021-08-20 Thread l...@laurent-hasson.com
KEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK","WIDTH (CM)","WOUND PAIN LEVEL, WHERE 0 = \"NO PAIN\" AND 10 = 
\"WORST POSSIBLE PAIN\""}'::text[]))
  Rows Removed by Filter: 30428
  Buffers: shared hit=158815
Planning:
  Buffers: shared hit=3
Planning Time: 0.552 ms
Execution Time: 54241.152 ms



V11.2 - 74s
HashAggregate  (cost=1629249.48..1629251.48 rows=200 width=1764) (actual 
time=68833.447..73384.374 rows=742896 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Buffers: shared read=173985
  ->  Finalize HashAggregate  (cost=1205455.43..1219821.33 rows=1436590 
width=56) (actual time=19441.489..28630.297 rows=14176024 loops=1)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Buffers: shared read=173985
->  Gather  (cost=418922.41..1151583.31 rows=7182950 width=56) (actual 
time=3698.235..8445.971 rows=14176024 loops=1)
  Workers Planned: 5
  Workers Launched: 5
  Buffers: shared read=173985
  ->  Partial HashAggregate  (cost=417922.41..432288.31 
rows=1436590 width=56) (actual time=3559.562..4619.406 rows=2362671 loops=6)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Buffers: shared read=173985
->  Parallel Seq Scan on assessmenticcqa_raw  
(cost=0.00..396656.48 rows=2835457 width=38) (actual time=0.261..1817.102 
rows=2362671 loops=6)
  Filter: ((iccqar_ques_code)::text = ANY 
('{"DEBRIDEMENT DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH 
(CM)","DEPTH DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS 
WOUND?","DRAIN PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK","WIDTH (CM)","WOUND PAIN LEVEL, WHERE 0 = \"NO PAIN\" AND 10 = 
\"WORST POSSIBLE PAIN\""}'::text[]))
  Rows Removed by Filter: 31608
  Buffers: shared read=173985
Planning Time: 22.6

Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-21 Thread l...@laurent-hasson.com
Hello all,

I think I have identified a major performance issue between V11.2 and 13.4 with 
respect to exception handling in UDFs. I have the following simplified query 
that pivots data and makes use of a UDF to convert data to a specific type, in 
this case, float:


select "iccqa_iccassmt_fk"
 , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'DEPTH (CM)') ,null) as "iccqa_DEPTH_CM"
 , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'LENGTH (CM)') ,null) as "iccqa_LENGTH_CM"
 , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'WIDTH (CM)') ,null) as "iccqa_WIDTH_CM"
 , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'DRAIN PRESENT') ,null) as "iccqa_DRAIN_PRESENT"
 , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'MEASUREMENTS TAKEN') ,null) as "iccqa_MEASUREMENTS_TAKEN"
 , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'SIGNS AND SYMPTOMS OF INFECTION') ,null) as 
"iccqa_SIGNS_AND_SYMPTOMS_OF_INFECTION"
from  (
-- 'A pivoted view of ICC QA assessments'
select VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_iccassmt_fk" as 
"iccqa_iccassmt_fk" -- The key identifying an ICC assessment.
 , VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ques_code" as 
"iccqar_ques_code" -- The question long code from the meta-data.
 , max(VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ans_val") as 
"iccqar_ans_val" -- The official answer, if applicable) from the meta-data.
  from VNAHGEDW_FACTS.AssessmentICCQA_Raw
where VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ques_code" in ('DEPTH (CM)', 'LENGTH (CM)', 'WIDTH (CM)',
          'DRAIN PRESENT', 'MEASUREMENTS TAKEN', 'SIGNS AND SYMPTOMS OF INFECTION')
 group by 1, 2
) T
 group by 1
;


The UDF is simple, as follows:


CREATE OR REPLACE FUNCTION TILDA.toFloat(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
EXCEPTION WHEN OTHERS THEN
  RETURN val;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;
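
A couple of illustrative calls, just to show the intended behavior:

select Tilda.toFloat('3.5', null);  -- 3.5 (clean conversion)
select Tilda.toFloat('YES', -1);    -- -1 (exception caught, default returned)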



It works like a coalesce, but with a conversion. I think I have identified a 
large performance difference in the exception handling. It so happens that for 
the last 3 columns ('DRAIN PRESENT', 'MEASUREMENTS TAKEN' and 'SIGNS AND 
SYMPTOMS OF INFECTION'), the data is VERY dirty: there is a mix of 0/1, YES/NO, 
and other mistyped stuff. This means these 3 columns throw lots of exceptions 
in the UDF. To illustrate, I simply break this into 2 queries.



select "iccqa_iccassmt_fk"
 , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'DEPTH (CM)') ,null))::real as "iccqa_DEPTH_CM"
 , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'LENGTH (CM)') ,null))::real as "iccqa_LENGTH_CM"
 , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'WIDTH (CM)') ,null))::real as "iccqa_WIDTH_CM"
-- , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'DRAIN PRESENT') ,null))::real as "iccqa_DRAIN_PRESENT"
-- , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'MEASUREMENTS TAKEN') ,null))::real as "iccqa_MEASUREMENTS_TAKEN"
-- , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 
'SIGNS AND SYMPTOMS OF INFECTION') ,null))::real as 
"iccqa_SIGNS_AND_SYMPTOMS_OF_INFECTION"
from  (
-- 'A pivoted view of ICC QA assessments'
select VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_iccassmt_fk" as 
"iccqa_iccassmt_fk" -- The key identifying an ICC assessment.
 , VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ques_code" as 
"iccqar_ques_code" -- The question long code from the meta-data.
 , max(VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ans_val") as 
"iccqar_ans_val" -- The official answer, if applicable) from the meta-data.
  from VNAHGEDW_FACTS.AssessmentICCQA_Raw
where VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ques_code" in ('DEPTH (CM)', 'LENGTH (CM)', 'WIDTH (CM)',
          'DRAIN PRESENT', 'MEASUREMENTS TAKEN', 'SIGNS AND SYMPTOMS OF INFECTION')
group by 1, 2
) T
 group by 1
;


The performance is as expected.


HashAggregate  (cost=448463.70..448467.20 rows=200 width=16) (actual 
time=6760.797..9585.397 rows=677899 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 147489kB
  Buffers: shared hit=158815
  ->  HashAggregate  (cost=405997.87..417322.09 rows=1132422 width=56) (actual 
time=4576.514..5460.770 rows=2374628 loops=1)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Batches: 1  Memory Usage: 368657kB
Buffers: shared hit=158815
->  Seq Scan on assessmenticcqa_raw  (cost=0.00..3882

RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-21 Thread l...@laurent-hasson.com
OK... I apologize for the long email before. Right after I sent it, I thought 
of a much simpler use-case to illustrate the issue, one which doesn't depend on 
any special data I have access to, nor on complex pivoting. It's as raw as I 
can make it.

I create a table with 1M rows and 2 columns. Column "a" is a random string, 
while column "b" is a random integer as a string. Then I use a UDF that 
converts strings to floats and handles an exception if the incoming string is 
not parsable as a float. Then I do a simple select of each column. In the "a" 
case, the UDF throws and catches lots of exceptions. In the "b" case, the 
conversion is clean and exceptions are not thrown.


create table sampletest (a varchar, b varchar);

insert into sampletest (a, b)
select substr(md5(random()::text), 0, 15), 
       (1000000*random())::integer::varchar
  from generate_series(1, 1000000);

CREATE OR REPLACE FUNCTION toFloat(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
EXCEPTION WHEN OTHERS THEN
  RETURN val;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;

select MAX(toFloat(a, null)) as "a" from sampletest;

select MAX(toFloat(b, null)) as "b" from sampletest;



On purpose, I am doing a max(toFloat) instead of toFloat(max) to exercise the 
UDF 1M times.


V13.4 "a" scenario (exceptions)
-
Aggregate  (cost=14778.40..14778.41 rows=1 width=4) (actual 
time=774098.537..774098.538 rows=1 loops=1)
  Buffers: shared hit=6373
  ->  Seq Scan on sampletest  (cost=0.00..11975.60 rows=560560 width=32) 
(actual time=0.011..285.458 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning Time: 0.066 ms
Execution Time: 774,098.563 ms


V13.4 "b" scenario (no exceptions)
-
Aggregate  (cost=14778.40..14778.41 rows=1 width=4) (actual 
time=1510.200..1510.201 rows=1 loops=1)
  Buffers: shared hit=6385
  ->  Seq Scan on sampletest  (cost=0.00..11975.60 rows=560560 width=32) 
(actual time=0.024..115.196 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning:
  Buffers: shared hit=26
Planning Time: 0.361 ms
Execution Time: 1,530.659 ms


V11.2 "a" scenario (exceptions)
-
Aggregate  (cost=21658.00..21658.01 rows=1 width=4) (actual 
time=26528.286..26528.286 rows=1 loops=1)
  Buffers: shared hit=6393
  ->  Seq Scan on sampletest  (cost=0.00..16562.00 rows=1019200 width=15) 
(actual time=0.037..190.633 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning Time: 1.182 ms
Execution Time: 26,530.492 ms


V11.2 "b" scenario (no exceptions)
-
Aggregate  (cost=21658.00..21658.01 rows=1 width=4) (actual 
time=1856.116..1856.116 rows=1 loops=1)
  Buffers: shared hit=6370
  ->  Seq Scan on sampletest  (cost=0.00..16562.00 rows=1019200 width=8) 
(actual time=0.014..88.152 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning Time: 0.098 ms
Execution Time: 1,856.152 ms





Summary:

  *   Scenario V11.2/a: 26.6s
  *   Scenario V11.2/b: 1.9s
  *   Scenario V13.4/a: 774.1s
  *   Scenario V13.4/b: 1.5s

Conclusion:

  *   The no-exception scenario performs 20% better on 13.4 vs 11.2 (nice for a 
straight scan!)
  *   On 11.2, exceptions add an overhead of over 14x (1.9s vs 26.6s). I did 
not expect exceptions to add such a large overhead. Why is that?
  *   Between 11.2 and 13.4, the exceptions scenario "a" performs about 30x 
slower (26.6s vs 774.1s).
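
One workaround I may explore (a sketch only, not validated here; toFloatNoEx is 
a hypothetical name, and the regex assumes plain decimal notation) is to avoid 
the exception machinery entirely by validating the string first:

CREATE OR REPLACE FUNCTION toFloatNoEx(str varchar, val real)
RETURNS real AS $$
  SELECT case when str ~ '^-?[0-9]+(\.[0-9]+)?$' then str::real else val end;
$$ LANGUAGE sql IMMUTABLE;

-- e.g.: select MAX(toFloatNoEx(a, null)) as "a" from sampletest;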

Thank you!
Laurent Hasson.




RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-21 Thread l...@laurent-hasson.com



-Original Message-
From: Tom Lane  
Sent: Saturday, August 21, 2021 11:05
To: l...@laurent-hasson.com
Cc: pgsql-performa...@postgresql.org
Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

"l...@laurent-hasson.com"  writes:
> OK... I apologize for the long email before. Right after I sent it, I thought 
> of a much simpler use-case to illustrate the issue which doesn't depend on 
> any special data I have access o and complex pivoting. It's as raw as I can 
> make it.
> I create a table with 1M rows and 2 columns. Column "a" is a random string, 
> while column "b" is a random integer as a string. Then I use a UDF that 
> converts strings to floats and handles an exception if the incoming string is 
> not parsable as a float. Then I do a simple select of each column. In the "a" 
> case, the UDF throws and catches lots of exceptions. In the "b" case, the 
> conversion is clean and exceptions are not thrown.

I tried this script on a few different versions and got these psql-measured 
timings for the test queries:

HEAD:
Time: 12234.297 ms (00:12.234)
Time: 3029.643 ms (00:03.030)

v14:
Time: 12519.038 ms (00:12.519)
Time: 3211.315 ms (00:03.211)

v13:
Time: 12132.026 ms (00:12.132)
Time: 3114.582 ms (00:03.115)

v12:
Time: 11787.554 ms (00:11.788)
Time: 3520.875 ms (00:03.521)

v11:
Time: 13066.495 ms (00:13.066)
Time: 3503.790 ms (00:03.504)

v10:
Time: 15890.844 ms (00:15.891)
Time: 4999.843 ms (00:05.000)

(Caveats: these are assert-enabled debug builds, so they're all slower than 
production builds, but the overhead should be pretty uniform across branches I 
think.  Also, I wasn't trying hard to eliminate noise, e.g. I didn't do 
multiple runs.  So I wouldn't trust these results to be reproducible to better 
than 10% or so.)

The overhead of an EXCEPTION block is definitely high, and more so when an 
exception actually occurs, but these are known facts and my results are not out 
of line with my expectations.  Yours are though, so something is drastically 
slowing the exception- recovery path in your installation.  Do you have any 
extensions loaded?

regards, tom lane


--

So you mean that on average, the 4x overhead of exceptions is around what you'd 
expect?

As for results in general, yes, your numbers look pretty uniform across 
versions. On my end, comparing V11.2 vs V13.4 shows a much different picture!

I have a few extensions installed: plpgsql, fuzzystrmatch, pg_trgm and 
tablefunc. They are the same on both db installs, with nearly identical 
extension versions (pg_trgm is 1.3 on V11.2 and 1.4 on V13.4).

V11.2:
extname  
|extowner|extnamespace|extrelocatable|extversion|extconfig|extcondition|
-|||--|--|-||
plpgsql  |  10|  11|false |1.0   |NULL |NULL
|
fuzzystrmatch|  10|2200|true  |1.1   |NULL |NULL
|
pg_trgm  |  10|2200|true  |1.3   |NULL |NULL
|
tablefunc|  10|2200|true  |1.0   |NULL |NULL
|

V13.4
oid  |extname      |extowner|extnamespace|extrelocatable|extversion|extconfig|extcondition|
-----|-------------|--------|------------|--------------|----------|---------|------------|
13428|plpgsql      |      10|          11|false         |1.0       |NULL     |NULL        |
16676|fuzzystrmatch|      10|        2200|true          |1.1       |NULL     |NULL        |
16677|pg_trgm      |      10|        2200|true          |1.4       |NULL     |NULL        |
16678|tablefunc    |      10|        2200|true          |1.0       |NULL     |NULL        |

Thank you,
Laurent.




RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-21 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Tom Lane 
   >  Sent: Saturday, August 21, 2021 14:05
   >  To: Mladen Gogala 
   >  Cc: pgsql-performance@lists.postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  Mladen Gogala  writes:
   >  > I know that 14 is a beta version but the performance is significantly
   >  > worse than v13 (I assume it's 13.4). Head revision is better than v14
   >  > but still worse than v13.  Can you expand a bit on the difference?
   >  
   >  [ shrug... ]  I don't see any meaningful differences between those
   >  numbers --- they're within 3% or so across versions, which is less than
   >  the margin of error considering I wasn't trying to control for outside
   >  effects like CPU speed stepping.  Microbenchmarks like this one are
   >  notoriously noisy.  Maybe there's some real difference there, but these
   >  numbers aren't to be trusted that much.
   >  
   >  What I was looking for was some evidence matching Laurent's report of
   >  the exception-recovery path being 500X slower than non-exception.
   >  That would have been obvious even with the sloppiest of measurements
   >  ... but I'm not seeing it.
   >  
   >regards, tom lane
   >  

Hello Tom,

The difference for the Exceptions-scenario between V11.2 and V13.4 that I 
observed was 30x.
It is the difference on V13.4 between the Exceptions and no-exceptions 
scenarios that is 500x+.

Just to clarify.

I am following up with Justin's suggestions and will respond with updated info 
soon.

Thank you!
Laurent Hasson.


RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-21 Thread l...@laurent-hasson.com



   >  -Original Message-
   >  From: Justin Pryzby 
   >  Sent: Saturday, August 21, 2021 15:20
   >  To: Tom Lane 
   >  Cc: l...@laurent-hasson.com; pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  On Sat, Aug 21, 2021 at 02:17:26PM -0400, Tom Lane wrote:
   >  > "l...@laurent-hasson.com"  writes:
   >  > > So you mean that on average, the 4x overhead of exceptions is
   >  around what you'd expect?
   >  >
   >  > Doesn't surprise me any, no.  Exception recovery has to clean up after
   >  > a wide variety of possible errors, with only minimal assumptions about
   >  > what the system state had been.  So it's expensive.  More to the
   >  > point, the overhead's been broadly the same for quite some time.
   >  >
   >  > > As for results in general, yes, your numbers look pretty uniform
   >  across versions. On my end, comparing V11.2 vs V13.4 shows a much
   >  different picture!
   >  >
   >  > I'm baffled why that should be so.  I do not think any of the
   >  > extensions you mention add any exception-recovery overhead,
   >  especially
   >  > not in sessions that haven't used them.
   >  
   >  Laurent, did you install binaries for v13.4 or compile it ?
   >  
   >  What about these ?
   >  
   >  SHOW shared_preload_libraries;
   >  SHOW session_preload_libraries;
   >  SHOW local_preload_libraries;
   >  
   >  Would you try to reproduce the issue with a fresh database:
   >  CREATE DATABASE udftest; ...
   >  
   >  Or a fresh instance created with initdb.
   >  
   >  As I recall, you're running postgres under a windows VM - I'm not sure if
   >  that's relevant.
   >  
   >  --
   >  Justin

Hello Justin,

- I used the standard installer from 
https://www.enterprisedb.com/downloads/postgres-postgresql-downloads for 
Windows X64 and upgraded from 13.3, which itself was pg_upgraded from 11.2.
- Yes, windows VM on VMWARE.
- No entries from shared_preload_libraries, session_preload_libraries or 
local_preload_libraries.
- Version is "PostgreSQL 13.4, compiled by Visual C++ build 1914, 64-bit".
- I created a new database and reran the scenarios without much of a change.
- I think I am going to install a fresh new instance from scratch to see whether 
something odd happened along the upgrade path I took.

Thank you,
Laurent Hasson.







RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-21 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Justin Pryzby 
   >  Sent: Saturday, August 21, 2021 18:17
   >  To: l...@laurent-hasson.com
   >  Cc: Tom Lane ; pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  Could you send SELECT * FROM pg_config() and try to find the CPU
   >  model ?
   >  
   >  I think it's possible the hypervisor is trapping and emulating unhandled
   >  CPU instructions.
   >  
   >  Actually, it would be interesting to see if the performance differs
   >  between
   >  11.2 and 11.13.  It's possible that EDB compiled 11.13 on a newer CPU
   >  (or a newer compiler) than 11.2 was compiled.
   >  
   >  If you test that, it should be on a separate VM, unless the existing data
   >  dir can be restored from backup.  Once you've started a cluster with
   >  updated binaries, you should avoid downgrading the binaries.



Hello all,

OK, I was able to do a clean install of 13.4 on the VM. All stock settings, no 
extensions loaded, pure clean straight out of the install.

create table sampletest (a varchar, b varchar);
-- truncate table sampletest;
insert into sampletest (a, b)
select substr(md5(random()::text), 0, 15), 
(1*random())::integer::varchar
  from generate_series(1,1000000);

CREATE OR REPLACE FUNCTION toFloat(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
EXCEPTION WHEN OTHERS THEN
  RETURN val;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;
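
(A quick sanity check of the function's two code paths; these two calls were not part of the original script:)

select toFloat('3.14', null);  -- clean cast, no exception raised: returns 3.14
select toFloat('x3.14', 0);    -- cast throws, handler returns the default: 0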


explain (analyze,buffers,COSTS,TIMING) select MAX(toFloat(b, null)) as "b" from 
sampletest

Aggregate  (cost=21370.00..21370.01 rows=1 width=4) (actual 
time=1780.561..1780.563 rows=1 loops=1)
  Buffers: shared hit=6387
  ->  Seq Scan on sampletest  (cost=0.00..16370.00 rows=1000000 width=8) 
(actual time=0.053..97.329 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning:
  Buffers: shared hit=36
Planning Time: 2.548 ms
Execution Time: 1810.330 ms


explain (analyze,buffers,COSTS,TIMING) select MAX(toFloat(a, null)) as "a" from 
sampletest

Aggregate  (cost=21370.00..21370.01 rows=1 width=4) (actual 
time=863243.876..863243.877 rows=1 loops=1)
  Buffers: shared hit=6373
  ->  Seq Scan on sampletest  (cost=0.00..16370.00 rows=1000000 width=15) 
(actual time=0.009..301.553 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning:
  Buffers: shared hit=44
Planning Time: 0.469 ms
Execution Time: 863243.911 ms


So I am still able to reproduce this on a different VM and a clean install of 
13.4 ☹
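
(For concreteness: 863243.911 ms vs 1810.330 ms puts the exception path at roughly 477x the clean path on this install, consistent with the 500x+ figure mentioned earlier in the thread.)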


SELECT * FROM pg_config();

BINDIR              C:/PROGRA~1/POSTGR~1/13/bin
DOCDIR              C:/PROGRA~1/POSTGR~1/13/doc
HTMLDIR             C:/PROGRA~1/POSTGR~1/13/doc
INCLUDEDIR          C:/PROGRA~1/POSTGR~1/13/include
PKGINCLUDEDIR       C:/PROGRA~1/POSTGR~1/13/include
INCLUDEDIR-SERVER   C:/PROGRA~1/POSTGR~1/13/include/server
LIBDIR              C:/PROGRA~1/POSTGR~1/13/lib
PKGLIBDIR           C:/PROGRA~1/POSTGR~1/13/lib
LOCALEDIR           C:/PROGRA~1/POSTGR~1/13/share/locale
MANDIR              C:/Program Files/PostgreSQL/13/man
SHAREDIR            C:/PROGRA~1/POSTGR~1/13/share
SYSCONFDIR          C:/Program Files/PostgreSQL/13/etc
PGXS                C:/Program Files/PostgreSQL/13/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE           --enable-thread-safety --enable-nls --with-ldap --with-openssl
                    --with-uuid --with-libxml --with-libxslt --with-icu --with-tcl
                    --with-perl --with-python
CC                  not recorded
CPPFLAGS            not recorded
CFLAGS              not recorded
CFLAGS_SL           not recorded
LDFLAGS             not recorded
LDFLAGS_EX          not recorded
LDFLAGS_SL          not recorded
LIBS                not recorded
VERSION             PostgreSQL 13.4


And here is SYSINFO:

C:\Users\LHASSON>systeminfo

Host Name: PRODDB
OS Name:   Microsoft Windows Server 2012 R2 Standard
OS Version:6.3.9600 N/A Build 9600
OS Manufacturer:   Microsoft Corporation
OS Configuration:  Member Server
OS Build Type: Multiprocessor Free
Original Install Date: 2015-09-19, 18:19:41
System Boot Time:  2021-07-22, 11:45:09
System Manufacturer:   VMware, Inc.
System Model:  VMware Virtual Platform
System Type:   x64-based PC
Processor(s):          4 Processor(s) Installed.
                       [01]: Intel64 Family 6 Model 79 Stepping 1 GenuineIntel ~2397 Mhz
                       [02]: Intel64 Family 6 Model 79 Stepping 1 GenuineIntel ~2397 Mhz
                       [03]: Intel64 Family 6 Model 79 Stepping 1 GenuineIntel ~2397 Mhz
                       [04]: Intel64 Family 6 Model 79 Stepping 1 GenuineIntel ~2397 Mhz
BIOS Version:  Phoenix Technologies LTD 6.00, 2020-05-28
Windows Directory: C:\Windows
System Directory:  C:\Windows\system32
Boot Device:   \Device\HarddiskVolume1
System Locale: en-us;English (United States)
Input Local


RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-22 Thread l...@laurent-hasson.com



   >  -Original Message-
   >  From: Justin Pryzby 
   >  Sent: Sunday, August 22, 2021 11:48
   >  To: Ranier Vilela 
   >  Cc: l...@laurent-hasson.com; Tom Lane ; pgsql-
   >  performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  On Sun, Aug 22, 2021 at 10:50:47AM -0300, Ranier Vilela wrote:
   >  > > Tried to check this with Very Sleepy at Windows 10 (bare metal).
   >  > > Not sure it can help if someone can guide how to test this better?
   >  
   >  > explain (analyze,buffers,COSTS,TIMING) select MAX(toFloat(a, null)) as
   >  > "a" from sampletest;
   >  
   >  Your 100sec result *seems* to reproduce the problem, but it'd be more
   >  clear if you showed the results of both queries (toFloat(a) vs 
toFloat(b)).
   >  Laurent's queries took 800sec vs 2sec.
   >  
   >  > postgres.png (print screen from Very Sleepy) postgres.csv
   >  
   >  This looks useful, thanks.  It seems like maybe win64 builds are very slow
   >  running this:
   >  
   >  exec_stmt_block() /
   >  BeginInternalSubTransaction() /
   >  AbortSubTransaction() /
   >  reschedule_timeouts() /
   >  schedule_alarm() /
   >  setitimer() /
   >  pg_timer_thread() /
   >  WaitForSingleObjectEx ()
   >  
   >  We should confirm whether there's a dramatic regression caused by
   >  postgres source code (and not by compilation environment or windows
   >  version changes).
   >  Test if there's a dramatic difference between v11 and v12, or v12 and
   >  v13.
   >  To be clear, the ~4x difference in v11 between Laurent's "exceptional"
   >  and "nonexceptional" cases is expected.  But the 400x difference in v13
   >  is not.
   >  
   >  If it's due to a change in postgres source code, we should find what
   >  commit caused the regression.
   >  
   >  First, check if v12 is affected.  Right now, we know that v11.2 is ok and
   >  v13.4 is not ok.  Then (unless someone has a hunch where to look), you
   >  could use git bisect to find the culprit commit.
   >  
   >  Git log shows 85 commits affecting those files across the 2 branches -
   >  once we determine whether v12 is affected, that alone eliminates a
   >  significant fraction of the commits to be checked.
   >  
   >  git log --oneline --cherry-pick
   >  origin/REL_11_STABLE...origin/REL_13_STABLE
   >  src/backend/access/transam/xact.c src/backend/port/win32/timer.c
   >  src/backend/utils/misc/timeout.c src/pl/plpgsql/src/pl_exec.c
   >  
   >  --
   >  Justin



So, I have other installs of Postgres I can also test on my laptop. No VM, 
straight install of Windows 10.


PostgreSQL 12.3, compiled by Visual C++ build 1914, 64-bit install
No-exceptions scenario
---
Aggregate  (cost=14778.40..14778.41 rows=1 width=4) (actual 
time=1462.836..1462.837 rows=1 loops=1)
  Buffers: shared hit=6379
  ->  Seq Scan on sampletest  (cost=0.00..11975.60 rows=560560 width=32) 
(actual time=0.020..86.506 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning Time: 0.713 ms
Execution Time: 1463.359 ms

Exceptions scenario
---
I canceled the query after 18 minutes...



PostgreSQL 11.1, compiled by Visual C++ build 1914, 64-bit
No-exceptions scenario
---
Aggregate  (cost=14778.40..14778.41 rows=1 width=4) (actual 
time=1784.915..1784.915 rows=1 loops=1)
  Buffers: shared hit=6377
  ->  Seq Scan on sampletest  (cost=0.00..11975.60 rows=560560 width=32) 
(actual time=0.026..107.194 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning Time: 0.374 ms
Execution Time: 1785.203 ms

Exceptions scenario
---
Aggregate  (cost=14778.40..14778.41 rows=1 width=4) (actual 
time=33891.778..33891.778 rows=1 loops=1)
  Buffers: shared hit=6372
  ->  Seq Scan on sampletest  (cost=0.00..11975.60 rows=560560 width=32) 
(actual time=0.015..171.325 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning Time: 0.090 ms
Execution Time: 33891.806 ms
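
(Putting numbers on these results: 11.1 finishes the exception scenario in about 34 seconds on this laptop, while 12.3 had still not finished after 18 minutes, a slowdown of at least 30x. That brackets the regression to something that landed during the v12 development cycle.)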







RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-22 Thread l...@laurent-hasson.com



   >  -Original Message-
   >  From: Tom Lane 
   >  Sent: Sunday, August 22, 2021 13:51
   >  To: Justin Pryzby 
   >  Cc: Ranier Vilela ; l...@laurent-hasson.com;
   >  pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  Justin Pryzby  writes:
   >  > This looks useful, thanks.  It seems like maybe win64 builds are very
   >  > slow running this:
   >  
   >  > exec_stmt_block() /
   >  > BeginInternalSubTransaction() /
   >  > AbortSubTransaction() /
   >  > reschedule_timeouts() /
   >  > schedule_alarm() /
   >  > setitimer() /
   >  > pg_timer_thread() /
   >  > WaitForSingleObjectEx ()
   >  
   >  Hmm ... we should not be there unless there are active timeout events,
   >  which there aren't by default.  I wonder whether either Ranier or
   >  Laurent have statement_timeout or some similar option enabled.
   >  
   >  I tried setting statement_timeout = '1 min' just to see if that would 
affect
   >  the results.  It does, but only incrementally on my Linux box (on v13, the
   >  exception-causing query slows from ~13sec to ~14sec).  It's possible that
   >  our Windows version of
   >  setitimer() is far slower, but that doesn't make a lot of sense really 
--- the
   >  client side of that just briefly takes a critical section.  It shouldn't 
be
   >  blocking.
   >  
   >  Also, the Windows version (src/backend/port/win32/timer.c) hasn't
   >  changed at all since before v11.  So even if it's slow, that doesn't tell 
us
   >  what changed.
   >  
   >  There is a patch in v14 (09cf1d522) that drastically reduces the rate at
   >  which we make setitimer() calls, which would likely be enough to fix any
   >  performance problem that may exist here.
   >  But it's still unclear what's different between v11 and v13.
   >  
   >regards, tom lane


Hello Tom,

On both my clean 13.4 install and current 11.2 install, I have
#statement_timeout = 0  # in milliseconds, 0 is disabled

Note that the 13.4 clean install I gave last measurements for has all stock 
settings.

Thank you,
Laurent.





RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-22 Thread l...@laurent-hasson.com



One more fresh install, of 11.13 this time, and the issue is not there... 😊

Aggregate  (cost=14778.40..14778.41 rows=1 width=4) (actual 
time=1963.573..1963.574 rows=1 loops=1)
  Buffers: shared hit=6377
  ->  Seq Scan on sampletest  (cost=0.00..11975.60 rows=560560 width=32) 
(actual time=0.027..110.896 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning Time: 0.427 ms
Execution Time: 1963.981 ms


Aggregate  (cost=21370.00..21370.01 rows=1 width=4) (actual 
time=31685.853..31685.853 rows=1 loops=1)
  Buffers: shared hit=6370
  ->  Seq Scan on sampletest  (cost=0.00..16370.00 rows=1000000 width=15) 
(actual time=0.029..180.664 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning Time: 0.092 ms
Execution Time: 31685.904 ms

I am still experiencing a larger slowdown in the "with-exceptions" scenario, 
about 16x, compared to the measurements you have all produced. But at least 
it's manageable compared to the multi-100x times.

So, now, in summary:

- I have tried V13.4, V12.3, 11.13, 11.2, 11.1 on several Windows VMs and my 
personal laptop (no VM).
- All V11.x seem to behave uniformly.
- Starting with 12.3, I am experiencing the major slowdown in the "with 
exceptions" scenario.


So, I was thinking: a lot of your intuitions point towards an issue with the 
compiler used to build the Winx64 version... But is it possible that the JIT is 
getting in there and making things weird? Given that it's a major change in V12, 
and that this is when I start to see the issue pop up, I figured it might be 
another avenue to look into.

Thank you,
Laurent Hasson.

  




RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-22 Thread l...@laurent-hasson.com



   >  -Original Message-
   >  From: Tom Lane 
   >  Sent: Sunday, August 22, 2021 15:24
   >  To: l...@laurent-hasson.com
   >  Cc: Justin Pryzby ; Ranier Vilela
   >  ; pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  "l...@laurent-hasson.com"  writes:
   >  > So, now, in summary:
   >  
   >  > - I have tried V13.4, V12.3, 11.13, 11.2, 11.1 on several Windows VMs
   >  and my personal laptop (no VM).
   >  > - All V11.x seem to behave uniformly.
   >  > - Starting with 12.3, I am experiencing the major slowdown in the
   >  "with exceptions" scenario.
   >  
   >  Interesting.  There's no meaningful difference between v11 and v12 as
   >  far as timeout handling goes, so I'm starting to think that that's a red
   >  herring.
   >  
   >  (Although, after having done some web-searching, I do wonder why
   >  timer.c is using a manual-reset event.  It looks like auto-reset would
   >  work just as well with less code, and I found some suggestions that it
   >  might perform better.)
   >  
   >  > So, I was thinking about stuff and a lot of your intuitions seem to 
drive
   >  towards an issue with the compiler used to compile the Winx64
   >  version... But is it possible that the JIT is getting in there and making
   >  things weird? Given that it's a major change in V12 and this is when I am
   >  starting to see the issue popup, I figured it might be another avenue to
   >  look into?
   >  
   >  Hm, is JIT even enabled in your build?  If so, does setting jit = 0 change
   >  anything?
   >  
   >regards, tom lane

Hello Tom,

I just ran the test with jit=off in the config and restarted the server. No 
change on 13.4. I'd think that the query cost as per the planner would be too 
small to kick in the JIT, but I thought to check anyway. Doesn't seem to be the 
cause.

Thanks,
Laurent.
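
(For anyone replaying this check, the same toggle can be tried per-session without editing the config file; a minimal sketch, assuming the sampletest setup from earlier:)

set jit = off;
show jit;  -- confirm it reads 'off'
explain (analyze, buffers) select MAX(toFloat(a, null)) as "a" from sampletest;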







RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-22 Thread l...@laurent-hasson.com





Also Tom,

I do have a Linux install of 13.3, and things work beautifully, so this is 
definitely a Windows thing here that started in V12.

No exceptions
-
Aggregate  (cost=21370.00..21370.01 rows=1 width=4) (actual 
time=1796.311..1796.313 rows=1 loops=1)
  Buffers: shared hit=6370
  ->  Seq Scan on sampletest  (cost=0.00..16370.00 rows=1000000 width=8) 
(actual time=0.006..113.720 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning:
  Buffers: shared hit=5
Planning Time: 0.121 ms
Execution Time: 1796.346 ms

With Exceptions
--
Aggregate  (cost=14778.40..14778.41 rows=1 width=4) (actual 
time=6355.051..6355.052 rows=1 loops=1)
  Buffers: shared hit=6373
  ->  Seq Scan on sampletest  (cost=0.00..11975.60 rows=560560 width=32) 
(actual time=0.011..163.499 rows=1000000 loops=1)
Buffers: shared hit=6370
Planning Time: 0.064 ms
Execution Time: 6355.077 ms





RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-22 Thread l...@laurent-hasson.com



   >  -Original Message-
   >  From: Tom Lane 
   >  Sent: Sunday, August 22, 2021 16:11
   >  To: l...@laurent-hasson.com
   >  Cc: Justin Pryzby ; Ranier Vilela
   >  ; pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  "l...@laurent-hasson.com"  writes:
   >  > I do have a Linux install of 13.3, and things work beautifully, so this 
is
   >  definitely a Windows thing here that started in V12.
   >  
   >  It's good to have a box around it, but that's still a pretty large box 
:-(.
   >  
   >  I'm hoping that one of our Windows-using developers will see if they can
   >  reproduce this, and if so, try to bisect where it started.
   >  Not sure how to make further progress without that.
   >  
   >regards, tom lane

Hello Tom,

If there is any way I can help further... I am definitely not able to do a dev 
environment and local build, but if we have a windows developer reproducing the 
issue between 11 and 12, then that should help. If someone makes a debug build 
available to me, I can provide additional help based on that.

That being said, do you have any suggestion for how I could circumvent the issue 
altogether? Is there a way I could convert a string to some type (integer, 
float, date...) without exceptions, returning a default value on failure? Maybe 
there is a way to do this that avoids exception handling altogether? Or use 
something other than plpgsql? I am always under the impression that plpgsql is 
the best-performing option.

I have seen regex-based options out there, but none being fully satisfying for 
floating points in particular.

Thank you,
Laurent.







RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-22 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Andrew Dunstan 
   >  Sent: Sunday, August 22, 2021 17:27
   >  To: Tom Lane ; l...@laurent-hasson.com
   >  Cc: Justin Pryzby ; Ranier Vilela
   >  ; pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  
   >  On 8/22/21 4:11 PM, Tom Lane wrote:
   >  > "l...@laurent-hasson.com"  writes:
   >  >> I do have a Linux install of 13.3, and things work beautifully, so 
this is
   >  definitely a Windows thing here that started in V12.
   >  > It's good to have a box around it, but that's still a pretty large box
   >  > :-(.
   >  >
   >  > I'm hoping that one of our Windows-using developers will see if they
   >  > can reproduce this, and if so, try to bisect where it started.
   >  > Not sure how to make further progress without that.
   >  >
   >  >
   >  
   >  
   >  Can do. Assuming the assertion that it started in Release 12 is correct, I
   >  should be able to find it by bisecting between the branch point for 12
   >  and the tip of that branch. That's a little over 20 probes by my
   >  calculation.
   >  
   >  
   >  cheers
   >  
   >  
   >  andrew
   >  
   >  
   >  --
   >  Andrew Dunstan
   >  EDB: https://www.enterprisedb.com


I tried it on 11.13 and 12.3. Is there a place where I could download 12.1 and 
12.2 and test that? Is it worth it or you think you have all you need?

Thanks,
Laurent.



RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-22 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Justin Pryzby 
   >  Sent: Sunday, August 22, 2021 20:43
   >  To: Ranier Vilela 
   >  Cc: l...@laurent-hasson.com; Tom Lane ; pgsql-
   >  performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  On Sun, Aug 22, 2021 at 08:44:34PM -0300, Ranier Vilela wrote:
   >  > > If there is any way I can help further... I am definitely not able
   >  > > to do a dev environment and local build, but if we have a windows
   >  > > developer reproducing the issue between 11 and 12, then that
   >  should
   >  > > help. If someone makes a debug build available to me, I can provide
   >  > > additional help based on that.
   >  >
   >  > Please, download from this link (Google Drive):
   >  >
   >  > https://drive.google.com/file/d/13kPbNmk54lR6t-lwcwi-
   >  63UdM55sA27t/view
   >  > ?usp=sharing
   >  
   >  Laurent gave a recipe to reproduce the problem, and you seemed to be
   >  able to reproduce it, so I think Laurent's part is done.  The burden now
   >  lies with postgres developers to isolate the issue, and Andrew said he
   >  would bisect to look for the culprit commit.
   >  
   >  --
   >  Justin


Hello Ranier,
I am not sure what to do with that build. I am a Java/JavaScript guy these 
days. I haven't coded C/C++ in over 15 years now and I don't have a debugging 
environment 😊 If I can run the scenario I created and get a log file, that I 
can do if that helps.

Justin, I think I agree with you although I am concerned that none of you were 
able to truly reproduce the results I have now reproduced on plain base-line 
installs on 2 VMs (Windows Server 2012) and a laptop (Windows 10 pro), across 
multiple versions of the installer (11, 12 and 13).

In any case, I'll do my best to help. If you think you have a fix and it's in 
one dll or exe and I can just manually patch a 13.4 install and test again, 
I'll do that with pleasure.

Thank you,
Laurent.




RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-26 Thread l...@laurent-hasson.com
Hello all,

Any update on this issue?

Thank you!
Laurent.




RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-26 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Andrew Dunstan 
   >  Sent: Thursday, August 26, 2021 12:39
   >  To: l...@laurent-hasson.com; Justin Pryzby ;
   >  Ranier Vilela 
   >  Cc: Tom Lane ; pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  
   >  On 8/26/21 10:47 AM, l...@laurent-hasson.com wrote:
   >  > Hello all,
   >  >
   >  > Any update on this issue?
   >  
   >  
   >  
   >  Please don't top-post.
   >  
   >  
   >  We are working on the issue. Please be patient.
   >  
   >  
   >  cheers
   >  
   >  
   >  andrew
   >  
   >  --
   >  Andrew Dunstan
   >  EDB: https://www.enterprisedb.com


OK... Outlook automatically top-posts and I forgot.

I am being patient.

Thanks,
Laurent.



RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-28 Thread l...@laurent-hasson.com



   >  -Original Message-
   >  From: Tom Lane 
   >  Sent: Friday, August 27, 2021 13:43
   >  To: l...@laurent-hasson.com
   >  Cc: Justin Pryzby ; Ranier Vilela
   >  ; pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  "l...@laurent-hasson.com"  writes:
   >  > That being said, do you have any suggestion how I could circumvent
   >  the
   >  > issue altogether?
   >  
   >  Based on Andrew's report, it seems like you might be able to work
   >  around it for the time being by disabling message translations, i.e.
   >SET lc_messages = 'C';
   >  Even if that's not acceptable in your work environment, it would be
   >  useful to verify that you see an improvement from it.
   >  
   >regards, tom lane



SET lc_messages = 'C';
drop table sampletest;
create table sampletest (a varchar, b varchar);
insert into sampletest (a, b)
select substr(md5(random()::text), 0, 15), 
(1*random())::integer::varchar
  from generate_series(1,100000);

CREATE OR REPLACE FUNCTION toFloat(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
EXCEPTION WHEN OTHERS THEN
  RETURN val;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;

show lc_messages; -- OK 'C'







RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-28 Thread l...@laurent-hasson.com



   >  -Original Message-
   >  From: Tom Lane 
   >  Sent: Friday, August 27, 2021 13:43
   >  To: l...@laurent-hasson.com
   >  Cc: Justin Pryzby ; Ranier Vilela
   >  ; pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  "l...@laurent-hasson.com"  writes:
   >  > That being said, do you have any suggestion how I could circumvent
   >  the
   >  > issue altogether?
   >  
   >  Based on Andrew's report, it seems like you might be able to work
   >  around it for the time being by disabling message translations, i.e.
   >SET lc_messages = 'C';
   >  Even if that's not acceptable in your work environment, it would be
   >  useful to verify that you see an improvement from it.
   >  
   >regards, tom lane

Hello Tom, I hit the send button accidentally.


SET lc_messages = 'C';
drop table sampletest;
create table sampletest (a varchar, b varchar);
insert into sampletest (a, b)
select substr(md5(random()::text), 0, 15), 
(1*random())::integer::varchar
  from generate_series(1,100000);

CREATE OR REPLACE FUNCTION toFloat(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
EXCEPTION WHEN OTHERS THEN
  RETURN val;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;

show lc_messages; --> OK 'C'

explain (analyze,buffers,COSTS,TIMING) 
select MAX(toFloat(b, null)) as "b" from sampletest

Aggregate  (cost=2137.00..2137.01 rows=1 width=4) (actual time=175.551..175.552 
rows=1 loops=1)
  Buffers: shared hit=637
  ->  Seq Scan on sampletest  (cost=0.00..1637.00 rows=100000 width=8) (actual 
time=0.014..9.270 rows=100000 loops=1)
Buffers: shared hit=637
Planning Time: 0.087 ms
Execution Time: 175.600 ms


explain (analyze,buffers,COSTS,TIMING) 
select MAX(toFloat(a, null)) as "a" from sampletest

Aggregate  (cost=2137.00..2137.01 rows=1 width=4) (actual 
time=88031.549..88031.551 rows=1 loops=1)
  Buffers: shared hit=637
  ->  Seq Scan on sampletest  (cost=0.00..1637.00 rows=100000 width=15) (actual 
time=0.008..34.494 rows=100000 loops=1)
Buffers: shared hit=637
Planning:
  Buffers: shared hit=4
Planning Time: 0.171 ms
Execution Time: 88031.585 ms

Doesn't seem to make a difference unless I misunderstood what you were asking 
for regarding the locale?

Thank you,
Laurent.





RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-28 Thread l...@laurent-hasson.com



   >  -Original Message-
   >  From: Tom Lane 
   >  Sent: Saturday, August 28, 2021 15:51
   >  To: l...@laurent-hasson.com
   >  Cc: Andrew Dunstan ; Justin Pryzby
   >  ; Ranier Vilela ; pgsql-
   >  performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  "l...@laurent-hasson.com"  writes:
   >  > SET lc_messages = 'C';
   >  > show lc_messages; --> OK 'C'
   >  
   >  > explain (analyze,buffers,COSTS,TIMING) select MAX(toFloat(b, null)) as
   >  > "b" from sampletest ...
   >  > Execution Time: 175.600 ms
   >  
   >  > explain (analyze,buffers,COSTS,TIMING) select MAX(toFloat(a, null)) as
   >  > "a" from sampletest ...
   >  > Execution Time: 88031.585 ms
   >  
   >  > Doesn't seem to make a difference unless I misunderstood what you
   >  were asking for regarding the locale?
   >  
   >  Hmm.  This suggests that whatever effect Andrew found with NLS is
   >  actually not the explanation for your problem.  So I'm even more
   >  confused than before.
   >  
   >regards, tom lane

I am so sorry to hear... So, curious on my end: is this something that you are 
not able to reproduce on your environments? On my end, I did reproduce it on 
different VMs and my local laptop, across windows Server 2012 and Windows 10, 
so I'd figure it would be pretty easy to reproduce?

Thank you!
Laurent.




RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-29 Thread l...@laurent-hasson.com
>Sure, there's no question that message translation will have *some* cost.
>But on my machine it is an incremental tens-of-percent kind of cost,
>and that is the result you're getting as well.  So it's not very clear
>where these factor-of-several-hundred differences are coming from.
>A hypothesis that has not yet come up, may be some defect in the code 
>generation, 
>by the previous msvc compiler used, because in all my tests I always use the 
>latest version, 
>which has several corrections in the code generation part.



Hello all,

I don't think this reproduces the issue I experience. I saw a difference of 
around 500x! What you see is 5x, which according to Tom would be expected for 
an execution path involving exceptions. And NLS should have an impact as well 
since more work happens. From the numbers you published, I see 10-15% change 
which again would be expected?

I cannot think of anything that would be specific to me with regards to this 
scenario given that I have tried it in quite a few environments from plain 
stock installs. Until one of you is able to reproduce this, you may be chasing 
other issues. 

Is it possible that the client I am using or the way I am creating the test 
database might affect this scenario? I use DBeaver and use the default settings 
to create the database:
- default encoding: UTF8
- collate: English_United States.1252
- ctype: English_United States.1252
- default tablespace: pg_default

Settings:
Name    Value   Unit
allow_system_table_mods off [NULL]
application_name    DBeaver 21.1.3 - Main   [NULL]
archive_cleanup_command [NULL]
archive_command (disabled)  [NULL]
archive_modeoff [NULL]
archive_timeout 0   s
array_nulls on  [NULL]
authentication_timeout  60  s
autovacuum  on  [NULL]
autovacuum_analyze_scale_factor 0.1 [NULL]
autovacuum_analyze_threshold50  [NULL]
autovacuum_freeze_max_age   200000000   [NULL]
autovacuum_max_workers  3   [NULL]
autovacuum_multixact_freeze_max_age 400000000   [NULL]
autovacuum_naptime  60  s
autovacuum_vacuum_cost_delay2   ms
autovacuum_vacuum_cost_limit-1  [NULL]
autovacuum_vacuum_insert_scale_factor   0.2 [NULL]
autovacuum_vacuum_insert_threshold  1000[NULL]
autovacuum_vacuum_scale_factor  0.2 [NULL]
autovacuum_vacuum_threshold 50  [NULL]
autovacuum_work_mem -1  kB
backend_flush_after 0   8kB
backslash_quote safe_encoding   [NULL]
backtrace_functions [NULL]
bgwriter_delay  200 ms
bgwriter_flush_after0   8kB
bgwriter_lru_maxpages   100 [NULL]
bgwriter_lru_multiplier 2   [NULL]
block_size  8192[NULL]
bonjour off [NULL]
bonjour_name[NULL]
bytea_output    hex [NULL]
check_function_bodies   on  [NULL]
checkpoint_completion_target0.5 [NULL]
checkpoint_flush_after  0   8kB
checkpoint_timeout  300 s
checkpoint_warning  30  s
client_encoding UTF8    [NULL]
client_min_messages notice  [NULL]
cluster_name        [NULL]
commit_delay0   [NULL]
commit_siblings 5   [NULL]
config_file C:/Program Files/PostgreSQL/13/data/postgresql.conf [NULL]
constraint_exclusionpartition   [NULL]
cpu_index_tuple_cost0.005   [NULL]
cpu_operator_cost   0.0025  [NULL]
cpu_tuple_cost  0.01[NULL]
cursor_tuple_fraction   0.1 [NULL]
data_checksums  off [NULL]
data_directory  C:/Program Files/PostgreSQL/13/data [NULL]
data_directory_mode 700 [NULL]
data_sync_retry off [NULL]
DateStyle   ISO, YMD    [NULL]
db_user_namespace   off [NULL]
deadlock_timeout1000ms
debug_assertionsoff [NULL]
debug_pretty_print  on  [NULL]
debug_print_parse   off [NULL]
debug_print_planoff [NULL]
debug_print_rewritten   off [NULL]
default_statistics_target   100 [NULL]
default_table_access_method heap[NULL]
default_tablespace  [NULL]
default_text_search_config  pg_catalog.english  [NULL]
default_transaction_deferrable  off [NULL]
default_transaction_isolation   read committed  [NULL]
default_transaction_read_only   off [NULL]
dynamic_library_path    $libdir [NULL]
dynamic_shared_memory_type  windows [NULL]
effective_cache_size524288  8kB
effective_io_concurrency0   [NULL]
enable_bitmapscan   on  [NULL]
enable_gathermerge  on  [NULL]
enable_hashagg  on  [NULL]
enable_hashjoin on  [NULL]
enable_incremental_sort on  [NULL]
enable_indexonlyscanon  [NULL]
enable_indexscanon  [NULL]
enable_material on  [NULL]
enable_mergejoinon  [NULL]
enable_nestloop on  [NULL]
enable_parallel_append  on  [NULL]
enable_parallel_hashon  [NULL]
enable_partition_pruning  

RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-29 Thread l...@laurent-hasson.com


From: Ranier Vilela  
Sent: Sunday, August 29, 2021 14:20
To: l...@laurent-hasson.com
Cc: Tom Lane ; Andrew Dunstan ; Justin 
Pryzby ; pgsql-performa...@postgresql.org
Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

On Sun, Aug 29, 2021 at 13:03, l...@laurent-hasson.com wrote:
>Sure, there's no question that message translation will have *some* cost.
>But on my machine it is an incremental tens-of-percent kind of cost,
>and that is the result you're getting as well.  So it's not very clear
>where these factor-of-several-hundred differences are coming from.
>A hypothesis that has not yet come up, may be some defect in the code 
>generation, 
>by the previous msvc compiler used, because in all my tests I always use the 
>latest version, 
>which has several corrections in the code generation part.



Hello all,

I don't think this reproduces the issue I experience. I saw a difference of 
around 500x! What you see is 5x, which according to Tom would be expected for 
an execution path involving exceptions. And NLS should have an impact as well 
since more work happens. From the numbers you published, I see 10-15% change 
which again would be expected?
Yes, it seems to me that this is expected for NLS usage.
 

I cannot think of anything that would be specific to me with regards to this 
scenario given that I have tried it in quite a few environments from plain 
stock installs. Until one of you is able to reproduce this, you may be chasing 
other issues. 
I think I'm unable to reproduce the issue, because I didn't use any plain stock 
installs.
Postgres env tests here, is a fresh build with the latest msvc.
I have no intention of repeating the issue, with something exactly the same as 
your environment, 
but with a very different environment.

Can you show the version of Postgres in your Windows 10 env that got this 
result?
Planning Time: 0.171 ms
Execution Time: 88031.585 ms

regards,
Ranier Vilela



---
Hello Ranier,

All my tests were on latest 13.4 install I downloaded from the main site.

SELECT version();
PostgreSQL 13.4, compiled by Visual C++ build 1914, 64-bit


As per the following:

> I think I'm unable to reproduce the issue, because I didn't use any plain 
> stock installs.
> Postgres env tests here, is a fresh build with the latest msvc.
> I have no intention of repeating the issue, with something exactly the same 
> as your environment, 
> but with a very different environment.

I am not sure I understand. Are you saying the standard installs may be faulty? 
A stock install from the stock installer on a Windows machine should take 10 
minutes tops. If it doesn't reproduce the issue out of the box, then at least 
I'd have confirmation that there may be something weird that I am somehow 
repeating across all the installs I have performed???

Thank you,
Laurent.




RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-29 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Tom Lane 
   >  Sent: Sunday, August 29, 2021 12:19
   >  To: l...@laurent-hasson.com
   >  Cc: Ranier Vilela ; Andrew Dunstan
   >  ; Justin Pryzby ; pgsql-
   >  performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  "l...@laurent-hasson.com"  writes:
   >  > Is it possible that the client I am using or the way I am creating the 
test
   >  database might affect this scenario? I use DBeaver and use the default
   >  settings to create the database:
   >  > - default encoding: UTF8
   >  > - collate: English_United States.1252
   >  > - ctype: English_United States.1252
   >  
   >  Yeah, I was thinking of quizzing you about that.  I wonder whether
   >  something is thinking it needs to transcode to WIN1252 encoding and
   >  then back to UTF8, based on the .1252 property of the LC_XXX settings.
   >  That shouldn't account for any 500X factor either, but we're kind of
   >  grasping at straws here.
   >  
   >  Does Windows have any locale choices that imply UTF8 encoding
   >  exactly, and if so, do your results change when using that?  
Alternatively,
   >  try creating a database with WIN1252 encoding and those locale
   >  settings.
   >  
   >regards, tom lane

Yeah, grasping at straws... and no material changes 😊 This is mystifying.

show lc_messages;
-- English_United States.1252

create table sampletest (a varchar, b varchar);
insert into sampletest (a, b)
select substr(md5(random()::text), 0, 15), 
(1*random())::integer::varchar
  from generate_series(1,100000);

CREATE OR REPLACE FUNCTION toFloat(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
EXCEPTION WHEN OTHERS THEN
  RETURN val;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;

explain (analyze,buffers,COSTS,TIMING) 
select MAX(toFloat(a, null)) as "a" from sampletest
--Aggregate  (cost=1477.84..1477.85 rows=1 width=4) (actual 
time=89527.032..89527.033 rows=1 loops=1)
--  Buffers: shared hit=647
--  ->  Seq Scan on sampletest  (cost=0.00..1197.56 rows=56056 width=32) 
(actual time=0.024..37.811 rows=100000 loops=1)
--Buffers: shared hit=637
--Planning:
--  Buffers: shared hit=24
--Planning Time: 0.347 ms
--Execution Time: 89527.501 ms

explain (analyze,buffers,COSTS,TIMING) 
select MAX(toFloat(b, null)) as "b" from sampletest
--Aggregate  (cost=2137.00..2137.01 rows=1 width=4) (actual 
time=186.605..186.606 rows=1 loops=1)
--  Buffers: shared hit=637
--  ->  Seq Scan on sampletest  (cost=0.00..1637.00 rows=100000 width=8) 
(actual time=0.008..9.679 rows=100000 loops=1)
--Buffers: shared hit=637
--Planning:
--  Buffers: shared hit=4
--Planning Time: 0.339 ms
--Execution Time: 186.641 ms


At this point, I am not sure how to proceed except to rethink that toFloat() 
function and the many other places where we use exceptions. We get such dirty 
data that I need a "safe" way to convert a string to a float without throwing 
an exception. BTW, I tried other combinations in case there were some weird 
interactions with the ::REAL conversion operator, but nothing made any 
difference. Could you recommend another approach off the top of your head? I 
could use regexes for testing, etc... Or maybe there is another option, like a 
no-throw conversion that's built in or available in some extension you may know 
of? Like the "SAFE." prefix in BigQuery.

Thank you,
Laurent.
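
(For completeness, the other experiment Tom suggested earlier, a database with WIN1252 encoding matching those locale settings to rule out transcoding, would look roughly like this; the database name is illustrative:)

CREATE DATABASE udftest_1252
  ENCODING 'WIN1252'
  LC_COLLATE 'English_United States.1252'
  LC_CTYPE   'English_United States.1252'
  TEMPLATE template0;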





RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4 (workarounds)

2021-08-29 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Justin Pryzby 
   >  Sent: Sunday, August 29, 2021 23:17
   >  To: Pavel Stehule 
   >  Cc: l...@laurent-hasson.com; Tom Lane ; Ranier
   >  Vilela ; Andrew Dunstan
   >  ; pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4 (workarounds)
   >  
   >  On Mon, Aug 30, 2021 at 04:43:23AM +0200, Pavel Stehule wrote:
   >  > po 30. 8. 2021 v 2:44 odesílatel l...@laurent-hasson.com napsal:
   >  > > At this point, I am not sure how to proceed except to rethink that
   >  > > toFloat() function and many other places where we use exceptions.
   >  We
   >  > > get such dirty data that I need a "safe" way to convert a string to
   >  > > float without throwing an exception. BTW, I tried other
   >  combinations
   >  > > in case there may have been some weird interactions with the ::REAL
   >  > > conversion operator, but nothing made any change. Could you
   >  > > recommend another approach off the top of your head? I could use
   >  > > regexes for testing etc... Or maybe there is another option like a
   >  > > no-throw conversion that's built in or in some extension that you
   >  may know of? Like the "SAFE." Prefix in BigQuery.
   >  >
   >  > CREATE OR REPLACE FUNCTION safe_to_double_precision(t text)
   >  RETURNS
   >  > double precision AS $$ BEGIN
   >  >   IF $1 SIMILAR TO '[+-]?([0-9]*[.])?[0-9]+' THEN
   >  > RETURN $1::double precision;
   >  >   ELSE
   >  > RETURN NULL;
   >  >   END IF;
   >  > END;
   >  > $$ LANGUAGE plpgsql IMMUTABLE STRICT;
   >  
   >  This tries to use a regex to determine if something is a "Number" or not.
   >  Which has all the issues enumerated in painful detail by long answers on
   >  stack overflow, and other wiki/blog/forums.
   >  
   >  Rather than trying to define Numbers using regex, I'd try to avoid only
   >  the most frequent exceptions and get 90% of the performance back.  I
   >  don't know what your data looks like, but you might try things like this:
   >  
   >  IF $1 IS NULL THEN RETURN $2
   >  ELSE IF $1 ~ '^$' THEN RETURN $2
   >  ELSE IF $1 ~ '[[:alpha:]]{2}' THEN RETURN $2
   >  ELSE IF $1 !~ '[[:digit:]]' THEN RETURN $2
   >  BEGIN
   >     RETURN $1::float;
   >  EXCEPTION WHEN OTHERS THEN
   >     RETURN $2;
   >  END;
   >  
   >  You can check the stackoverflow page for ideas as to what kind of thing
   >  to reject, but it may depend mostly on your data (what is the most
   >  common string?
   >  The most common exceptional string?).
   >  
   >  I think it's possible that could even be *faster* than the original, 
since it
   >  avoids the exception block for values which are for sure going to cause
   >  an exception anyway.  It might be that using alternation (|) is faster (if
   >  less
   >  readable) than using a handful of IF branches.
   >  
   >  --
   >  Justin

That's exactly where my head was at. I have looked at different ways to test 
for a floating-point number and recognize the challenge 😊

The data is very messy, with people entering data by hand. We have seen alpha 
and punctuation characters, and people copy/pasting from Excel, so large 
numbers get the "e" notation. It's a total mess. The application that authors 
that data is a piece of crap and we have no chance to change it, unfortunately. 
Short of rolling out an ETL process, which is painful for the way our data 
comes in, I need an in-db solution.
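
For the archive, here is the direction I am leaning toward, along the lines of Justin's suggestion (a sketch only: the name, the regex, and the fallback are mine and would need testing against our data):

CREATE OR REPLACE FUNCTION toFloat2(str varchar, val real)
RETURNS real AS $$
BEGIN
  -- Cheap pre-filter: accept plain and scientific notation (the "e" values
  -- from Excel copy/pastes), so most junk never reaches the exception
  -- machinery at all.
  IF str IS NULL
     OR str !~ '^[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?$' THEN
    RETURN val;
  END IF;
  BEGIN
    RETURN str::real;
  EXCEPTION WHEN OTHERS THEN
    -- Still reachable, e.g. for out-of-range values like '1e999'.
    RETURN val;
  END;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;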

Thank you!
Laurent.


RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-30 Thread l...@laurent-hasson.com

   >  I tried this scenario using edb's 13.3 x64 install:
   >  
   >  postgres=# select version();
   >                              version
   >  -------------------------------------------------------------
   >   PostgreSQL 13.3, compiled by Visual C++ build 1914, 64-bit
   >  (1 row)
   >  
   >  
   >  postgres=# \l postgres
   >                            List of databases
   >     Name   |  Owner   | Encoding | Collate | Ctype | Access privileges
   >  ----------+----------+----------+---------+-------+-------------------
   >   postgres | postgres | UTF8     | C       | C     |
   >  (1 row)
   >  
   >  postgres=# explain (analyze,buffers,COSTS,TIMING)
   >  postgres-# select MAX(toFloat(a, null)) as "a" from sampletest;
   >                                 QUERY PLAN
   >  ----------------------------------------------------------------------------
   >   Aggregate  (cost=2137.00..2137.01 rows=1 width=4) (actual time=44962.279..44962.280 rows=1 loops=1)
   >     Buffers: shared hit=657
   >     ->  Seq Scan on sampletest  (cost=0.00..1637.00 rows=100000 width=15) (actual time=0.009..8.900 rows=100000 loops=1)
   >           Buffers: shared hit=637
   >   Planning:
   >     Buffers: shared hit=78
   >   Planning Time: 0.531 ms
   >   Execution Time: 44963.747 ms
   >  (8 rows)
   >  
   >  and with locally compiled REL_13_STABLE's head on the same machine:
   >  
   >  rjuju=# select version();
   >                              version
   >  -------------------------------------------------------------
   >   PostgreSQL 13.4, compiled by Visual C++ build 1929, 64-bit
   >  (1 row)
   >  
   >  rjuju=# \l rjuju
   >                      List of databases
   >   Name  | Owner | Encoding | Collate | Ctype | Access privileges
   >  -------+-------+----------+---------+-------+-------------------
   >   rjuju | rjuju | UTF8     | C       | C     |
   >  (1 row)
   >  
   >  rjuju=# explain (analyze,buffers,COSTS,TIMING)
   >  rjuju-# select MAX(toFloat(a, null)) as "a" from sampletest;
   >                                 QUERY PLAN
   >  ---------------------------------------------------------------------------
   >   Aggregate  (cost=1477.84..1477.85 rows=1 width=4) (actual time=460.334..460.334 rows=1 loops=1)
   >     Buffers: shared hit=646 read=1
   >     ->  Seq Scan on sampletest  (cost=0.00..1197.56 rows=56056 width=32) (actual time=0.010..7.612 rows=100000 loops=1)
   >           Buffers: shared hit=637
   >   Planning:
   >     Buffers: shared hit=20 read=1
   >   Planning Time: 0.125 ms
   >   Execution Time: 460.527 ms
   >  (8 rows)
   >  
   >  Note that I followed [1], so I simply used "build" and "install".  I have no
   >  idea what is done by default and if NLS is included or not.
   >  
   >  So if default build on windows has NLS included, it probably means that
   >  either there's something specific on edb's build (I have no idea how their
   >  build is produced) or their version of msvc is responsible for that.
   >  
   >  [1]: https://www.postgresql.org/docs/current/install-windows-full.html#id-1.6.4.8.10



---

Hello,

So you are seeing a 100x difference.

   >   Execution Time: 44963.747 ms
   >   Execution Time: 460.527 ms

I see on https://www.postgresql.org/download/ that there is a different 
installer from 2ndQuadrant. I am going to try that one and see what I come up 
with. Are there any other "standard" distros of Postgres that I could try out?

Additionally, is there a DLL or EXE file that you could make available to me 
that I could simply patch onto my current install to see if it makes any 
difference? Or a zip of the lib/bin folders? I found out I could download 
Visual Studio Community Edition, so I am trying this, but I may not have the 
time to get through a build any time soon given my unfamiliarity with the 
process. I'll follow Ranier's steps and see if that gets me somewhere.

Thank you,
Laurent.



RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-30 Thread l...@laurent-hasson.com

   >  I see on https://www.postgresql.org/download/ that there is a different
   >  installer from 2ndQuadrant. I am going to try that one and see what I
   >  come up with. Are there any other "standard" distros of Postgres that I
   >  could try out?
   >  
   >  I found out I could download Visual Studio community edition so I am
   >  trying this, but may not have the time to get through a build any time
   >  soon as per my unfamiliarity with the process. I'll follow Ranier's steps
   >  and see if that gets me somewhere.
   >  
   >  Thank you,
   >  Laurent.


Hello all,

I think I had a breakthrough. I tried to create a local build and wasn't able 
to. But I downloaded the 2nd Quadrant installer and the issue disappeared!!! I 
think this is proof that it's not my personal environment, nor something 
intrinsic in the codebase, but definitely something in the standard EDB 
installer.


create table sampletest (a varchar, b varchar);
insert into sampletest (a, b)
select substr(md5(random()::text), 0, 15),
       (100000*random())::integer::varchar
  from generate_series(1,100000);

CREATE OR REPLACE FUNCTION toFloat(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
EXCEPTION WHEN OTHERS THEN
  RETURN val;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;

explain (analyze,buffers,COSTS,TIMING) select MAX(toFloat(a, null)) as "a" from 
sampletest;
--Aggregate  (cost=2137.00..2137.01 rows=1 width=4) (actual time=2092.922..2092.923 rows=1 loops=1)
--  Buffers: shared hit=637
--  ->  Seq Scan on sampletest  (cost=0.00..1637.00 rows=100000 width=15) (actual time=0.028..23.925 rows=100000 loops=1)
--        Buffers: shared hit=637
--Planning Time: 0.168 ms
--Execution Time: 2092.957 ms

explain (analyze,buffers,COSTS,TIMING) select MAX(toFloat(b, null)) as "b" from 
sampletest;
--Aggregate  (cost=2137.00..2137.01 rows=1 width=4) (actual time=369.475..369.476 rows=1 loops=1)
--  Buffers: shared hit=637
--  ->  Seq Scan on sampletest  (cost=0.00..1637.00 rows=100000 width=8) (actual time=0.020..18.746 rows=100000 loops=1)
--        Buffers: shared hit=637
--Planning Time: 0.129 ms
--Execution Time: 369.507 ms


Thank you,
Laurent!




RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-08-31 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Andrew Dunstan 
   >  Sent: Tuesday, August 31, 2021 09:40
   >  To: l...@laurent-hasson.com; Julien Rouhaud 
   >  Cc: Tom Lane ; Ranier Vilela ;
   >  Justin Pryzby ; pgsql-
   >  performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  
   >  On 8/30/21 10:18 PM, l...@laurent-hasson.com wrote:
   >  >>  I see on https://www.postgresql.org/download/ that there is a
   >  different
   >  >>  installer from 2ndQuadrant. I am going to try that one and see
   >  what I
   >  >>  come up with. Are there any other "standard" distros of Postgres
   >  that I
   >  >>  could try out?
   >  >>
   >  >>  I found out I could download Visual Studio community edition so I
   >  am
   >  >>  trying this, but may not have the time to get through a build any
   >  time
   >  >>  soon as per my unfamiliarity with the process. I'll follow 
Ranier's
   >  steps
   >  >>  and see if that gets me somewhere.
   >  >>
   >  >>  Thank you,
   >  >>  Laurent.
   >  >
   >  >
   >  > Hello all,
   >  >
   >  > I think I had a breakthrough. I tried to create a local build and wasn't
   >  able to. But I downloaded the 2nd Quadrant installer and the issue
   >  disappeared!!! I think this is proof that it's not my personal
   >  environment, nor something intrinsic in the codebase, but definitely
   >  something in the standard EDB installer.
   >  >
   >  >
   >  
   >  No, you're on the wrong track. As I reported earlier, I have reproduced
   >  this issue with a vanilla build which has no installer involvement
   >  whatsoever.
   >  
   >  I'm pretty sure the reason you are not seeing this with the 2ndQuadrant
   >  installer is quite simple: it wasn't build with NLS support.
   >  
   >  Let me repeat what I said earlier. I will get to the bottom of this.
   >  Please be patient and stop running after red herrings.
   >  
   >  
   >  cheers
   >  
   >  
   >  andrew
   >  
   >  
   >  --
   >  Andrew Dunstan
   >  EDB: https://www.enterprisedb.com

OK... I thought that track had been abandoned as per Julien's last message. 
Anyways, I'll be patient!

Thank you for all the work.
Laurent.




Better performance no-throw conversion?

2021-09-08 Thread l...@laurent-hasson.com
Hello,

Some databases such as SQLServer (try_cast) or BigQuery (safe.cast) offer 
no-throw conversion. In general, these tend to perform better than custom UDFs 
that catch exceptions and are also simpler to use. For example, in Postgres, I 
have a function that does the following:

CREATE OR REPLACE FUNCTION toFloat(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
EXCEPTION WHEN OTHERS THEN
  RETURN val;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;

I couldn't find a reference to such capabilities in Postgres and wondered if I 
missed it, and if not, is there any plan to add such a feature?
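
For concreteness, this is the no-throw behavior the workaround gives me today (a quick sanity check, assuming the definition above):

SELECT toFloat('3.14', -1) AS good,      -- 3.14
       toFloat('abc', -1)  AS bad,       -- -1 (exception swallowed)
       toFloat(NULL, -1)   AS missing;   -- -1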

Thank you!
Laurent Hasson.


RE: Better performance no-throw conversion?

2021-09-08 Thread l...@laurent-hasson.com

> From: Michael Lewis  
> Sent: Wednesday, September 8, 2021 13:40
> To: Tom Lane 
> Cc: l...@laurent-hasson.com; pgsql-performa...@postgresql.org
> Subject: Re: Better performance no-throw conversion?
>
> On Wed, Sep 8, 2021 at 11:33 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> "l...@laurent-hasson.com" <l...@laurent-hasson.com> writes:
> > Some databases such as SQLServer (try_cast) or BigQuery (safe.cast) offer 
> > no-throw conversion.
> > ...
> > I couldn't find a reference to such capabilities in Postgres and wondered 
> > if I missed it, and if not, is there any plan to add such a feature?
>
> There is not anybody working on that AFAIK.  It seems like it'd have
> to be done on a case-by-case basis, which makes it awfully tedious.
>
> Do you just mean a separate function for each data type? I use similar 
> functions (without a default value though) to ensure that values extracted 
> from jsonb keys can be used as needed. Sanitizing the data on input is a long 
> term goal, but not possible immediately.
>
> Is there any documentation on the impact of many many exception blocks? That 
> is, if such a cast function is used on a dataset of 1 million rows, what 
> overhead does that exception incur? Is it only when there is an exception or 
> is it on every row?
>
>

Hello Michael,

There was a recent thread (Big Performance drop of Exceptions in UDFs between 
V11.2 and 13.4) that I started a few weeks back, where it was identified that 
the exception block in the function I posted causes a rough 3x-5x performance 
overhead, which is expected for exception handling. I also identified 
performance plummeting 100x on certain Windows builds, but that is a separate 
issue.
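
A simple way to measure that on your own data is to compare against a twin of the function with the EXCEPTION block removed (a sketch; the twin's name is made up):

-- Same body as toFloat, minus the exception block. Comparing timings of the
-- two over clean, all-numeric data shows the cost of merely entering the
-- block, which is paid on every call even when nothing ever throws.
CREATE OR REPLACE FUNCTION toFloatNoCatch(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;

EXPLAIN (ANALYZE, BUFFERS) SELECT MAX(toFloat(b, null)) FROM sampletest;        -- with block
EXPLAIN (ANALYZE, BUFFERS) SELECT MAX(toFloatNoCatch(b, null)) FROM sampletest; -- without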

Thank you,
Laurent.









RE: Better performance no-throw conversion?

2021-09-08 Thread l...@laurent-hasson.com



   >  -Original Message-
   >  From: Andrew Dunstan 
   >  Sent: Wednesday, September 8, 2021 13:31
   >  To: l...@laurent-hasson.com; pgsql-performa...@postgresql.org
   >  Subject: Re: Better performance no-throw conversion?
   >  
   >  
   >  On 9/8/21 1:17 PM, l...@laurent-hasson.com wrote:
   >  >
   >  > Hello,
   >  >
   >  >
   >  >
   >  > Some databases such as SQLServer (try_cast) or BigQuery (safe.cast)
   >  > offer no-throw conversion. In general, these tend to perform better
   >  > than custom UDFs that catch exceptions and are also simpler to use.
   >  > For example, in Postgres, I have a function that does the following:
   >  >
   >  >
   >  >
   >  > CREATE OR REPLACE FUNCTION toFloat(str varchar, val real)
   >  >
   >  > RETURNS real AS $$
   >  >
   >  > BEGIN
   >  >
   >  >   RETURN case when str is null then val else str::real end;
   >  >
   >  > EXCEPTION WHEN OTHERS THEN
   >  >
   >  >   RETURN val;
   >  >
   >  > END;
   >  >
   >  > $$ LANGUAGE plpgsql COST 1 IMMUTABLE;
   >  >
   >  >
   >  >
   >  > I couldn't find a reference to such capabilities in Postgres and
   >  > wondered if I missed it, and if not, is there any plan to add such a
   >  > feature?
   >  >
   >  >
   >  >
   >  
   >  
   >  Not that I know of, but you could probably do this fairly simply in C.
   >  
   >  
   >  cheers
   >  
   >  
   >  andrew
   >  
   >  --
   >  Andrew Dunstan
   >  EDB: https://www.enterprisedb.com


Hello Andrew,

I work across multiple platforms (Windows, Linux, multiple managed cloud 
versions...) and a C-based solution would be problematic for us.

Thank you,
Laurent.







RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-09-13 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Andrew Dunstan 
   >  Sent: Thursday, September 2, 2021 13:00
   >  To: Julien Rouhaud 
   >  Cc: l...@laurent-hasson.com; Tom Lane ; Ranier
   >  Vilela ; Justin Pryzby ;
   >  pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  
   >  On 9/2/21 11:34 AM, Julien Rouhaud wrote:
   >  > On Thu, Sep 2, 2021 at 11:22 PM Andrew Dunstan
   >   wrote:
   >  >> Here are a couple of pictures of profiles made with a tool called
   >  >> sleepy. The bad profile is from release 13.4 built with the latest
   >  >> gettext, built with vcpkg. The good profile is the same build but
   >  >> using the intl-8.dll copied from the release 11.13 installer. The
   >  >> good run takes about a minute. The bad run takes about 30 minutes.
   >  >>
   >  >>
   >  >> I'm not exactly sure what the profiles tell us.
   >  > Isn't GetLocaleInfoA suspicious?  Especially since the doc [1] says
   >  > that it shouldn't be called anymore unless you want to have
   >  > compatibility with OS from more than a decade ago?
   >  >
   >  > [1]
   >  > https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-
   >  winnls-ge
   >  > tlocaleinfoa
   >  
   >  Possibly, but the profile doesn't show it as having a great impact.
   >  
   >  Maybe surrounding code is affected.
   >  
   >  cheers
   >  
   >  andrew
   >  
   >  
   >  --
   >  Andrew Dunstan
   >  EDB: https://www.enterprisedb.com


Hello all,

Any further update or guidance on this issue at this time?

Thank you,
Laurent.


RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-09-13 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Andrew Dunstan 
   >  Sent: Monday, September 13, 2021 11:36
   >  To: l...@laurent-hasson.com; Julien Rouhaud 
   >  Cc: Tom Lane ; Ranier Vilela ;
   >  Justin Pryzby ; pgsql-
   >  performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4
   >  
   >  
   >  On 9/13/21 10:32 AM, l...@laurent-hasson.com wrote:
   >  >
   >  > Hello all,
   >  >
   >  > Any further update or guidance on this issue at this time?
   >  >
   >  
   >  Wait for a new installer. Our team is working on it. As I have previously
   >  advised you, please be patient.
   >  
   >  
   >  cheers
   >  
   >  
   >  andrew
   >  
   >  --
   >  Andrew Dunstan
   >  EDB: https://www.enterprisedb.com


Hello Andrew,

I'll be as patient as is needed and appreciate absolutely all the work you are 
all doing. I also know V14 is just around the corner, so the team is super 
busy 😊

Just looking for a super-rough ETA for some rough planning on our end. Is 
this something potentially for 13.5 later this year? Or something that may 
happen before the end of September? Or still unknown? And I understand it is 
all always tentative.

Thank you!
Laurent.




RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-09-24 Thread l...@laurent-hasson.com


> -Original Message-
> From: Andrew Dunstan 
> Sent: Friday, September 24, 2021 16:57
> To: l...@laurent-hasson.com; Julien Rouhaud 
> Cc: Tom Lane ; Ranier Vilela ;
> Justin Pryzby ; pgsql-performa...@postgresql.org
> Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2 and
> 13.4
> 
> 
> On 9/13/21 4:36 PM, Andrew Dunstan wrote:
> > On 9/13/21 11:53 AM, l...@laurent-hasson.com wrote:
> >>>  -Original Message-
> >>>  From: Andrew Dunstan 
> >>    >  Sent: Monday, September 13, 2021 11:36
> >>>  To: l...@laurent-hasson.com; Julien Rouhaud 
> >>>  Cc: Tom Lane ; Ranier Vilela
> ;
> >>>  Justin Pryzby ; pgsql-
> >>>  performa...@postgresql.org
> >>>  Subject: Re: Big Performance drop of Exceptions in UDFs between
> V11.2
> >>>  and 13.4
> >>>
> >>>
> >>>  On 9/13/21 10:32 AM, l...@laurent-hasson.com wrote:
> >>>  >
> >>>  > Hello all,
> >>>  >
> >>>  > Any further update or guidance on this issue at this time?
> >>>  >
> >>>
> >>>  Wait for a new installer. Our team is working on it. As I have 
> >> previously
> >>>  advised you, please be patient.
> >>>
> >>>
> >>>  cheers
> >>>
> >>>
> >>>  andrew
> >>>
> >>>  --
> >>>  Andrew Dunstan
> >>>  EDB: https://www.enterprisedb.com
> >>
> >>
> >> Hello Andrew,
> >>
> >> I'll be as patient as is needed and appreciate absolutely all the
> >> work you are all doing. I also know V14 is just around the corner too
> >> so the team is super busy 😊
> >>
> >> Just looking for some super-rough ETA for some rough planning on our
> end. Is this something potentially for 13.5 later this year? Or something that
> may happen before the end of Sept? Or still unknown? And I understand all
> is always tentative.
> >>
> > This is not governed at all by the Postgres release cycle. The issue
> > is not with Postgres but with the version of libintl used in the
> > build. I can't speak for the team, they will publish an updated
> > installer when they get it done. But rest assured it's being worked
> > on. I got email about it just this morning.
> >
> >
> 
> EDB has now published new installers for versions later than release 11,
> containing Postgres built with an earlier version of gettext that does not
> exhibit the problem. Please verify that these fix the issue. If you already
> have Postgres installed from our installer you should be able to upgrade
> using Stackbuilder. Otherwise, you can download from our usual download
> sites.
> 
> 
> cheers
> 
> 
> andrew
> 
> 
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com

[Laurent Hasson] 

Thank you Andrew!!! I may be able to check this over the weekend.

Thank you,
Laurent.


RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-09-25 Thread l...@laurent-hasson.com

   >  > EDB has now published new installers for versions later than release
   >  > 11, containing Postgres built with an earlier version of gettext that
   >  > does not exhibit the problem. Please verify that these fix the issue.
   >  > If you already have Postgres installed from our installer you should
   >  > be able to upgrade using Stackbuilder. Otherwise, you can download
   >  > from our usual download sites.
   >  >
   >  > cheers
   >  >
   >  > andrew
   >  >
   >  > --
   >  > Andrew Dunstan
   >  > EDB: https://www.enterprisedb.com
 

Hello Andrew,

I just downloaded the 13.4 Windows x86-64 installer from 
https://www.enterprisedb.com/downloads/postgres-postgresql-downloads but it's 
the exact same file, bit for bit, as the previous version I had. Am I looking 
in the wrong place?

Thank you
Laurent.



RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-09-27 Thread l...@laurent-hasson.com

From: tushar  
Sent: Monday, September 27, 2021 11:50
To: Andrew Dunstan ; l...@laurent-hasson.com; Julien 
Rouhaud 
Cc: Tom Lane ; Ranier Vilela ; Justin 
Pryzby ; pgsql-performa...@postgresql.org
Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

On 9/27/21 6:55 PM, Andrew Dunstan wrote:
Hello Andrew,

I just downloaded the 13.4 Windows x86-64 installer from 
https://www.enterprisedb.com/downloads/postgres-postgresql-downloads but it's 
the exact same file, bit for bit, as the previous version I had. Am I looking 
in the wrong place?

Thanks. We're dealing with that. However, you can update that version
via stackbuilder. It will show you that 13.4.2 is available. This has
the correct libintl DLL. I just did this to verify it.
Thanks, looks like the issue is fixed now; you can try to download the 
'postgresql-13.4-2-windows-x64.exe' installer from the above-mentioned link.
-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company


Fantastic, I may be able to try again tonight and will report back. The 
environment I work in is isolated from the internet, so I can't use 
StackBuilder.

Thank you,
Laurent.



RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

2021-09-27 Thread l...@laurent-hasson.com

From: tushar  
Sent: Monday, September 27, 2021 11:50
To: Andrew Dunstan ; l...@laurent-hasson.com; Julien 
Rouhaud 
Cc: Tom Lane ; Ranier Vilela ; Justin 
Pryzby ; pgsql-performa...@postgresql.org
Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

On 9/27/21 6:55 PM, Andrew Dunstan wrote:
Hello Andrew,

I just downloaded the 13.4 Windows x86-64 installer from 
https://www.enterprisedb.com/downloads/postgres-postgresql-downloads but it's 
the exact same file, bit for bit, as the previous version I had. Am I looking 
in the wrong place?

Thanks. We're dealing with that. However, you can update that version
via stackbuilder. It will show you that 13.4.2 is available. This has
the correct libintl DLL. I just did this to verify it.

Thanks, looks like the issue is fixed now; you can try to download the 
'postgresql-13.4-2-windows-x64.exe' installer from the above-mentioned link.
-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company


---

Hello all!

WOW! Time for a cigar, as there is double good news 😊
- The problem scenario no longer hammers the system: performance went from 
around 90s to around 2.7 seconds! That's in line with the older 11.2 builds I 
was measuring against.
- The simpler scenario (no throw) looks like it improved by roughly 20%, from 
186ms to 146ms.

I had run the scenarios multiple times before and the times were on the 
average, so I think those gains are real. Thank you for all your efforts. The 
Postgres community is amazing!


Here is the scenario again:

drop table sampletest;
create table sampletest (a varchar, b varchar);
insert into sampletest (a, b)
select substr(md5(random()::text), 0, 15),
       (100000*random())::integer::varchar
  from generate_series(1,100000);
CREATE OR REPLACE FUNCTION toFloat(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
EXCEPTION WHEN OTHERS THEN
  RETURN val;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;

This is what I had on the original 13.4 Windows x64 eDB build:

explain (analyze,buffers,COSTS,TIMING) 
select MAX(toFloat(a, null)) as "a" from sampletest
--Aggregate  (cost=1477.84..1477.85 rows=1 width=4) (actual time=89527.032..89527.033 rows=1 loops=1)
--  Buffers: shared hit=647
--  ->  Seq Scan on sampletest  (cost=0.00..1197.56 rows=56056 width=32) (actual time=0.024..37.811 rows=100000 loops=1)
--        Buffers: shared hit=637
--Planning:
--  Buffers: shared hit=24
--Planning Time: 0.347 ms
--Execution Time: 89527.501 ms


explain (analyze,buffers,COSTS,TIMING) 
select MAX(toFloat(b, null)) as "b" from sampletest
--Aggregate  (cost=2137.00..2137.01 rows=1 width=4) (actual time=186.605..186.606 rows=1 loops=1)
--  Buffers: shared hit=637
--  ->  Seq Scan on sampletest  (cost=0.00..1637.00 rows=100000 width=8) (actual time=0.008..9.679 rows=100000 loops=1)
--        Buffers: shared hit=637
--Planning:
--  Buffers: shared hit=4
--Planning Time: 0.339 ms
--Execution Time: 186.641 ms


This is what I get on the new build

explain (analyze,buffers,COSTS,TIMING) 
select MAX(toFloat(a, null)) as "a" from sampletest
--QUERY PLAN
--Aggregate  (cost=2137.00..2137.01 rows=1 width=4) (actual time=2711.314..2711.315 rows=1 loops=1)
--  Buffers: shared hit=637
--  ->  Seq Scan on sampletest  (cost=0.00..1637.00 rows=100000 width=15) (actual time=0.009..12.557 rows=100000 loops=1)
--        Buffers: shared hit=637
--Planning Time: 0.062 ms
--Execution Time: 2711.336 ms

explain (analyze,buffers,COSTS,TIMING) 
select MAX(toFloat(b, null)) as "b" from sampletest
--QUERY PLAN
--Aggregate  (cost=2137.00..2137.01 rows=1 width=4) (actual time=146.689..146.689 rows=1 loops=1)
--  Buffers: shared hit=637
--  ->  Seq Scan on sampletest  (cost=0.00..1637.00 rows=100000 width=8) (act

An I/O error occurred while sending to the backend (PG 13.4)

2021-12-04 Thread l...@laurent-hasson.com
Hello,

I hope this email finds you all well!

I have a data warehouse with a fairly complex ETL process that has been running 
for years now, across PG 9.6, 11.2 and, for the past couple of months, 13.4. I 
have been getting the error "An I/O error occurred while sending to the 
backend" quite often under load on 13.4, which I never used to get on 11.2. I 
have applied some tricks, particularly with the socketTimeout JDBC 
configuration.

So my first question is whether anyone has any idea why this is happening? My 
hardware and general PG configuration have not changed between 11.2 and 13.4 
and I NEVER experienced this on 11.2 in about 2y of production.

Second, I have one stored procedure that takes a very long time to run (40mn 
more or less), so obviously I'd need to set socketTimeout to something like 1h 
in order to call it and not time out. That doesn't seem reasonable?
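
For reference, this is the kind of connection-string workaround I mean (database name illustrative; in the pgJDBC driver, socketTimeout is in seconds and 0 disables it):

jdbc:postgresql://10.64.17.210:5432/mydb?socketTimeout=3600&tcpKeepAlive=true

That keeps the 40mn procedure call alive, at the price of detecting a genuinely dead connection very late.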

I understand that there is not just Postgres 13.4 in play, but also the JDBC 
driver. I ran production for several days on 42.2.19 (which had run fine with 
PG 11.2) to try it out and got the error a couple of times, i.e., the same as 
with 42.2.24, so I am not sure this has to do with the JDBC driver.

So I am not sure what to do now. I do not know if there are related 
configuration options introduced since 11.2 that I missed and that could 
trigger this issue, or some other phenomenon going on. I have always had a few 
"long running" queries in the system (i.e., > 20mn) and never experienced this 
on 11.2, yet I experience it maybe once or twice a week on 13.4, seemingly at 
random. Sometimes the queries run fine, and other times they time out. Weird.

Thanks,
Laurent.



RE: An I/O error occurred while sending to the backend (PG 13.4)

2021-12-04 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Justin Pryzby 
   >  Sent: Saturday, December 4, 2021 12:59
   >  To: l...@laurent-hasson.com
   >  Cc: pgsql-performa...@postgresql.org
   >  Subject: Re: An I/O error occurred while sending to the backend (PG
   >  13.4)
   >  
   >  On Sat, Dec 04, 2021 at 05:32:10PM +0000, l...@laurent-hasson.com
   >  wrote:
   >  > I have a data warehouse with a fairly complex ETL process that has
   >  been running for years now across PG 9.6, 11.2 and now 13.4 for the
   >  past couple of months. I have been getting the error "An I/O error
   >  occurred while sending to the backend" quite often under load in 13.4
   >  which I never used to get on 11.2. I have applied some tricks, 
particularly
   >  with the socketTimeout JDBC configuration.
   >  >
   >  > So my first question is whether anyone has any idea why this is
   >  happening? My hardware and general PG configuration have not
   >  changed between 11.2 and 13.4 and I NEVER experienced this on 11.2 in
   >  about 2y of production.
   >  >
   >  > Second, I have one stored procedure that takes a very long time to run
   >  (40mn more or less), so obviously, I'd need to set socketTimeout to
   >  something like 1h in order to call it and not timeout. That doesn't seem
   >  reasonable?
   >  
   >  Is the DB server local or remote (TCP/IP) to the client?
   >  
   >  Could you collect the corresponding postgres query logs when this
   >  happens ?
   >  
   >  It'd be nice to see a network trace for this too.  Using tcpdump or
   >  wireshark.
   >  Preferably from the client side.
   >  
   >  FWIW, I suspect the JDBC socketTimeout is a bad workaround.
   >  
   >  --
   >  Justin

It's a remote server, but all on a local network, and I am sure network 
performance is not the issue. Also, the system is on Windows Server. What are 
you expecting to see out of a tcpdump? I'll try to get PG logs on the failing query.
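
Concretely, these are the server settings I am thinking of flipping to catch the next occurrence in the logs (values illustrative):

ALTER SYSTEM SET log_connections = on;
ALTER SYSTEM SET log_disconnections = on;              -- logs session end and duration
ALTER SYSTEM SET log_min_duration_statement = '10min'; -- capture the long-running statements
SELECT pg_reload_conf();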

Thank you,
Laurent.







RE: An I/O error occurred while sending to the backend (PG 13.4)

2022-02-23 Thread l...@laurent-hasson.com



   >  -Original Message-
   >  From: l...@laurent-hasson.com 
   >  Sent: Saturday, December 4, 2021 14:18
   >  To: Justin Pryzby 
   >  Cc: pgsql-performa...@postgresql.org
   >  Subject: RE: An I/O error occurred while sending to the backend (PG 13.4)
   >  
   >  
   > >  -Original Message-
   > >  From: Justin Pryzby 
   > >  Sent: Saturday, December 4, 2021 12:59
   > >  To: l...@laurent-hasson.com
   > >  Cc: pgsql-performa...@postgresql.org
   > >  Subject: Re: An I/O error occurred while sending to the backend (PG
   > >  13.4)
   > >
   > >  On Sat, Dec 04, 2021 at 05:32:10PM +, l...@laurent-hasson.com
   > >  wrote:
   > >  > I have a data warehouse with a fairly complex ETL process that has
   > >  been running for years now across PG 9.6, 11.2 and now 13.4 for the
   > >  past couple of months. I have been getting the error "An I/O error
   > >  occurred while sending to the backend" quite often under load in 
13.4
   > >  which I never used to get on 11.2. I have applied some tricks,
   >  particularly
   > >  with the socketTimeout JDBC configuration.
   > >  >
   > >  > So my first question is whether anyone has any idea why this is
   > >  happening? My hardware and general PG configuration have not
   > >  changed between 11.2 and 13.4 and I NEVER experienced this on 11.2
   >  in
   > >  about 2y of production.
   > >  >
   > >  > Second, I have one stored procedure that takes a very long time to
   >  run
   > >  (40mn more or less), so obviously, I'd need to set socketTimeout to
   > >  something like 1h in order to call it and not timeout. That doesn't 
seem
   > >  reasonable?
   > >
   > >  Is the DB server local or remote (TCP/IP) to the client?
   > >
   > >  Could you collect the corresponding postgres query logs when this
   > >  happens ?
   > >
   > >  It'd be nice to see a network trace for this too.  Using tcpdump or
   > >  wireshark.
   > >  Preferably from the client side.
   > >
   > >  FWIW, I suspect the JDBC socketTimeout is a bad workaround.
   > >
   > >  --
   > >  Justin
   >  
   >  It's a remote server, but all on a local network, and I am sure network
   >  performance is not the issue. Also, the system is on Windows Server. What
   >  are you expecting to see out of a tcpdump? I'll try to get PG logs on the
   >  failing query.
   >  
   >  Thank you,
   >  Laurent.
   >  
   >  
   >  
   >  

Hello Justin,

It has been ages! The issue has been happening a bit more often recently, as 
much as once every 10 days or so. As a reminder, the setup is Postgres 13.4 on 
Windows Server with 16 cores and 64GB memory. The scenario where this occurs is 
an ETL tool called Pentaho Kettle (V7) connecting to the DB for data warehouse 
workloads. The tool is Java-based and connects via JDBC using 
postgresql-42.2.5.jar. There are no particular settings besides the 
socketTimeout setting mentioned above.

Some steps in the workload are lots of quick transactions (for dimension 
tables, for example), but some fact-table calculations, especially large 
pivots, can make queries run from 40mn up to over an hour (there are a few of 
those).

I caught these in the logs at the time of a failure but am unsure what to make 
of them:


2022-02-21 02:08:16.214 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:29.347 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.371 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.463 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.596 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.687 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.786 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.873 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.976 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.050 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.131 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.240 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.906 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.988 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:33.068 EST [1704] LOG:  setsockopt(TCP_USER_TIMEOUT) not supported
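
For what it's worth, those messages suggest TCP_USER_TIMEOUT is simply not available on Windows, so the closest server-side knobs I know of are the keepalive settings (my guess at a mitigation, not a confirmed fix; values illustrative, and tcp_keepalives_count is not supported on Windows):

ALTER SYSTEM SET tcp_keepalives_idle = 60;      -- seconds of idle before the first probe
ALTER SYSTEM SET tcp_keepalives_interval = 10;  -- seconds between probes
SELECT pg_reload_conf();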

RE: An I/O error occurred while sending to the backend (PG 13.4)

2022-02-28 Thread l...@laurent-hasson.com


>From: Ranier Vilela  
>Sent: Thursday, February 24, 2022 08:46
>To: Justin Pryzby 
>Cc: l...@laurent-hasson.com; pgsql-performa...@postgresql.org
>Subject: Re: An I/O error occurred while sending to the backend (PG 13.4) 
>
>On Thu, Feb 24, 2022 at 9:59 AM, Justin Pryzby <pry...@telsasoft.com> wrote:
>On Thu, Feb 24, 2022 at 08:50:45AM -0300, Ranier Vilela wrote:
>> I can't understand why you are still using 13.4?
>> There is a long discussion [1] about the issue with 13.4; the work there was
>> done to fix a DLL bottleneck.
>> 
>> Why not use 13.6?
>
>That other problem (and its fix) was in the Windows build environment, not
>an issue in any particular Postgres version.
>Yeah, correct.
>But I think it was very clear in the other thread that version 13.4, 
>on Windows, may have a slowdown because of the DLL problem.
>So it would be better to use the latest available version, 
>which has this specific fix and many others.
>
>regards,
>Ranier Vilela


OK, absolutely. I was thinking about even moving to 14. I know migrations 
within a release are painless, but my experience with upgrading across releases 
has also been quite good (short of bugs that were found of course). Any opinion 
on 14.2?

Thank you, Laurent.




RE: An I/O error occurred while sending to the backend (PG 13.4)

2022-02-28 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Justin Pryzby 
   >  Sent: Friday, February 25, 2022 08:02
   >  To: l...@laurent-hasson.com
   >  Cc: pgsql-performa...@postgresql.org
   >  Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
   >  
   >  On Wed, Feb 23, 2022 at 07:04:15PM -0600, Justin Pryzby wrote:
   >  > And the aforementioned network trace.  You could set a capture filter
   >  > on TCP
   >  > SYN|RST so it's not absurdly large.  From my notes, it might look like 
this:
   >  > (tcp[tcpflags]&(tcp-rst|tcp-syn|tcp-fin)!=0)
   >  
   >  I'd also add '|| icmp'.  My hunch is that you'll see some ICMP (not 
"ping")
   >  being sent by an intermediate gateway, resulting in the connection being
   >  reset.
   >  
   >  --
   >  Justin


Hello Justin,

I am so sorry but I do not understand what you are asking me to do. I am 
unfamiliar with these commands. Is this a postgres configuration file? Is this 
something I just do once or something I leave on to hopefully catch it when the 
issue occurs? Is this something to do on the DB machine or the ETL machine? FYI:

  - My ETL machine is on 10.64.17.211
  - My DB machine is on 10.64.17.210
  - Both on Windows Server 2012 R2, x64

So sorry for the bother,
Laurent.