Re: Death postgres

2023-05-10 Thread Peter J. Holzer
On 2023-05-10 22:52:47 +0200, Marc Millas wrote:
> On Wed, May 10, 2023 at 7:24 PM Peter J. Holzer  wrote:
> 
> On 2023-05-10 16:35:04 +0200, Marc Millas wrote:
> >  Unique  (cost=72377463163.02..201012533981.80 rows=1021522829864 width=97)
> >    ->  Gather Merge  (cost=72377463163.02..195904919832.48 rows=1021522829864 width=97)
> ...
> >                ->  Parallel Hash Left Join  (cost=604502.76..1276224253.51 rows=204304565973 width=97)
> >                      Hash Cond: ((t1.col_ano)::text = (t2.col_ano)::text)
> ...
> >
> > //so.. the planner guesses that those 2 joins will generate 1000 billion rows...
> 
> Are some of the col_ano values very frequent? If say the value 42 occurs
> 1 million times in both table_a and table_b, the join will create 1
> trillion rows for that value alone. That doesn't explain the crash or the
> disk usage, but it would explain the crazy cost (and would probably be a
> hint that this query is unlikely to finish in any reasonable time).
> 
> 
> good guess, even if a bit surprising: there is one (and only one) "value"
> which fits your supposition: NULL

But NULL doesn't equal NULL, so that would result in only one row in the
left join. So that's not it.
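To see the effect in isolation (a minimal, self-contained sketch, not run
against the tables in question):

-- NULL = NULL evaluates to NULL, not true, so NULL keys never satisfy an
-- equality join condition:
SELECT NULL = NULL AS null_eq_null;        -- NULL
SELECT count(*)
  FROM (VALUES (NULL::text)) a(k)
  LEFT JOIN (VALUES (NULL::text)) b(k) ON a.k = b.k;
-- 1 row (the unmatched left row), not a cross product of the NULLs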

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Death postgres

2023-05-10 Thread Marc Millas
On Wed, May 10, 2023 at 7:24 PM Peter J. Holzer  wrote:

> On 2023-05-10 16:35:04 +0200, Marc Millas wrote:
> >  Unique  (cost=72377463163.02..201012533981.80 rows=1021522829864 width=97)
> >    ->  Gather Merge  (cost=72377463163.02..195904919832.48 rows=1021522829864 width=97)
> ...
> >                ->  Parallel Hash Left Join  (cost=604502.76..1276224253.51 rows=204304565973 width=97)
> >                      Hash Cond: ((t1.col_ano)::text = (t2.col_ano)::text)
> ...
> >
> > //so.. the planner guesses that those 2 joins will generate 1000 billion rows...
>
> Are some of the col_ano values very frequent? If say the value 42 occurs
> 1 million times in both table_a and table_b, the join will create 1
> trillion rows for that value alone. That doesn't explain the crash or the
> disk usage, but it would explain the crazy cost (and would probably be a
> hint that this query is unlikely to finish in any reasonable time).
>
> hp
>
good guess, even if a bit surprising: there is one (and only one) "value"
which fits your supposition: NULL
75 in each table, which perfectly fits the planner's row estimate.
One question: what does postgres do when it plans to hash 1000 billion rows?
Did postgres create an appropriate "space" to handle those 1000 billion hash
values?
 thanks,
MM

> --
>_  | Peter J. Holzer| Story must make more sense than reality.
> |_|_) ||
> | |   | h...@hjp.at |-- Charles Stross, "Creative writing
> __/   | http://www.hjp.at/ |   challenge!"
>


Re: Death postgres

2023-05-10 Thread Peter J. Holzer
On 2023-05-10 16:35:04 +0200, Marc Millas wrote:
>  Unique  (cost=72377463163.02..201012533981.80 rows=1021522829864 width=97)
>    ->  Gather Merge  (cost=72377463163.02..195904919832.48 rows=1021522829864 width=97)
...
>                ->  Parallel Hash Left Join  (cost=604502.76..1276224253.51 rows=204304565973 width=97)
>                      Hash Cond: ((t1.col_ano)::text = (t2.col_ano)::text)
...
> 
> //so.. the planner guesses that those 2 joins will generate 1000 billion rows...

Are some of the col_ano values very frequent? If say the value 42 occurs
1 million times in both table_a and table_b, the join will create 1
trillion rows for that value alone. That doesn't explain the crash or the
disk usage, but it would explain the crazy cost (and would probably be a
hint that this query is unlikely to finish in any reasonable time).
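As a back-of-the-envelope illustration (a self-contained sketch, not the
tables from this thread):

-- a key that appears 1000 times on each side already yields 1,000,000
-- joined rows for that key alone:
SELECT count(*)
  FROM (SELECT 42 AS k FROM generate_series(1, 1000)) a
  JOIN (SELECT 42 AS k FROM generate_series(1, 1000)) b USING (k);
-- count = 1000000; at 1 million duplicates per side that becomes 10^12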

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Death postgres

2023-05-10 Thread Laurenz Albe
On Wed, 2023-05-10 at 16:35 +0200, Marc Millas wrote:
> > > postgres 14.2 on Linux redhat
> > > 
> > > temp_file_limit set around 210 GB.
> > > 
> > > a select request with 2 left joins has crashed the server (OOM killer)
> > > after the postgres disk occupation grew from 15 TB to 16 TB.
> > > 
> > > What are the cases where postgres may grow without caring about
> > > temp_file_limit?
> > 
> > That's too little information for a decent answer.
> > One obvious answer is: if it is not writing temporary files.
> 
> so here is a little more info:
> 
>                                                         QUERY PLAN
> ---
>  Unique  (cost=72377463163.02..201012533981.80 rows=1021522829864 width=97)
>    ->  Gather Merge  (cost=72377463163.02..195904919832.48 rows=1021522829864 width=97)
>          Workers Planned: 5
>          ->  Sort  (cost=72377462162.94..72888223577.87 rows=204304565973 width=97)
>                Sort Key: t1.col_ine, (CASE WHEN (t2.col_ibi IS NULL) THEN t3.col_ibi ELSE t2.col_ibi END)
>                ->  Parallel Hash Left Join  (cost=604502.76..1276224253.51 rows=204304565973 width=97)
>                      Hash Cond: ((t1.col_ano)::text = (t2.col_ano)::text)
>                      ->  Parallel Hash Left Join  (cost=300803.38..582295.38 rows=4857277 width=52)
>                            Hash Cond: ((t1.col_ne)::text = (t3.col_ne)::text)
>                            ->  Parallel Seq Scan on table_a t1  (cost=0.00..184052.35 rows=2616335 width=53)
>                            ->  Parallel Hash  (cost=243466.06..243466.06 rows=2965306 width=31)
>                                  ->  Parallel Seq Scan on table_b t3  (cost=0.00..243466.06 rows=2965306 width=31)
>                      ->  Parallel Hash  (cost=243466.06..243466.06 rows=2965306 width=34)
>                            ->  Parallel Seq Scan on table_b t2  (cost=0.00..243466.06 rows=2965306 width=34)
>  JIT:
>    Functions: 19
>    Options: Inlining true, Optimization true, Expressions true, Deforming true
> (17 rows)

Perhaps parallel query drives you OOM.  Does the problem also happen if
"max_parallel_workers_per_gather" is set to 0?

Yours,
Laurenz Albe




Re: Death postgres

2023-05-10 Thread Marc Millas
On Sun, May 7, 2023 at 8:42 PM Laurenz Albe 
wrote:

> On Sat, 2023-05-06 at 03:14 +0200, Marc Millas wrote:
> > postgres 14.2 on Linux redhat
> >
> > temp_file_limit set around 210 GB.
> >
> > a select request with 2 left joins has crashed the server (OOM killer)
> > after the postgres disk occupation grew from 15 TB to 16 TB.
> >
> > What are the cases where postgres may grow without caring about
> > temp_file_limit?
>
> That's too little information for a decent answer.
> One obvious answer is: if it is not writing temporary files.
>
> Yours,
> Laurenz Albe
>

Logical ...

so here is a little more info:

db=# analyze myschema.table_a;
ANALYZE

db=# with ta as (select 'myschema' as s, 'table_a' as t),
          p as (select * from information_schema.columns, ta
                where table_schema=ta.s and table_name=ta.t),
          tableid as (select relid from pg_stat_user_tables, ta
                      where schemaname=ta.s and relname=ta.t)
     select staattnum, column_name, stanullfrac, stadistinct
     from tableid, pg_statistic join p on p.ordinal_position=staattnum
     where starelid=tableid.relid
     order by staattnum;
 staattnum | column_name | stanullfrac | stadistinct
-----------+-------------+-------------+-------------
         1 | col_ne      |           0 |  -0.6100224
         2 | col_brg     |      0.0208 |           6
         3 | col_ano     |           0 |      447302
         4 | col_ine     |           0 |  -0.5341927
         5 | col_cha     |           0 |          11
(5 rows)

db=# select count(*) from myschema.table_a;
  count
--
 13080776
(1 row)

db=# select count(distinct col_ano) from myschema.table_a;
  count
--
 10149937
(1 row)


// stats look somewhat absurd, as analyze guesses about 20 times fewer distinct
values on column col_ano than a select distinct actually finds...
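// one knob worth trying here (just a sketch; the target of 1000 is only an
// example value) is a larger per-column statistics target, so that ANALYZE
// samples more rows for col_ano:

ALTER TABLE myschema.table_a ALTER COLUMN col_ano SET STATISTICS 1000;
ANALYZE myschema.table_a;
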
db=# analyze myschema.table_b;
ANALYZE
db=# with ta as (select 'myschema' as s, 'table_b' as t),
          p as (select * from information_schema.columns, ta
                where table_schema=ta.s and table_name=ta.t),
          tableid as (select relid from pg_stat_user_tables, ta
                      where schemaname=ta.s and relname=ta.t)
     select staattnum, column_name, stanullfrac, stadistinct
     from tableid, pg_statistic join p on p.ordinal_position=staattnum
     where starelid=tableid.relid
     order by staattnum;
 staattnum | column_name | stanullfrac | stadistinct
-----------+-------------+-------------+--------------
         1 | col_nerg    |           0 |       161828
         2 | col_ibi     |           0 |       362161
         3 | col_imi     |           0 |       381023
         4 | col_ipi     |           0 |       391915
         5 | col_ne      |           0 |  -0.53864235
         6 | col_ano     |           0 |       482225
(6 rows)

db=# select count(*) from myschema.table_b;
  count
--
 14811485
(1 row)

db=# select count(distinct col_ano) from myschema.table_b;
  count
--
 10149937
(1 row)

//same remark


db=# explain
     select distinct t1.col_ine,
            case when t2.col_ibi is null then t3.col_ibi else t2.col_ibi end
     from myschema.table_a t1
     left join myschema.table_b t2 on t1.col_ano=t2.col_ano
     left join myschema.table_b t3 on t1.NUM_ENQ=t3.NUM_ENQ;
                                                        QUERY PLAN
---
 Unique  (cost=72377463163.02..201012533981.80 rows=1021522829864 width=97)
   ->  Gather Merge  (cost=72377463163.02..195904919832.48 rows=1021522829864 width=97)
         Workers Planned: 5
         ->  Sort  (cost=72377462162.94..72888223577.87 rows=204304565973 width=97)
               Sort Key: t1.col_ine, (CASE WHEN (t2.col_ibi IS NULL) THEN t3.col_ibi ELSE t2.col_ibi END)
               ->  Parallel Hash Left Join  (cost=604502.76..1276224253.51 rows=204304565973 width=97)
                     Hash Cond: ((t1.col_ano)::text = (t2.col_ano)::text)
                     ->  Parallel Hash Left Join  (cost=300803.38..582295.38 rows=4857277 width=52)
                           Hash Cond: ((t1.col_ne)::text = (t3.col_ne)::text)
                           ->  Parallel Seq Scan on table_a t1  (cost=0.00..184052.35 rows=2616335 width=53)
                           ->  Parallel Hash  (cost=243466.06..243466.06 rows=2965306 width=31)
                                 ->  Parallel Seq Scan on table_b t3  (cost=0.00..243466.06 rows=2965306 width=31)
                     ->  Parallel Hash  (cost=243466.06..243466.06 rows=2965306 width=34)
                           ->  Parallel Seq Scan on table_b t2  (cost=0.00..243466.06 rows=2965306 width=34)
 JIT:
   Functions: 19
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(17 rows)

//so.. the planner guesses that those 2 joins will generate 1000 billion rows...

//so, I try to change stats

db=# alter table myschema.table_a alter column col_ano set
(n_distinct=-0.8);
ALTER TABLE
db=# analyze myschema.table_a;
ANALYZE
db=# with ta as(select 'myschema' as s, 'table_a' as t), p as (select *
from 

Re: "PANIC: could not open critical system index 2662" - twice

2023-05-10 Thread Evgeny Morozov
On 10/05/2023 6:39 am, Kirk Wolak wrote:
> It could be as simple as creating temp tables in the other database
> (since I believe pg_class was hit).
We do indeed create temp tables, both in other databases and in the ones
being tested. (We also create non-temp tables there.)
>
> Also, not sure if the OP has a set of things done after he creates the
> DB that may help?

Basically we read rows from the source database, create some partitions of
tables in the target database, insert into a temp table there using BULK
COPY, and then copy from the temp tables into the new partitions using a
regular INSERT.
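In rough outline (a hypothetical sketch with made-up names, not our actual
code), each load does something like:

-- create the new partition in the target database
CREATE TABLE target.events_p1 PARTITION OF target.events
    FOR VALUES FROM (1) TO (1000);

-- stage the incoming rows in a temp table (the "BULK COPY" step)
CREATE TEMP TABLE events_stage (LIKE target.events INCLUDING DEFAULTS);
COPY events_stage FROM STDIN;

-- then move them into the new partition with a regular INSERT
INSERT INTO target.events SELECT * FROM events_stage;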


Now that the problem has been reproduced and understood by the PG
developers, could anyone explain why PG crashed entirely with the
"PANIC" error back in April when only specific databases were corrupted,
not any global objects necessary for PG to run? And why did it not crash
with the "PANIC" on this occasion?



Re: Return rows in input array's order?

2023-05-10 Thread Dominique Devienne
On Wed, May 10, 2023 at 1:08 PM Andrew Gierth 
wrote:

> > "Dominique" == Dominique Devienne  writes:
>  Dominique> I assume that if the PK is composite, and I pass the PK
>  Dominique> tuples as separate same-cardinality "parallel" arrays, I can
>  Dominique> "zip" those arrays back via a multi-join using their
>  Dominique> ordinals before joining with the composite-PK table?
>
> You don't need to, because unnest can do it for you:
>

Wow, that's fantastic. Thanks!

>  Dominique> PS: I guess the ideal plan depends both on the table itself,
>  Dominique> but also the cardinality of the array(s) passed in as bind
>  Dominique> variable(s) at runtime to the prepared statement, right?
>
> Yes, in the sense that what matters is what proportion of the table is
> being fetched. Is it likely that you'll be passing in very long lists of
> ids relative to the table size?
>

I'm writing a new mid-tier implementation of an existing protocol / API,
so I don't decide "how much" the clients ask for. The API certainly allows
a small request to return a large amount of data / rows from several tables.

The queries using lists of IDs (SKs) as where-clauses are typically internal
implementation details, though, not per se client requests.


>  Dominique> But from past posts, I got the impression the plan of a
>  Dominique> prepared statement is "fixed", and does not depend on "bind
>  Dominique> peeking" like it can in Oracle, to take those bound array's
>  Dominique> cardinality into account at PQexecPrepared-time?
>
> It's a bit more complicated than that and it often depends on what the
> client library is doing; many clients don't do a protocol-level named
> prepare until after a client-side prepared statement has been used
> several times; and even after doing a named prepare, the planner won't
> try a generic plan until after several more uses.
>

I'm in C++ using my own thin wrapper on top of libpq directly.

And I do tend to PQprepare extensively, since my mid-tier implementation
is long-lived and many queries will be used many, many times. This used to
matter a lot with Oracle OCI, but maybe the lack of bind-peeking to re-plan
or select among a choice of plans makes always preparing statements a bad
choice with PostgreSQL / LibPQ?


> We distinguish between "generic plan" and "custom plan"; a generic plan
> is one produced without knowledge of the parameter values and must work
> for any parameter values, while a custom plan only works for one
> specific set of parameter values and can't usually be re-used. Custom
> plans take the parameter values into account both for estimation and for
> constant-folding optimizations. Generic plans are used after about the
> 5th use of a statement if the cost of the generic plan isn't worse than
> the average costs of the custom plans from the previous uses, plus a
> fudge factor representing the CPU cost of custom planning.
>

Indeed it's more complicated than I thought... Interesting though.


> The planning hazard in cases like this is that when doing a generic
> plan, the planner has no idea at all what the array cardinalities will
> be; it doesn't try and cache information like that from the custom
> plans. So it will make a zeroth-order approximation (i.e. a constant)
> derived by the time-honoured method of rectal extraction, and this may
> make the generic plan look a lot cheaper than it should.
>

Funny colorful language :). Thanks again, you've been tremendously helpful.
--DD


Patroni Issue

2023-05-10 Thread Inzamam Shafiq
Hi Team,

I hope you are doing well.

I am working on Patroni auto failover. I have 3 etcd nodes and 2 pgsql/Patroni
nodes; the etcd cluster is running fine with no issues. I have installed
PostgreSQL on the Patroni nodes and configured streaming replication using
pg_basebackup, which is also running fine. On top of that I have created a YAML
file for Patroni, but when I start Patroni on the first (primary) node, it
shows the following logs, keeps repeating the same message, and stays in the
starting state. Can someone please help me identify the issue? The logs are as
follows:

-bash-4.2$ patroni /etc/patroni/patroni.yml
2023-05-10 16:43:59,147 INFO: Selected new etcd server http://172.xx.xx.xx:2379
2023-05-10 16:43:59,151 INFO: No PostgreSQL configuration items changed, 
nothing to reload.
2023-05-10 16:43:59,171 WARNING: Postgresql is not running.
2023-05-10 16:43:59,171 INFO: Lock owner: None; I am pg1
2023-05-10 16:43:59,174 INFO: pg_controldata:
  pg_control version number: 1201
  Catalog version number: 201909212
  Database system identifier: 7230768165275119881
  Database cluster state: shut down
  pg_control last modified: Wed May 10 16:43:54 2023
  Latest checkpoint location: 0/6000150
  Latest checkpoint's REDO location: 0/6000150
  Latest checkpoint's REDO WAL file: 00010006
  Latest checkpoint's TimeLineID: 1
  Latest checkpoint's PrevTimeLineID: 1
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:490
  Latest checkpoint's NextOID: 16386
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 479
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Wed May 10 16:43:53 2023
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 5
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: c557e5f51c201ffaa61d5372fe87384044552322ff979d1e05f6030be1fc7cc0

2023-05-10 16:43:59,185 INFO: Lock owner: None; I am pg1
2023-05-10 16:43:59,191 INFO: starting as a secondary
2023-05-10 16:43:59.461 PKT [22878] LOG:  starting PostgreSQL 12.14 on 
x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 
64-bit
2023-05-10 16:43:59.461 PKT [22878] LOG:  listening on IPv4 address "0.0.0.0", 
port 5432
2023-05-10 16:43:59,476 INFO: postmaster pid=22878
2023-05-10 16:43:59.479 PKT [22878] LOG:  listening on Unix socket 
"/var/run/postgresql/.s.PGSQL.5432"
localhost:5432 - no response
2023-05-10 16:43:59.498 PKT [22878] LOG:  redirecting log output to logging 
collector process
2023-05-10 16:43:59.498 PKT [22878] HINT:  Future log output will appear in 
directory "log".
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
2023-05-10 16:44:09,163 INFO: Lock owner: None; I am pg1
2023-05-10 16:44:09,163 INFO: not healthy enough for leader race
2023-05-10 16:44:09,169 INFO: restarting after failure in progress
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response
2023-05-10 16:44:19,163 INFO: Lock owner: None; I am pg1
2023-05-10 16:44:19,163 INFO: not healthy enough for leader race
2023-05-10 16:44:19,169 INFO: restarting after failure in progress
localhost:5432 - no response
localhost:5432 - no response
localhost:5432 - no response

Thanks.

Regards,

Inzamam Shafiq
Sr. DBA


Re: Return rows in input array's order?

2023-05-10 Thread Andrew Gierth
> "Dominique" == Dominique Devienne  writes:

 Dominique> Is it possible to maintain $1's order directly in SQL?

 >> This is the correct way:
 >> 
 >> SELECT ... FROM unnest($1) WITH ORDINALITY AS u(id,ord)
 >> JOIN yourtable t ON t.id=u.id
 >> ORDER BY u.ord;

 Dominique> Thanks Andrew, for spelling it out for me. Appreciated.
 Dominique> Also thanks to others who chimed in.

 Dominique> I assume that if the PK is composite, and I pass the PK
 Dominique> tuples as separate same-cardinality "parallel" arrays, I can
 Dominique> "zip" those arrays back via a multi-join using their
 Dominique> ordinals before joining with the composite-PK table?

You don't need to, because unnest can do it for you:

SELECT ... FROM unnest($1,$2,$3) WITH ORDINALITY AS u(id1,id2,id3,ord)
   JOIN yourtable t ON t.id1=u.id1
   AND t.id2=u.id2
   AND t.id3=u.id3
 ORDER BY u.ord;

(I did actually consider using a join on the ordinal column to implement
multi-arg unnest internally, but the overhead was too much. So instead
the executor knows how to do the zipping itself.)
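A tiny standalone example of the zipped output (literal arrays instead of
bind parameters):

SELECT * FROM unnest(ARRAY[10,20,30], ARRAY['a','b','c'])
         WITH ORDINALITY AS u(id, val, ord);
-- => (10,'a',1), (20,'b',2), (30,'c',3)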

 Dominique> PS: I guess the ideal plan depends both on the table itself,
 Dominique> but also the cardinality of the array(s) passed in as bind
 Dominique> variable(s) at runtime to the prepared statement, right?

Yes, in the sense that what matters is what proportion of the table is
being fetched. Is it likely that you'll be passing in very long lists of
ids relative to the table size?

 Dominique> But from past posts, I got the impression the plan of a
 Dominique> prepared statement is "fixed", and does not depend on "bind
 Dominique> peeking" like it can in Oracle, to take those bound array's
 Dominique> cardinality into account at PQexecPrepared-time?

It's a bit more complicated than that and it often depends on what the
client library is doing; many clients don't do a protocol-level named
prepare until after a client-side prepared statement has been used
several times; and even after doing a named prepare, the planner won't
try a generic plan until after several more uses.

We distinguish between "generic plan" and "custom plan"; a generic plan
is one produced without knowledge of the parameter values and must work
for any parameter values, while a custom plan only works for one
specific set of parameter values and can't usually be re-used. Custom
plans take the parameter values into account both for estimation and for
constant-folding optimizations. Generic plans are used after about the
5th use of a statement if the cost of the generic plan isn't worse than
the average costs of the custom plans from the previous uses, plus a
fudge factor representing the CPU cost of custom planning.

The planning hazard in cases like this is that when doing a generic
plan, the planner has no idea at all what the array cardinalities will
be; it doesn't try and cache information like that from the custom
plans. So it will make a zeroth-order approximation (i.e. a constant)
derived by the time-honoured method of rectal extraction, and this may
make the generic plan look a lot cheaper than it should.
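If that hazard bites, one possible workaround (PostgreSQL 12 and later; shown
here only as a per-session sketch, not a blanket recommendation) is to keep
the prepared statements but force custom plans, so the actual parameter
values are always used for planning:

SET plan_cache_mode = force_custom_plan;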

-- 
Andrew (irc:RhodiumToad)




Re: Return rows in input array's order?

2023-05-10 Thread Dominique Devienne
On Wed, May 10, 2023 at 9:49 AM Andrew Gierth 
wrote:

>  Dominique> Is it possible to maintain $1's order directly in SQL?
>
> This is the correct way:
>
> SELECT ... FROM unnest($1) WITH ORDINALITY AS u(id,ord)
>JOIN yourtable t ON t.id=u.id
>  ORDER BY u.ord;
>

Thanks Andrew, for spelling it out for me. Appreciated.
Also thanks to others who chimed in.

I assume that if the PK is composite, and I pass the PK tuples as separate
same-cardinality "parallel" arrays, I can "zip" those arrays back via a
multi-join
using their ordinals before joining with the composite-PK table? --DD

PS: I guess the ideal plan depends both on the table itself and on the
  cardinality of the array(s) passed in as bind variable(s) at runtime to the
  prepared statement, right?
  But from past posts, I got the impression the plan of a prepared statement
  is "fixed", and does not depend on "bind peeking" like it can in Oracle, to
  take those bound arrays' cardinality into account at PQexecPrepared-time?

PPS: This is something I actually failed to do in Oracle in the past...


Re: Return rows in input array's order?

2023-05-10 Thread Andrew Gierth
> "Dominique" == Dominique Devienne  writes:

 Dominique> Hi. With an integer identity primary key table,
 Dominique> we fetch a number of rows with WHERE id = ANY($1),
 Dominique> with $1 an int[] array. The API using that query must return
 Dominique> rows in the input int[] array order, and uses a client-side
 Dominique> mapping to achieve that currently.

 Dominique> Is it possible to maintain $1's order directly in SQL?
 Dominique> Efficiently?

This is the correct way:

SELECT ... FROM unnest($1) WITH ORDINALITY AS u(id,ord)
   JOIN yourtable t ON t.id=u.id
 ORDER BY u.ord;

This doesn't assume there won't be holes (if you want, you can change it
to a left join to get a null row instead for missing ids).

The query plan you get for this should be something like:

  Nested Loop
Function Scan on unnest
Index Scan on yourtable_pkey

(less likely, depending on table sizes, would be a Merge Join with
similar inputs. If your table is very small you might get a hashjoin and
separate sort, but that shouldn't happen with realistic data sizes.)

Notice that this is entirely deterministic about the output ordering
without needing to do any sorting. (The planner knows that the output of
WITH ORDINALITY function scans is automatically ordered by the ordinal
column, so it will usually generate plans that take advantage of that.)
The presence of "ORDER BY u.ord" ensures that the output order is
correct regardless of plan choice.
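The left-join variant mentioned above would look like (same query, only the
join type changes):

SELECT ... FROM unnest($1) WITH ORDINALITY AS u(id,ord)
   LEFT JOIN yourtable t ON t.id=u.id
 ORDER BY u.ord;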

-- 
Andrew (irc:RhodiumToad)