Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-04-08 Thread Daniel Verite
Alexander Lakhin wrote:

> >> Now that ExecQueryUsingCursor() is gone, it's not clear what the
> >> following comment means:
> >>  * We must turn off gexec_flag to avoid infinite recursion.  Note that
> >>  * this allows ExecQueryUsingCursor to be applied to the individual query
> >>  * results.
> > Hmm, the point about recursion is still valid isn't it?  I agree the
> > reference to ExecQueryUsingCursor is obsolete, but I think we need to
> > reconstruct what this comment is actually talking about.  It's
> > certainly pretty obscure ...
> 
> Sorry, I wasn't clear enough, I meant to remove only that reference, not
> the quoted comment altogether.

The comment might want to stress the fact that psql honors
FETCH_COUNT "on top of" \gexec, so if the user issues for instance:

  select 'select ' || i from generate_series(1, <n>) as i \gexec

what's going to be sent to the server is a series of:

 BEGIN
 DECLARE _psql_cursor NO SCROLL CURSOR FOR
   select <i>
 FETCH FORWARD <fetch_count> FROM _psql_cursor (possibly repeated)
 CLOSE _psql_cursor
 COMMIT

Another choice would be to ignore FETCH_COUNT and send exactly the
queries that \gexec produces, with the assumption that this better
matches the user's expectation. Maybe that alternative was considered
and the comment reflects the decision.

Since the new implementation doesn't rewrite the user-supplied queries,
the point is moot, and this part should be removed:
  "Note that this allows ExecQueryUsingCursor to be applied to the
  individual query results"
I'll wait a bit for other comments and submit a patch.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-04-08 Thread Daniel Verite
Tom Lane wrote:

> I've reconsidered after realizing that implementing FETCH_COUNT
> atop traditional single-row mode would require either merging
> single-row results into a bigger PGresult or persuading psql's
> results-printing code to accept an array of PGresults not just
> one.  Either of those would be expensive and ugly, not to mention
> needing chunks of code we don't have today.

Yes, we must accumulate results because the aligned format needs to
know the column widths for an entire "page", and the row-by-row logic
does not fit well in that case.
One of the posted patches implemented this with an array of PGresult
in single-row mode [1] but I'm confident that the newer version you
pushed with the libpq changes is a better approach.

> So I whacked the patch around till I liked it better, and pushed it.

Thanks for taking care of this!


[1]
https://www.postgresql.org/message-id/092583fb-97c5-428f-8d99-fd31be4a5...@manitou-mail.org


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-06 Thread Daniel Verite
Tom Lane wrote:

> This is sufficiently weird that I'm starting to come around to
> Daniel's original proposal that we just drop the server's recognition
> of \. altogether (which would allow removal of some dozens of lines of
> complicated and now known-buggy code)

FWIW my plan was to not change anything in the TEXT mode,
but I wasn't aware it had this issue that you found when
\. is not in a line by itself.

>  Alternatively, we could fix it so that \. at the end of a line draws
> "end-of-copy marker corrupt"
> which would at least make things consistent, but I'm not sure that has
> any great advantage.  I surely don't want to document the current
> behavioral details as being the right thing that we're going to keep
> doing.

Agreed we don't want to document that, but also why doesn't \. in the
contents represent just a dot (as opposed to being an error),
just like \a represents an "a"?

I mean if eofdata contains

  foobar\a
  foobaz\aother

then we get after import:
      f1      
--------------
 foobara
 foobazaother
(2 rows)

Reading the current doc on the text format, I can't see why
importing:

  foobar\.
  foobaz\.other

is not supposed to produce
      f1      
--------------
 foobar.
 foobaz.other
(2 rows)


I see these rules in [1] about backslash:

#1. 
  "End of data can be represented by a single line containing just
   backslash-period (\.)."

foobar\. and foobaz\.other do not match that, so #1 does not describe
how they're interpreted.

#2.
  "Backslash characters (\) can be used in the COPY data to quote data
  characters that might otherwise be taken as row or column
  delimiters."

Dot is not a column delimiter (it's forbidden anyway), so #2 does
not apply.

#3.
  "In particular, the following characters must be preceded by a
  backslash if they appear as part of a column value: backslash itself,
  newline, carriage return, and the current delimiter character"

Dot is not in that list so #3 does not apply.

#4.
  "The following special backslash sequences are recognized by COPY
  FROM:" (followed by the table with \b \f, ...,)

Dot is not mentioned.

#5.
  "Any other backslashed character that is not mentioned in the above
  table will be taken to represent itself"

Here we say that backslash dot represents a dot (unless other
rules apply)

  foobar\. => foobar.
  foobaz\.other => foobaz.other

#6.
  "However, beware of adding backslashes unnecessarily, since that
   might accidentally produce a string matching the end-of-data marker
   (\.) or the null string (\N by default)."

So we *recommend* not to use \., but as I understand it, the warning
about the EOD marker is about accidentally creating a line that matches #1,
that is, \. alone on a line.

#7
  "These strings will be recognized before any other backslash
  processing is done."

TBH I don't understand what #7 implies. The order of backslash
processing looks like an implementation detail that should not
matter for understanding the format?


Considering this, it seems to me that #5 says that
backslash-dot represents a dot unless #1 applies, and the
other #2 #3 #4 #6 #7 rules do not state anything that would
contradict that.


[1] https://www.postgresql.org/docs/current/sql-copy.html


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-05 Thread Daniel Verite
Tom Lane wrote:

> Not sure what to do here.  One idea is to install just the psql-side
> fix, which should break nothing now that version-2 protocol is dead,
> and then wait a few years before introducing the server-side change.
> That seems kind of sad though.

Wouldn't backpatching solve this?  Then only the users who don't
apply the minor updates would have non-matching server and psql.
Initially I thought that obsoleting the v2 protocol was a recent
move, but reading older messages from the list [1], I've got the
impression that it was more or less in the pipeline since way
before version 10.

Also one of the cases the patch fixes, the one where imported
data are silently truncated at the point of \., is quite nasty
IMO.
I can imagine an app where user-supplied data would be
appended row-by-row into a CSV file, to be processed
periodically in batches. Under some conditions, in particular
if newlines in the first column are allowed, a malevolent user could
submit a \. sequence to cause the batch to miss the rest of the data
without any error being raised.
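
To illustrate with the simplest failing case (a sketch; file and table
names are made up): given a single-column CSV file

  line1
  \.
  line2
  line3

an unpatched server in CSV mode stops at the \. line and reports
success for the rows read so far:

  =# copy batchtable from '/path/to/batch.csv' csv;
  COPY 1

so a lone \. line injected into the data is enough to hide everything
that follows it.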


[1]
https://www.postgresql.org/message-id/11648.1403147417%40sss.pgh.pa.us

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-05 Thread Daniel Verite
Tom Lane wrote:

> I've looked over this patch and I generally agree that this is a
> reasonable solution.

Thanks for reviewing this!

> I'm also wondering why the patch adds a test for
> "PQprotocolVersion(conn) >= 3" in handleCopyIn. 

I've removed this in the attached update.

> I concur with Robert's doubts about some of the doc changes though.
> In particular, since we're not changing the behavior for non-CSV
> mode, we shouldn't remove any statements that apply to non-CSV mode.

I don't quite understand the issues with the doc changes. Let me
detail the changes.

The first change is under the "CSV Format" section, so it does not
concern non-CSV modes.

--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -761,11 +761,7 @@ COPY count
 format, \., the end-of-data marker, could also appear
 as a data value.  To avoid any misinterpretation, a \.
 data value appearing as a lone entry on a line is automatically
-quoted on output, and on input, if quoted, is not interpreted as the
-end-of-data marker.  If you are loading a file created by another
-application that has a single unquoted column and might have a
-value of \., you might need to quote that value in the
-input file.
+quoted on output.


The part about quoting output is kept because the code still does that.

About this bit:  
 "and on input, if quoted, is not interpreted as the end-of-data marker."

The patch removes that part because \. is now not interpreted as
EOD in either case, quoted or unquoted, conforming to the spec.
Previously, what the reader should have understood by "if quoted, ..."
is that it implies "if not quoted, then \. will be interpreted as EOD
even though that behavior does not conform to the CSV spec".
If we documented the new behavior, it would be something like:
when quoted, it works as expected, and when unquoted, it works as
expected too. Isn't it simpler not to say anything?

About the next sentence:
  "If you are loading a file created by another application
  that has a single unquoted column and might have a value of \., you
  might need to quote that value in the input file."

It's removed as well because it's no longer necessary
to do that.

The other hunk is in psql doc:

--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value'
'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.

That behavior no longer happens, so this gets removed as well.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 5a16b81f10e47fe1ac3842abd34e8bef7e639e26 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Fri, 5 Apr 2024 14:22:49 +0200
Subject: [PATCH v4] Support backslash-dot on a line by itself as valid data in
 COPY FROM...CSV

---
 doc/src/sgml/ref/copy.sgml   |  6 +--
 doc/src/sgml/ref/psql-ref.sgml   |  4 --
 src/backend/commands/copyfromparse.c | 57 +++-
 src/bin/psql/copy.c  | 30 ++-
 src/test/regress/expected/copy.out   | 18 +
 src/test/regress/sql/copy.sql| 12 ++
 6 files changed, 65 insertions(+), 62 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 33ce7c4ea6..a63f073c1b 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -814,11 +814,7 @@ COPY count
 format, \., the end-of-data marker, could also appear
 as a data value.  To avoid any misinterpretation, a \.
 data value appearing as a lone entry on a line is automatically
-quoted on output, and on input, if quoted, is not interpreted as the
-end-of-data marker.  If you are loading a file created by another
-application that has a single unquoted column and might have a
-value of \., you might need to quote that value in the
-input file.
+quoted on output.

 

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cc7d797159..d94e3cacfc 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 
'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.
 
 
 
diff --git 

Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-04-02 Thread Daniel Verite
Tom Lane wrote:

> > I should say that I've noticed significant latency improvements with
> > FETCH_COUNT retrieving large resultsets, such that it would benefit
> > non-interactive use cases.
> 
> Do you have a theory for why that is?  It's pretty counterintuitive
> that it would help at all.

I've been thinking that it's a kind of pipeline/parallelism effect.
When libpq accumulates all rows in one resultset, if the network
or the server is not fast enough, it spends a certain amount of
time waiting for the data to come in.
But when it accumulates fewer rows and gives back control
to the app to display intermediate results, during that time the
network buffers can fill up, resulting, I assume, in less time waiting
overall.

I think the benefit is similar to what we get with \copy. In fact,
with the above-mentioned test, the execution times with
FETCH_COUNT=1000 look very close to a \copy of the same query.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-04-02 Thread Daniel Verite
Tom Lane wrote:

> I do not buy that psql's FETCH_COUNT mode is a sufficient reason
> to add it.  FETCH_COUNT mode is not something you'd use
> non-interactively

I should say that I've noticed significant latency improvements with
FETCH_COUNT retrieving large resultsets, such that it would benefit
non-interactive use cases.

For instance, with the current v7 patch, a query like the OP's initial
case and batches of 1000 rows:

$ cat fetchcount-test.sql

select repeat('a', 100) || '-' ||
i || '-' || repeat('b', 500) as total_pat
from generate_series(1, 5000000) as i
\g /dev/null

$ export TIMEFORMAT=%R

$ for s in $(seq 1 10); do time /usr/local/pgsql/bin/psql -At \
   -v FETCH_COUNT=1000 -f fetchcount-test.sql;  done

3.597
3.413
3.362
3.612
3.377
3.416
3.346
3.368
3.504
3.413

=> Average elapsed time = 3.44s

Now without FETCH_COUNT, fetching the 5 million rows in one resultset:

$ for s in $(seq 1 10); do time /usr/local/pgsql/bin/psql -At \
-f fetchcount-test.sql;  done

4.200
4.178
4.200
4.169
4.195
4.217
4.197
4.234
4.225
4.242

=> Average elapsed time = 4.20s

By comparison the unpatched version (cursor-based method)
gives these execution times with FETCH_COUNT=1000:

4.458
4.448
4.476
4.455
4.450
4.466
4.395
4.429
4.387
4.473

=> Average elapsed time = 4.43s

Now that's just one test, but don't these numbers look good?


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-04-01 Thread Daniel Verite
Laurenz Albe wrote:

> Here is the code review for patch number 2:

> +static void
> +CloseGOutput(FILE *gfile_fout, bool is_pipe)
> 
> It makes sense to factor out this code.
> But shouldn't these functions have a prototype at the beginning of the file?

Looking at the other static functions in psql/common.c, there
are 22 of them but only 3 have prototypes at the top of the file.
These 3 functions are called before being defined, so their prototypes
are mandatory.
The other static functions, being defined before they are called,
happen not to have forward declarations, so SetupGOutput() and
CloseGOutput() follow that model.
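
To recap the C rule at play (a generic sketch, not actual psql code):

  /* prototype mandatory: the caller appears before the definition */
  static void helper(void);

  static void caller(void)
  {
      helper();
  }

  static void helper(void)
  {
  }

  /* no forward declaration needed: defined before first use,
   * which is the model SetupGOutput() and CloseGOutput() follow */
  static void setup(void)
  {
  }

  static void run(void)
  {
      setup();
  }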

> Here is a suggestion for a consolidated comment:
> 
> Fetch the result in chunks if FETCH_COUNT is set.  We don't enable chunking
> if SHOW_ALL_RESULTS is false, since that requires us to accumulate all rows
>   before we can tell what should be displayed, which would counter the idea
>   of FETCH_COUNT.  Chunk fetching is also disabled if \gset, \crosstab,
>   \gexec and \watch are used.

OK, done like that.

> > +   if (fetch_count > 0 && result_status == PGRES_TUPLES_CHUNK)
> 
> Could it be that result_status == PGRES_TUPLES_CHUNK, but fetch_count is 0?
> if not, perhaps there should be an Assert that verifies that, and the "if"
> statement should only check for the latter condition.

Good point. In fact it can be simplified to
 if (result_status == PGRES_TUPLES_CHUNK),
and fetch_count as a variable can be removed from the function.
Done that way.


> > --- a/src/bin/psql/t/001_basic.pl
> > +++ b/src/bin/psql/t/001_basic.pl
> > @@ -184,10 +184,10 @@ like(
> > "\\set FETCH_COUNT 1\nSELECT error;\n\\errverbose",
> > on_error_stop => 0))[2],
> > qr/\A^psql:<stdin>:2: ERROR:  .*$
> > -^LINE 2: SELECT error;$
> > +^LINE 1: SELECT error;$
> 
> >  ^ *^.*$
> >  ^psql:<stdin>:3: error: ERROR:  [0-9A-Z]{5}: .*$
> > -^LINE 2: SELECT error;$
> > +^LINE 1: SELECT error;$
> 
> Why does the output change?  Perhaps there is a good and harmless
> explanation, but the naïve expectation would be that it doesn't.

Unpatched, psql builds this query:

  DECLARE _psql_cursor NO SCROLL CURSOR FOR
  <user query>

therefore the user query starts at line 2.

With the patch, the user query is sent as-is, starting at line 1,
hence the different error location.
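
As a concrete transcript (hypothetical session with FETCH_COUNT set):

  -- unpatched: the DECLARE wrapper shifts the reported line number
  ERROR:  column "error" does not exist
  LINE 2: SELECT error;

  -- patched: the query is sent verbatim
  ERROR:  column "error" does not exist
  LINE 1: SELECT error;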


> After fixing the problem manually, it builds without warning.
> The regression tests pass, and the feature works as expected.

Thanks for testing.
Updated patches are attached.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 0fefaee7d5b3003ad0d089ea9e92675c6f50245f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Mon, 1 Apr 2024 19:46:20 +0200
Subject: [PATCH v7 1/2] Implement retrieval of results in chunks with libpq.

This mode is similar to the single-row mode except that chunks
of results contain up to N rows instead of a single row.
It is meant to reduce the overhead of the row-by-row allocations
for large result sets.
The mode is selected with PQsetChunkedRowsMode(int maxRows) and results
have the new status code PGRES_TUPLES_CHUNK.
---
 doc/src/sgml/libpq.sgml   |  98 +++
 .../libpqwalreceiver/libpqwalreceiver.c   |   1 +
 src/bin/pg_amcheck/pg_amcheck.c   |   1 +
 src/interfaces/libpq/exports.txt  |   1 +
 src/interfaces/libpq/fe-exec.c| 117 +++---
 src/interfaces/libpq/libpq-fe.h   |   4 +-
 src/interfaces/libpq/libpq-int.h  |   7 +-
 7 files changed, 185 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index d3e87056f2..1814921d5a 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3545,7 +3545,20 @@ ExecStatusType PQresultStatus(const PGresult *res);
 The PGresult contains a single result 
tuple
 from the current command.  This status occurs only when
 single-row mode has been selected for the query
-(see ).
+(see ).
+   
+  
+ 
+
+ 
+  PGRES_TUPLES_CHUNK
+  
+   
+The PGresult contains several tuples
+from the current command. The count of tuples cannot exceed
+the maximum passed to .
+This status occurs only when the chunked mode has been selected
+for the query (see ).

   
  
@@ -5197,8 +5210,8 @@ PGresult *PQgetResult(PGconn *conn);
   
Another frequently-desired feature that can be obtained with
 and 
-   is retrieving large query results a row at a time.  This is discussed
-   in .
+   is retrieving large query results a limited number of rows at a time.  This 
is discussed
+   in .
   
 
   
@@ -5562,12 +5575,13 @@ int PQflush(PGconn *conn);
 
 
 
- To enter single-row mode, call PQsetSingleRowMode
- before 

Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-04-01 Thread Daniel Verite
Laurenz Albe wrote:

> I had a look at patch 0001 (0002 will follow).

Thanks for reviewing this!

I've implemented the suggested doc changes. A patch update
will follow with the next part of the review.

> > --- a/src/interfaces/libpq/fe-exec.c
> > +++ b/src/interfaces/libpq/fe-exec.c
> > @@ -41,7 +41,8 @@ char *const pgresStatus[] = {
> > "PGRES_COPY_BOTH",
> > "PGRES_SINGLE_TUPLE",
> > "PGRES_PIPELINE_SYNC",
> > -   "PGRES_PIPELINE_ABORTED"
> > +   "PGRES_PIPELINE_ABORTED",
> > +   "PGRES_TUPLES_CHUNK"
> >  };
> 
> I think that PGRES_SINGLE_TUPLE and PGRES_TUPLES_CHUNK should be next to
> each other, but that's no big thing.
> The same applies to the change in src/interfaces/libpq/libpq-fe.h

I assume we can't renumber/reorder existing values, otherwise it would be
an ABI break. We can only add new values.
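
That is, reduced to a sketch (not the real enum, which has more
members):

  typedef enum
  {
      PGRES_EMPTY_QUERY = 0,    /* existing values are baked into
                                 * applications compiled against
                                 * older libpq versions */
      PGRES_COMMAND_OK,
      PGRES_TUPLES_OK,
      /* ... */
      PGRES_TUPLES_CHUNK        /* new values go at the end; inserting
                                 * one earlier would renumber everything
                                 * after it and break the ABI */
  } ExecStatusType;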

> I understand that we need to keep the single-row mode for compatibility
> reasons.  But I think that under the hood, "single-row mode" should be the
> same as "chunk mode with chunk size one".

I've implemented it like that at first, and wasn't thrilled with the result.
libpq still has to return PGRES_SINGLE_TUPLE in single-row
mode and PGRES_TUPLES_CHUNK with chunks of size 1, so
the code sharing did not work that well in practice.
I also contemplated not creating PGRES_TUPLES_CHUNK
and instead using PGRES_SINGLE_TUPLE for N rows, but I found
it too ugly.
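
For reference, the intended calling sequence with the chunked mode is
along these lines (a sketch with error handling omitted, assuming the
function takes the connection and the chunk size, analogous to
PQsetSingleRowMode):

  #include <stdio.h>
  #include <libpq-fe.h>

  static void
  fetch_chunked(PGconn *conn, const char *query)
  {
      PGresult   *res;

      PQsendQuery(conn, query);
      PQsetChunkedRowsMode(conn, 1000); /* up to 1000 rows per PGresult */

      while ((res = PQgetResult(conn)) != NULL)
      {
          if (PQresultStatus(res) == PGRES_TUPLES_CHUNK)
              printf("chunk of %d rows\n", PQntuples(res));
          /* the final result is a zero-row PGRES_TUPLES_OK */
          PQclear(res);
      }
  }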


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Built-in CTYPE provider

2024-03-27 Thread Daniel Verite
Jeff Davis wrote:

> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
> 
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.

Even aside from ICU, there's a behavior difference between glibc
and pg_c_utf8 for codepoints in the decimal digit category
outside of the US-ASCII range '0'..'9':

select initcap(concat(chr(0xff11), 'a') collate "C.utf8");   -- glibc 2.35
 initcap 
---------
 １a

select initcap(concat(chr(0xff11), 'a') collate "pg_c_utf8");
 initcap 
---------
 １A

Both collations consider that chr(0xff11) is not a digit
(isdigit()=>false) but C.utf8 says that it's alpha, whereas pg_c_utf8
says it's neither digit nor alpha.

AFAIU this is why in the above initcap() call, pg_c_utf8 considers
that 'a' is the first alphanumeric, whereas C.utf8 considers that '1'
is the first alphanumeric, leading to different capitalizations.

Comparing the 3 providers:

WITH v(provider,type,result) AS (values
 ('ICU', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "unicode"),
 ('glibc', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "C.utf8"),
 ('builtin', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "pg_c_utf8"),
 ('ICU', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "unicode"),
 ('glibc', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "C.utf8"),
 ('builtin', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "pg_c_utf8")
 )
select * from v
\crosstabview


 provider | isalpha | isdigit 
----------+---------+---------
 ICU      | f       | t
 glibc    | t       | f
 builtin  | f       | f


Are we fine with pg_c_utf8 differing from both ICU's point of view
(U+ff11 is a digit and not alpha) and glibc's point of view (U+ff11 is
not a digit, but is alpha)?

Aside from initcap(), this is going to be significant for regular
expressions.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-02-12 Thread Daniel Verite
Jakub Wartak wrote:

> when I run with default pager (more or less):
> \set FETCH_COUNT 1000
> WITH data AS (SELECT  generate_series(1, 2000) as Total) select
> repeat('a',100) || data.Total || repeat('b', 800) as total_pat from
> data;
> -- it enters the pager, I skip a couple of pages and then "q"
> 
> .. then - both backend and psql - go into 100% CPU as if they were
> still receiving

Thanks for looking into this patch!

What's happening after the pager has quit is that psql continues
to pump results from the server until there are no more results.

If the user wants to interrupt that, they should hit Ctrl+C to
cancel the query. I think psql should not cancel it implicitly
on their behalf, as it also cancels the transaction.

The behavior differs from the cursor implementation, because in
the cursor case, when the pager is displaying results, no query is
running. The previous FETCH results have been entirely
read, and the next FETCH has not been sent to the server yet.
This is why quitting the pager in the middle of this can
be dealt with instantly.

> (that doesn't happen e.g. with export PAGER=cat).  So I'm
> not sure, maybe ExecQueryAndProcessResults() should abort somewhat
> faster when the $PAGER exits normally(?).

I assume that when using PAGER=cat, you cancel the display
with Ctrl+C, which propagates to psql and has the effect
of also canceling the query. In that case it displays
"Cancel request sent",
and then shortly after it gets back from the server:
"ERROR:  canceling statement due to user request".
That case corresponds to the generic query-canceling flow.

OTOH when killing the "cat" process with kill -TERM, I see the same
behavior as with "more" or "less", that is, postgres running
the query to completion and psql pumping the results.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-01-30 Thread Daniel Verite
vignesh C wrote:

> patching file src/interfaces/libpq/exports.txt
> Hunk #1 FAILED at 191.
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/interfaces/libpq/exports.txt.rej
> 
> Please post an updated version for the same.

PFA a rebased version.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 8cfb82b1e36e96996637948a231ae35b9af1e074 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Tue, 30 Jan 2024 14:38:21 +0100
Subject: [PATCH v6 1/2] Implement retrieval of results in chunks with libpq.

This mode is similar to the single-row mode except that chunks
of results contain up to N rows instead of a single row.
It is meant to reduce the overhead of the row-by-row allocations
for large result sets.
The mode is selected with PQsetChunkedRowsMode(int maxRows) and results
have the new status code PGRES_TUPLES_CHUNK.
---
 doc/src/sgml/libpq.sgml   |  96 ++
 .../libpqwalreceiver/libpqwalreceiver.c   |   1 +
 src/bin/pg_amcheck/pg_amcheck.c   |   1 +
 src/interfaces/libpq/exports.txt  |   1 +
 src/interfaces/libpq/fe-exec.c| 117 +++---
 src/interfaces/libpq/libpq-fe.h   |   4 +-
 src/interfaces/libpq/libpq-int.h  |   7 +-
 7 files changed, 184 insertions(+), 43 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index d0d5aefadc..f7f5a04df6 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3537,7 +3537,20 @@ ExecStatusType PQresultStatus(const PGresult *res);
 The PGresult contains a single result 
tuple
 from the current command.  This status occurs only when
 single-row mode has been selected for the query
-(see ).
+(see ).
+   
+  
+ 
+
+ 
+  PGRES_TUPLES_CHUNK
+  
+   
+The PGresult contains several tuples
+from the current command. The count of tuples cannot exceed
+the maximum passed to .
+This status occurs only when the chunked mode has been selected
+for the query (see ).

   
  
@@ -5189,8 +5202,8 @@ PGresult *PQgetResult(PGconn *conn);
   
Another frequently-desired feature that can be obtained with
 and 
-   is retrieving large query results a row at a time.  This is discussed
-   in .
+   is retrieving large query results a limited number of rows at a time.  This 
is discussed
+   in .
   
 
   
@@ -5554,12 +5567,13 @@ int PQflush(PGconn *conn);
 
 
 
- To enter single-row mode, call PQsetSingleRowMode
- before retrieving results with PQgetResult.
- This mode selection is effective only for the query currently
- being processed. For more information on the use of
- PQsetSingleRowMode,
- refer to .
+ To enter single-row or chunked modes, call
+ respectively PQsetSingleRowMode
+ or PQsetChunkedRowsMode before retrieving results
+ with PQgetResult.  This mode selection is effective
+ only for the query currently being processed. For more information on the
+ use of these functions refer
+ to .
 
 
 
@@ -5926,10 +5940,10 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
   
  
 
- 
-  Retrieving Query Results Row-by-Row
+ 
+  Retrieving Query Results by chunks
 
-  
+  
libpq
single-row mode
   
@@ -5940,13 +5954,15 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
PGresult.  This can be unworkable for commands
that return a large number of rows.  For such cases, applications can use
 and  
in
-   single-row mode.  In this mode, the result row(s) are
-   returned to the application one at a time, as they are received from the
-   server.
+   single-row mode or chunked 
mode.
+   In these modes, the result row(s) are returned to the application one at a
+   time for the single-row mode and by chunks for the chunked mode, as they
+   are received from the server.
   
 
   
-   To enter single-row mode, call 
+   To enter these modes, call 
+or 
immediately after a successful call of 
(or a sibling function).  This mode selection is effective only for the
currently executing query.  Then call 
@@ -5954,7 +5970,8 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
linkend="libpq-async"/>.  If the query returns any rows, they are returned
as individual PGresult objects, which look like
normal query results except for having status code
-   PGRES_SINGLE_TUPLE instead of
+   PGRES_SINGLE_TUPLE for the single-row mode and
+   PGRES_TUPLES_CHUNK for the chunked mode, instead of
PGRES_TUPLES_OK.  After the last row, or immediately if
the query returns zero rows, a zero-row object with status
PGRES_TUPLES_OK is returned; this is the signal that no
@@ -5967,9 +5984,9 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
   
 
   
-   When using pipeline mode, 

Re: Fixing backslash dot for COPY FROM...CSV

2024-01-24 Thread Daniel Verite
Robert Haas wrote:

> Those links unfortunately seem not to be entirely specific to this
> issue. Other, related things seem to be discussed there, and it's not
> obvious that everyone agrees on what to do, or really that anyone
> agrees on what to do. The best link that I found for this exact issue
> is
> https://www.postgresql.org/message-id/e1tdnvq-0001ju...@wrigleys.postgresql.org
> but the thread isn't very conclusive and is so old that any
> conclusions reached then might no longer be considered valid today.

To refresh the problem statement, 4 cases that need fixing as
of HEAD can be distinguished:

#1. copy csv from file, single column, no quoting involved.
COPY will stop at \. and ignore the rest of the file without
any error or warning.

$ cat >/tmp/file.csv 

Re: Built-in CTYPE provider

2024-01-18 Thread Daniel Verite
Peter Eisentraut wrote:

> > If the Postgres default was bytewise sorting+locale-agnostic
> > ctype functions directly derived from Unicode data files,
> > as opposed to libc/$LANG at initdb time, the main
> > annoyance would be that "ORDER BY textcol" would no
> > longer be the human-favored sort.
> 
> I think that would be a terrible direction to take, because it would 
> regress the default sort order from "correct" to "useless".  Aside from 
> the overall message this sends about how PostgreSQL cares about
> locales and Unicode and such.

Well, offering a viable solution to avoid as much as possible
the dreaded:

"WARNING: collation "xyz" has version mismatch
... HINT: Rebuild all objects affected by this collation..."

that doesn't sound like a bad message to send.

Currently, to get codepoint order for the indexes that don't need a
linguistic order, you're supposed to use COLLATE "C", which then means
that upper(), lower(), etc. don't work beyond ASCII.
Here our Unicode support is not good enough, and the proposal
addresses that.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Fixing backslash dot for COPY FROM...CSV

2024-01-16 Thread Daniel Verite
Robert Haas wrote:

> Part of my hesitancy, I suppose, is that I don't
> understand why we even have this strange convention of making \.
> terminate the input in the first place -- I mean, why wouldn't that be
> done in some kind of out-of-band way, rather than including a special
> marker in the data?

The v3 protocol added the out-of-band method, but the v2 protocol
did not have it, and as far as I understand, this is the reason why
CopyReadLineText() must interpret \. as an end-of-data marker.

The v2 protocol was removed in pg14
https://www.postgresql.org/docs/release/14.0/

  Remove server and libpq support for the version 2 wire protocol (Heikki
Linnakangas)
  This was last used as the default in PostgreSQL 7.3 (released in 2002).


Also I hadn't noticed this before, but the current doc has this mention
that is relevant to this patch:

https://www.postgresql.org/docs/current/protocol-changes.html
"Summary of Changes since Protocol 2.0"

  COPY data is now encapsulated into CopyData and CopyDone
  messages. There is a well-defined way to recover from errors during
  COPY. The special “\.” last line is not needed anymore, and is not
  sent during COPY OUT. (It is still recognized as a terminator during
  COPY IN, but its use is deprecated and will eventually be removed.)


What the present patch does, for the server-side part, is essentially
stop recognizing "\." as a terminator, like this paragraph says, but
it does that for CSV only, not for TEXT.

> Hmm. Looking at the rest of the patch, it seems like you're removing
> the logic that prevents us from interpreting
> 
> \. lksdghksdhgjskdghjs
> 
> as an end-of-file while in CSV mode. But I would have thought based on
> what problem you're trying to fix that you would have wanted to keep
> that logic and further restrict it so that it only applies when not
> within a quoted string.
> 
> Maybe I'm misunderstanding what bug you're trying to fix?

The fix is that \. is no longer recognized as special in CSV, whether
alone on a line or not, and whether in a quoted section or not.
It's always interpreted as data, like it would have been in
the first place, I imagine, if the v2 protocol could have handled
it. This is why the patch consists mostly of removing code and
simplifying comments.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Built-in CTYPE provider

2024-01-15 Thread Daniel Verite
Jeff Davis wrote:

> New version attached.

[v16]

Concerning the target category_test, it produces failures with
versions of ICU with Unicode < 15. The first one I see with Ubuntu
22.04 (ICU 70.1) is:

category_test: Postgres Unicode version: 15.1
category_test: ICU Unicode version: 14.0
category_test: FAILURE for codepoint 0x000c04
category_test: Postgres property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 1/0/0/0/0/0
category_test: ICU property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 0/0/0/0/0/0

U+0C04 is a codepoint added in Unicode 11.
https://en.wikipedia.org/wiki/Telugu_(Unicode_block)

In UnicodeData.txt:
0C04;TELUGU SIGN COMBINING ANUSVARA ABOVE;Mn;0;NSM;N;

In Unicode 15, it has been assigned Other_Alphabetic in PropList.txt:
$ grep 0C04 PropList.txt 
0C04          ; Other_Alphabetic # Mn       TELUGU SIGN COMBINING ANUSVARA ABOVE

But in Unicode 14 it was not there.
As a result its binary property UCHAR_ALPHABETIC has changed from
false to true in ICU 72 vs previous versions.
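
The change can be observed directly with ICU's C API (a minimal
sketch; u_hasBinaryProperty() is the stock ICU function):

  #include <stdio.h>
  #include <unicode/uchar.h>

  int main(void)
  {
      UChar32 cp = 0x0C04; /* TELUGU SIGN COMBINING ANUSVARA ABOVE */

      /* prints 0 when ICU is based on Unicode <= 14 (before ICU 72),
       * and 1 from ICU 72 (Unicode 15) onwards */
      printf("UCHAR_ALPHABETIC(U+0C04) = %d\n",
             (int) u_hasBinaryProperty(cp, UCHAR_ALPHABETIC));
      return 0;
  }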

As I understand, the stability policy says that such things happen.
From https://www.unicode.org/policies/stability_policy.html

   Once a character is encoded, its properties may still be changed,
   but not in such a way as to change the fundamental identity of the
   character.

   The Consortium will endeavor to keep the values of the other
   properties as stable as possible, but some circumstances may arise
   that require changing them. Particularly in the situation where
   the Unicode Standard first encodes less well-documented characters
   and scripts, the exact character properties and behavior initially
   may not be well known.

   As more experience is gathered in implementing the characters,
   adjustments in the properties may become necessary. Examples of
   such properties include, but are not limited to, the following:

- General_Category
- Case mappings
- Bidirectional properties
[...]

I've commented out the exit(1) in category_test to collect all errors, and
built it with versions of ICU from 74 down to 60 (that is, Unicode 10.0).
Results are attached. As expected, the older the ICU version, the more
differences are found against Unicode 15.1.

I find these results interesting because they tell us what contents
can break regexp-based check constraints on upgrades.

But about category_test as a pass-or-fail kind of test, it can only be
used when the Unicode version in ICU is the same as in Postgres.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite


results-category-tests-multiple-icu-versions.tar.bz2
Description: Binary data


Re: Built-in CTYPE provider

2024-01-12 Thread Daniel Verite
Jeff Davis wrote:

> > Jeremy also raised a problem with old versions of psql connecting to
> > a
> > new server: the \l and \dO won't work. Not sure exactly what to do
> > there, but I could work around it by adding a new field rather than
> > renaming (though that's not ideal).
> 
> I did a bit of research for a precedent, and the closest I could find
> is that \dp was broken between 14 and 15 by commit 07eee5a0dc.

Another one is that version 12 broke \d in older psql by
removing pg_class.relhasoids.

ISTM that in general the behavior of old psql vs new server does
not weigh much against choosing optimal catalog changes.

There's also that warning at start informing users about it:
WARNING: psql major version X, server major version Y.
 Some psql features might not work.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Built-in CTYPE provider

2024-01-10 Thread Daniel Verite
Jeff Davis wrote:

> Attached a more complete version that fixes a few bugs

[v15 patch]

When selecting the builtin provider with initdb, I'm getting the
following setup:

$ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
 
  The database cluster will be initialized with this locale configuration:
default collation provider:  builtin
default collation locale:C.UTF-8
LC_COLLATE:  C.UTF-8
LC_CTYPE:C.UTF-8
LC_MESSAGES: C.UTF-8
LC_MONETARY: C.UTF-8
LC_NUMERIC:  C.UTF-8
LC_TIME: C.UTF-8
  The default database encoding has accordingly been set to "UTF8".
  The default text search configuration will be set to "english".

This is from an environment where LANG=fr_FR.UTF-8

I would expect all LC_* variables to be fr_FR.UTF-8, and the default
text search configuration to be "french".  It is what happens
when selecting ICU as the provider in the same environment:

$ bin/initdb --icu-locale=en --locale-provider=icu -D/tmp/pgdata

  Using language tag "en" for ICU locale "en".
  The database cluster will be initialized with this locale configuration:
default collation provider:  icu
default collation locale:en
LC_COLLATE:  fr_FR.UTF-8
LC_CTYPE:fr_FR.UTF-8
LC_MESSAGES: fr_FR.UTF-8
LC_MONETARY: fr_FR.UTF-8
LC_NUMERIC:  fr_FR.UTF-8
LC_TIME: fr_FR.UTF-8
  The default database encoding has accordingly been set to "UTF8".
  The default text search configuration will be set to "french".

The collation setup does not influence the rest of the localization.
The problem AFAIU is that --locale has two distinct
meanings in the v15 patch:
--locale-provider=X --locale=Y means both "use X as the provider
with Y as datlocale" and "use Y as the locale for all
localized libc functionality".

I wonder what would happen when invoking
 bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
on a system where C.UTF-8 does not exist as a libc locale.
Would it fail? (I don't have an OS like this to test ATM; will try later.)

A related comment is about naming the builtin locale C.UTF-8, the same
name as in libc. On one hand this is semantically sound, but on the
other hand, it's likely to confuse people. What about using completely
different names, like "pg_unicode" or something else prefixed by "pg_"
both for the locale name and the collation name (currently
C.UTF-8/c_utf8)?



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-01-02 Thread Daniel Verite
  Hi,

PFA a rebased version.

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From cd0fe1d517a0e31e031fbbea1e603a715c77ea97 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Tue, 2 Jan 2024 14:15:48 +0100
Subject: [PATCH v5 1/2] Implement retrieval of results in chunks with libpq.

This mode is similar to the single-row mode except that chunks
of results contain up to N rows instead of a single row.
It is meant to reduce the overhead of the row-by-row allocations
for large result sets.
The mode is selected with PQsetChunkedRowsMode(int maxRows) and results
have the new status code PGRES_TUPLES_CHUNK.
---
 doc/src/sgml/libpq.sgml   |  96 ++
 .../libpqwalreceiver/libpqwalreceiver.c   |   1 +
 src/bin/pg_amcheck/pg_amcheck.c   |   1 +
 src/interfaces/libpq/exports.txt  |   1 +
 src/interfaces/libpq/fe-exec.c| 117 +++---
 src/interfaces/libpq/libpq-fe.h   |   4 +-
 src/interfaces/libpq/libpq-int.h  |   7 +-
 7 files changed, 184 insertions(+), 43 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index ed88ac001a..8007bf67d8 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3537,7 +3537,20 @@ ExecStatusType PQresultStatus(const PGresult *res);
 The PGresult contains a single result 
tuple
 from the current command.  This status occurs only when
 single-row mode has been selected for the query
-(see ).
+(see ).
+   
+  
+ 
+
+ 
+  PGRES_TUPLES_CHUNK
+  
+   
+The PGresult contains several tuples
+from the current command. The count of tuples cannot exceed
+the maximum passed to .
+This status occurs only when the chunked mode has been selected
+for the query (see ).

   
  
@@ -5187,8 +5200,8 @@ PGresult *PQgetResult(PGconn *conn);
   
Another frequently-desired feature that can be obtained with
 and 
-   is retrieving large query results a row at a time.  This is discussed
-   in .
+   is retrieving large query results a limited number of rows at a time.  This 
is discussed
+   in .
   
 
   
@@ -5551,12 +5564,13 @@ int PQflush(PGconn *conn);
 
 
 
- To enter single-row mode, call PQsetSingleRowMode
- before retrieving results with PQgetResult.
- This mode selection is effective only for the query currently
- being processed. For more information on the use of
- PQsetSingleRowMode,
- refer to .
+ To enter single-row or chunked modes, call
+ respectively PQsetSingleRowMode
+ or PQsetChunkedRowsMode before retrieving results
+ with PQgetResult.  This mode selection is effective
+ only for the query currently being processed. For more information on the
+ use of these functions refer
+ to .
 
 
 
@@ -5895,10 +5909,10 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
   
  
 
- 
-  Retrieving Query Results Row-by-Row
+ 
+  Retrieving Query Results by chunks
 
-  
+  
libpq
single-row mode
   
@@ -5909,13 +5923,15 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
PGresult.  This can be unworkable for commands
that return a large number of rows.  For such cases, applications can use
 and  
in
-   single-row mode.  In this mode, the result row(s) are
-   returned to the application one at a time, as they are received from the
-   server.
+   single-row mode or chunked 
mode.
+   In these modes, the result row(s) are returned to the application one at a
+   time for the single-row mode and by chunks for the chunked mode, as they
+   are received from the server.
   
 
   
-   To enter single-row mode, call 
+   To enter these modes, call 
+or 
immediately after a successful call of 
(or a sibling function).  This mode selection is effective only for the
currently executing query.  Then call 
@@ -5923,7 +5939,8 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
linkend="libpq-async"/>.  If the query returns any rows, they are returned
as individual PGresult objects, which look like
normal query results except for having status code
-   PGRES_SINGLE_TUPLE instead of
+   PGRES_SINGLE_TUPLE for the single-row mode and
+   PGRES_TUPLES_CHUNK for the chunked mode, instead of
PGRES_TUPLES_OK.  After the last row, or immediately if
the query returns zero rows, a zero-row object with status
PGRES_TUPLES_OK is returned; this is the signal that no
@@ -5936,9 +5953,9 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
   
 
   
-   When using pipeline mode, single-row mode needs to be activated for each
-   query in the pipeline before retrieving results for that query
-   with PQgetResult.
+   When using pipeline mode, the single-row or chunked mode need to be
+   activated for each query 

Re: Fixing backslash dot for COPY FROM...CSV

2023-12-31 Thread Daniel Verite
  Hi,

The CI patch tester fails on this patch, because it has a label
at the end of a C block, which I'm learning is a C23 feature
that happens to be supported by gcc 11 [1], but is not portable.

PFA an update fixing this, plus removing an obsolete chunk
in the COPY documentation that v2 left out.


[1] https://gcc.gnu.org/onlinedocs/gcc/Mixed-Labels-and-Declarations.html


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 98f38aff440ad683aa1bd7a30fd960f1d0101191 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Sun, 31 Dec 2023 14:47:05 +0100
Subject: [PATCH v4] Support backslash-dot on a line by itself as valid data in
 COPY FROM...CSV

---
 doc/src/sgml/ref/copy.sgml   |  6 +--
 doc/src/sgml/ref/psql-ref.sgml   |  4 --
 src/backend/commands/copyfromparse.c | 57 +++-
 src/bin/psql/copy.c  | 32 +++-
 src/test/regress/expected/copy.out   | 18 +
 src/test/regress/sql/copy.sql| 12 ++
 6 files changed, 67 insertions(+), 62 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69c33..e719053e3d 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -761,11 +761,7 @@ COPY count
 format, \., the end-of-data marker, could also appear
 as a data value.  To avoid any misinterpretation, a \.
 data value appearing as a lone entry on a line is automatically
-quoted on output, and on input, if quoted, is not interpreted as the
-end-of-data marker.  If you are loading a file created by another
-application that has a single unquoted column and might have a
-value of \., you might need to quote that value in the
-input file.
+quoted on output.

 

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cc7d797159..d94e3cacfc 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.
 
 
 
diff --git a/src/backend/commands/copyfromparse.c 
b/src/backend/commands/copyfromparse.c
index f553734582..67e046d450 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -136,14 +136,6 @@ if (1) \
} \
 } else ((void) 0)
 
-/* Undo any read-ahead and jump out of the block. */
-#define NO_END_OF_COPY_GOTO \
-if (1) \
-{ \
-   input_buf_ptr = prev_raw_ptr + 1; \
-   goto not_end_of_copy; \
-} else ((void) 0)
-
 /* NOTE: there's a copy of this in copyto.c */
 static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";
 
@@ -1137,7 +1129,6 @@ CopyReadLineText(CopyFromState cstate)
 	bool		result = false;
 
 	/* CSV variables */
-	bool		first_char_in_line = true;
 	bool		in_quote = false,
 				last_was_esc = false;
 	char		quotec = '\0';
@@ -1335,10 +1326,9 @@ CopyReadLineText(CopyFromState cstate)
}
 
 	/*
-	 * In CSV mode, we only recognize \. alone on a line.  This is because
-	 * \. is a valid CSV data value.
+	 * In CSV mode, backslash is a normal character.
 	 */
-   if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+   if (c == '\\' && !cstate->opts.csv_mode)
{
 			char		c2;
 
@@ -1371,21 +1361,15 @@ CopyReadLineText(CopyFromState cstate)
 
 			if (c2 == '\n')
 			{
-				if (!cstate->opts.csv_mode)
-					ereport(ERROR,
-							(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
-							 errmsg("end-of-copy marker does not match previous newline style")));
-				else
-					NO_END_OF_COPY_GOTO;
+				ereport(ERROR,
+						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+						 errmsg("end-of-copy marker does not match previous newline style")));
 			}
 			else if 

Re: Built-in CTYPE provider

2023-12-22 Thread Daniel Verite
Robert Haas wrote:

> For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8,
> a change to C.utf8 would be a much bigger problem, I would
> think. Their alphabet isn't in code point order, and so things would
> be alphabetized wrongly.

> That might be OK if they don't care about ordering for any purpose
> other than equality lookups, but otherwise it's going to force them
> to change the default, where today they don't have to do that.

Sure, in whatever collation setup we expose, we need to keep
it possible and even easy to sort properly with linguistic rules.

But some reasons to use $LANG as the default locale/collation
are no longer as good as they used to be, I think.

Starting with v10/ICU we have many pre-created ICU locales with
fixed names, and starting with v16, we can simply write "ORDER BY
textfield COLLATE unicode" which is good enough in most cases. So
the configuration "bytewise sort by default" / "linguistic sort on-demand"
has become more realistic.

By contrast in the pre-v10 days with only libc collations, an
application could have no idea which collations were going to be
available on the server, and how they were named precisely, as this
varies across OSes and across installs even with the same OS.
On Windows, I think that before v16 initdb did not create any libc
collation beyond C/POSIX and the default language/region of the OS.

In that libc context, if a db wants the C locale by default for
performance and truly immutable indexes, but the client app needs to
occasionally do in-db linguistic sorts, the app needs to figure out
which collation name will work for that. This is hard if you don't
target a specific installation that guarantees that one collation or
another is going to be installed.
Whereas if the linguistic locale is the default, the app never needs
to know its name or anything about it. So it's done that way,
linguistic by default. But that leaves databases with many
indexes sorted linguistically instead of bytewise for fields
that semantically never need any linguistic sort.
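
For what it's worth, the alternative configuration described above can
be spelled out like this (hypothetical database and table names):

  -- bytewise default: stable and truly immutable indexes
  CREATE DATABASE appdb TEMPLATE template0 LOCALE 'C';

  -- linguistic order only where presentation requires it (v16+)
  SELECT name FROM customers ORDER BY name COLLATE "unicode";

  -- or a region-specific ICU collation
  SELECT name FROM customers ORDER BY name COLLATE "fr-x-icu";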


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Fixing backslash dot for COPY FROM...CSV

2023-12-21 Thread Daniel Verite
vignesh C wrote:

> Thanks for the updated patch, any reason why this is handled only in csv.
> postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out';
> COPY 1
> postgres=# select * from test1;
>  c1
> ---
> line1
> (1 row)

I believe it's safer to not change anything in the normal "non-csv"
text mode.
The current doc says that \. will not be taken as data in this format.
From https://www.postgresql.org/docs/current/sql-copy.html :

   Any other backslashed character that is not mentioned in the above
   table will be taken to represent itself. However, beware of adding
   backslashes unnecessarily, since that might accidentally produce a
   string matching the end-of-data marker (\.) or the null string (\N
   by default). These strings will be recognized before any other
   backslash processing is done.



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Built-in CTYPE provider

2023-12-20 Thread Daniel Verite
Jeff Davis wrote:


> But there are a lot of users for whom neither of those things are true,
> and it makes zero sense to order all of the text indexes in the
> database according to any one particular locale. I think these users
> would prioritize stability and performance for the database collation,
> and then use COLLATE clauses with ICU collations where necessary.

+1

> I am also still concerned that we have the wrong defaults. Almost
> nobody thinks libc is a great provider, but that's the default, and
> there were problems trying to change that default to ICU in 16. If we
> had a builtin provider, that might be a better basis for a default
> (safe, fast, always available, and documentable). Then, at least if
> someone picks a different locale at initdb time, they would be doing so
> intentionally, rather than implicitly accepting index corruption risks
> based on an environment variable.

Yes. The introduction of the bytewise-sorting, locale-agnostic
C.UTF-8 in glibc is also a step in the direction of providing better
defaults for apps like Postgres, that need both long-term stability
in sorts and Unicode coverage for ctype-dependent functions.

But C.UTF-8 is not available everywhere, and there's still the
problem that Unicode updates through libc are not aligned
with Postgres releases.

ICU has the advantage of cross-OS compatibility,
but it does not provide any collation with bytewise sorting
like C or C.UTF-8, and we don't allow a combination like
"C" for sorting and ICU for ctype operations. When opting
for a locale provider, it has to be for both sorting
and ctype, so an installation that needs cross-OS
compatibility, good Unicode support and long-term stability
of indexes cannot get that with ICU as we expose it
today.

If the Postgres default was bytewise sorting+locale-agnostic
ctype functions directly derived from Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer be the human-favored sort.
For the presentation layer, we would have to write for instance
 ORDER BY textcol COLLATE "unicode" for the root collation
or a specific region-country if needed.
But all the rest seems better, especially cross-OS compatibility,
truly immutable and faster indexes for fields that
don't require linguistic ordering, and alignment between Unicode
updates and Postgres updates.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Fixing backslash dot for COPY FROM...CSV

2023-12-19 Thread Daniel Verite
vignesh C wrote:

> I noticed that these tests are passing without applying patch too:

> +insert into copytest2(test) values('line1'), ('\.'), ('line2');
> +copy (select test from copytest2 order by test collate "C") to :'filename'
> csv;
> +-- get the data back in with copy
> +truncate copytest2;
> +copy copytest2(test) from :'filename' csv;
> +select test from copytest2 order by test collate "C";
> 
> I was not sure if this was intentional. Can we add a test which fails
> in HEAD and passes with the patch applied.

Thanks for checking this out.
Indeed, that was not intentional. I've been using static files
in my tests and forgot that if the data was produced with
COPY OUT, it would quote backslash-dot so that COPY IN could
reload it without problem.

PFA an updated version that uses \qecho to produce the
data instead of COPY OUT. This test on unpatched HEAD
shows that copytest2 is missing 2 rows after COPY IN.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 0137963134ac88b009a8e93d6a39cabfaefe43f5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Tue, 19 Dec 2023 11:49:35 +0100
Subject: [PATCH v2] Support backslash-dot on a line by itself as valid data in
 COPY FROM...CSV

---
 doc/src/sgml/ref/psql-ref.sgml   |  4 
 src/backend/commands/copyfromparse.c | 13 ++-
 src/bin/psql/copy.c  | 32 
 src/test/regress/expected/copy.out   | 18 
 src/test/regress/sql/copy.sql| 12 +++
 5 files changed, 55 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cc7d797159..d94e3cacfc 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.
 
 
 
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index f553734582..b4c291b25b 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1137,7 +1137,6 @@ CopyReadLineText(CopyFromState cstate)
 	bool		result = false;
 
 	/* CSV variables */
-	bool		first_char_in_line = true;
 	bool		in_quote = false,
 				last_was_esc = false;
 	char		quotec = '\0';
@@ -1335,10 +1334,9 @@ CopyReadLineText(CopyFromState cstate)
 		}
 
 		/*
-		 * In CSV mode, we only recognize \. alone on a line.  This is because
-		 * \. is a valid CSV data value.
+		 * In CSV mode, \. is a valid CSV data value anywhere in the line.
 		 */
-		if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+		if (c == '\\' && !cstate->opts.csv_mode)
 		{
 			char		c2;

@@ -1442,14 +1440,7 @@ CopyReadLineText(CopyFromState cstate)
 			}
 		}
 
-		/*
-		 * This label is for CSV cases where \. appears at the start of a
-		 * line, but there is more text after it, meaning it was a data value.
-		 * We are more strict for \. in CSV mode because \. could be a data
-		 * value, while in non-CSV mode, \. cannot be a data value.
-		 */
 not_end_of_copy:
-		first_char_in_line = false;
 	}							/* end of outer loop */
 
 	/*
diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c
index dbbbdb8898..f31a7acbb6 100644
--- a/src/bin/psql/copy.c
+++ b/src/bin/psql/copy.c
@@ -620,20 +620,34 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)
 			/* current line is done? */
 			if (buf[buflen - 1] == '\n')
 			{
-				/* check for EOF marker, but not on a partial line */
-				if (at_line_begin)
+				/*
+				 * When at the beginning of the line, check for EOF marker.
+				 * If the marker is found and the data is inlined,
+				 * we must stop at this point.  If not, the \. line can be
+				 * sent to the server, and we let it decide whether
+

Fixing backslash dot for COPY FROM...CSV

2023-12-18 Thread Daniel Verite
  Hi,

PFA a patch that attempts to fix the bug that \. on a line
by itself is handled incorrectly by COPY FROM ... CSV.
This issue has been discussed several times previously,
for instance in [1] and [2], and mentioned in the
doc for \copy in commit 42d3125.

There's one case that works today: when
the line is part of a multi-line quoted section,
and the data is read from a file, not from the client.
In other situations, an error is raised or the data is cut at
the point of \. without an error.

The patch addresses that issue in the server and in psql,
except for the case of inlined data, where \. cannot be 
both valid data and an EOF marker at the same time, so
it keeps treating it as an EOF marker. 


[1]
https://www.postgresql.org/message-id/10e3eff6-eb04-4b3f-aeb4-b920192b9...@manitou-mail.org
[2]
https://www.postgresql.org/message-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05%40app.fastmail.com


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 868b1e065cf714f315bab65129fd05a1d60fc81b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Mon, 18 Dec 2023 20:47:02 +0100
Subject: [PATCH v1] Support backslash-dot on a line by itself as valid data in
 COPY FROM...CSV

---
 doc/src/sgml/ref/psql-ref.sgml   |  4 
 src/backend/commands/copyfromparse.c | 13 ++-
 src/bin/psql/copy.c  | 32 
 src/test/regress/expected/copy.out   | 15 +
 src/test/regress/sql/copy.sql| 11 ++
 5 files changed, 51 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cc7d797159..d94e3cacfc 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.
 
 
 
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index f553734582..b4c291b25b 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1137,7 +1137,6 @@ CopyReadLineText(CopyFromState cstate)
 	bool		result = false;
 
 	/* CSV variables */
-	bool		first_char_in_line = true;
 	bool		in_quote = false,
 				last_was_esc = false;
 	char		quotec = '\0';
@@ -1335,10 +1334,9 @@ CopyReadLineText(CopyFromState cstate)
 		}
 
 		/*
-		 * In CSV mode, we only recognize \. alone on a line.  This is because
-		 * \. is a valid CSV data value.
+		 * In CSV mode, \. is a valid CSV data value anywhere in the line.
 		 */
-		if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+		if (c == '\\' && !cstate->opts.csv_mode)
 		{
 			char		c2;

@@ -1442,14 +1440,7 @@ CopyReadLineText(CopyFromState cstate)
 			}
 		}
 
-		/*
-		 * This label is for CSV cases where \. appears at the start of a
-		 * line, but there is more text after it, meaning it was a data value.
-		 * We are more strict for \. in CSV mode because \. could be a data
-		 * value, while in non-CSV mode, \. cannot be a data value.
-		 */
 not_end_of_copy:
-		first_char_in_line = false;
 	}							/* end of outer loop */
 
 	/*
diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c
index dbbbdb8898..f31a7acbb6 100644
--- a/src/bin/psql/copy.c
+++ b/src/bin/psql/copy.c
@@ -620,20 +620,34 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)
 			/* current line is done? */
 			if (buf[buflen - 1] == '\n')
 			{
-				/* check for EOF marker, but not on a partial line */
-				if (at_line_begin)
+				/*
+				 * When at the beginning of the line, check for EOF marker.
+				 * If the marker is found and the data is inlined,
+				 * we must stop at this point.  If not, the \. line can be
+				 * sent to the server, and we let it decide whether
+				 * it's an 

Re: Built-in CTYPE provider

2023-12-13 Thread Daniel Verite
Jeff Davis wrote:

> While "full" case mapping sounds more complex, there are actually
> very few cases to consider and they are covered in another (small)
> data file. That data file covers ~100 code points that convert to
> multiple code points when the case changes (e.g. "ß" -> "SS"), 7
> code points that have context-sensitive mappings, and three locales
> which have special conversions ("lt", "tr", and "az") for a few code
> points.

But there are CLDR mappings on top of that.

According to the Unicode FAQ

   https://unicode.org/faq/casemap_charprop.html#5

   Q: Does the default case mapping work for every language? What
   about the default case folding?

   [...]

   To make case mapping language sensitive, the Unicode Standard
   specifically allows implementations to tailor the mappings for
   each language, but does not provide the necessary data. The file
   SpecialCasing.txt is included in the Standard as a guide to a few
   of the more important individual character mappings needed for
   specific languages, notably the Greek script and the Turkic
   languages. However, for most language-specific mappings and
   tailoring, users should refer to CLDR and other resources.

In particular "el" (modern greek) has case mapping rules that
ICU seems to implement, but "el" is missing from the list
("lt", "tr", and "az") you identified.

The CLDR case mappings seem to be found in
https://github.com/unicode-org/cldr/tree/main/common/transforms
in *-Lower.xml and *-Upper.xml


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Daniel Verite
Dave Cramer wrote:

> > This argument for leaving 3 as the column count makes sense to me.  I
> > agree this content is not meant to facilitate interpreting the contents at
> > a protocol level.
> >
> 
> I'd disagree. From my POV if the data comes back as a JSON Array this is
> one object and this should be reflected in the column count.

The doc says this:
"Int16
The number of columns in the data to be copied (denoted N below)."

and this formulation is repeated in PQnfields() for libpq:

"PQnfields
Returns the number of columns (fields) to be copied."

How to interpret that sentence? 
"to be copied" from what, into what, and by what way?

A plausible interpretation is "to be copied from the source data
into the COPY stream, by the backend".  So the number of columns
to be copied likely refers to the columns of the dataset, not the
"in-transit form" that is text or csv or json.

The interpretation you're proposing also makes sense, that it's just
one json column per row, or even a single-row single-column for the
entire dataset in the force_array case, but then the question is why
isn't that number of columns always 1 for the original "text" format,
since each row is represented in the stream as a single long piece of
text?


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Daniel Verite
Joe Conway wrote:

> copyto_json.007.diff

When the source has json fields with non-significant line feeds, the COPY
output has these line feeds too, which makes the output incompatible
with rule #2 at https://jsonlines.org  ("2. Each Line is a Valid JSON
Value").

create table j(f json);

insert into j values('{"a":1,
"b":2
}');

copy j to stdout (format json);

Result:
{"f":{"a":1,
"b":2
}}

Is that expected? copy.sgml in 007 doesn't describe the output
in terms of lines so it's hard to tell from the doc.
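
If one-line-per-row output is the goal with such data, a possible
workaround at the SQL level would be to route the values through
jsonb, which discards insignificant whitespace when serializing
(a sketch, assuming the patch emits values with their output
function):

  copy (select f::jsonb from j) to stdout (format json);

which should presumably produce {"f": {"a": 1, "b": 2}} on a single
line.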


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Daniel Verite
Joe Conway wrote:

> The attached should fix the CopyOut response to say one column. I.e. it 
> ought to look something like:

Spending more time with the doc I came to the opinion that in this bit
of the protocol, in CopyOutResponse (B)
...
Int16
The number of columns in the data to be copied (denoted N below).
...

this number must be the number of columns in the source.
That is, for COPY table(a,b,c) the number is 3, independently
of whether the result is formatted in text, csv, json or binary.

I think that changing it for json can reasonably be interpreted
as a protocol break and we should not do it.

The fact that this value does not help parsing the CopyData
messages that come next is not a new issue. A reader that
doesn't know the field separator and whether it's text or csv
cannot parse these messages into fields anyway.
But just knowing how many columns there are in the original
data might be useful by itself and we don't want to break that.


The other question for me is, in the CopyData message, this
bit:
" Messages sent from the backend will always correspond to single data rows"

ISTM that considering that the "[" starting the json array is a
"data row" is a stretch.
That might be interpreted as a protocol break, depending
on how strict the interpretation is.



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Daniel Verite
Andrew Dunstan wrote:

> IMNSHO, we should produce either a single JSON 
> document (the ARRAY case) or a series of JSON documents, one per row 
> (the LINES case).

"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
   by zero or more CopyData messages (always one per row), followed by
   CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
Type: CopyOut response
Length: 13
Format: Text (0)
Columns: 3
Format: Text (0)
PostgreSQL
Type: Copy data
Length: 6
Copy data: 5b0a
PostgreSQL
Type: Copy data
Length: 76
Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" (that is, "[" followed
by a newline) does not qualify as a row of data with 3 columns as
advertised in the CopyOut message. Isn't that a problem?

At least the json non-ARRAY case ("json lines") doesn't have
this issue, since every CopyData message corresponds effectively
to a row in the table.


[1] https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-COPY


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Make COPY format extendable: Extract COPY TO format implementations

2023-12-06 Thread Daniel Verite
Sutou Kouhei wrote:

> * 2022-04: Apache Arrow [2]
> * 2018-02: Apache Avro, Apache Parquet and Apache ORC [3]
> 
> (FYI: I want to add support for Apache Arrow.)
> 
> There were discussions how to add support for more formats. [3][4]
> In these discussions, we got a consensus about making COPY
> format extendable.


These formats all seem to be column-oriented, whereas COPY is row-oriented
at the protocol level [1].
With regard to the protocol, how would it work to support these formats?


[1] https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-COPY


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: EXCLUDE COLLATE in CREATE/ALTER TABLE document

2023-12-01 Thread Daniel Verite
shihao zhong wrote:

> Thanks for your comments, a new version is attached.

In this hunk:

@@ -1097,8 +1097,8 @@ WITH ( MODULUS numeric_literal, REM
   method index_method.
   The operators are required to be commutative.
   Each exclude_element
-  can optionally specify an operator class and/or ordering options;
-  these are described fully under
+  can optionally specify any of the following: a collation, a
+  operator class, or ordering options; these are described fully under
   .
  
 
"a" should be "an" as it's followed by "operator class".

Also the use of "and/or" in the previous version conveys the fact
that operator class and ordering options are not mutually
exclusive. But when using "any of the following" in the new text,
doesn't it lose that meaning?

In case it does, I would suggest the attached diff.
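
For reference, a constraint exercising that combination could look
like this (a sketch; the names are made up, and an operator class
could appear between the collation and the ordering options):

  CREATE TABLE reservations (
    room text,
    EXCLUDE USING btree (room COLLATE "C" DESC NULLS LAST WITH =)
  );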

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 2c4138e4e9..43f5f0873e 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -135,7 +135,7 @@ WITH ( MODULUS numeric_literal, REM
 
 exclude_element in an EXCLUDE constraint is:
 
-{ column_name | ( expression ) } [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ]
+{ column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ]
 
 referential_action in a FOREIGN KEY/REFERENCES constraint is:
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e04a0692c4..277d292aac 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -105,7 +105,7 @@ WITH ( MODULUS numeric_literal, REM
 
 exclude_element in an EXCLUDE constraint is:
 
-{ column_name | ( expression ) } [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ]
+{ column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ]
 
 referential_action in a FOREIGN KEY/REFERENCES constraint is:
 
@@ -1097,9 +1097,8 @@ WITH ( MODULUS numeric_literal, REM
   method index_method.
   The operators are required to be commutative.
   Each exclude_element
-  can optionally specify an operator class and/or ordering options;
-  these are described fully under
-  .
+  can optionally specify a collation, an operator class, and ordering options;
+  these are described fully under .
  
 
  


Re: proposal: change behavior on collation version mismatch

2023-11-28 Thread Daniel Verite
Jeremy Schneider wrote:

> 1) "collation changes are uncommon" (which is relatively correct)
> 2) "most users would rather have ease-of-use than 100% safety, since
> it's uncommon"
> 
> And I think this led to the current behavior of issuing a warning rather
> than an error,

There's a technical reason for this being a warning.
If it was an error, any attempt to do anything with the collation
would fail, which includes REINDEX on indexes  using
that collation.
And yet that's precisely what you're supposed to do in that
situation.
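
That is, the expected sequence after the warning is something like
(a sketch; the collation and index names are made up):

  REINDEX INDEX idx_using_that_collation;
  ALTER COLLATION "fr-x-icu" REFRESH VERSION;

and an error on every use of the collation would prevent the REINDEX
step itself.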


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: [ psql - review request ] review request for \d+ tablename, \d+ indexname indenting

2023-11-22 Thread Daniel Verite
Shlok Kyal wrote:

> > The error was corrected and a new diff file was created.
> > The diff file was created based on 16 RC1.
> > We confirmed that 5 places where errors occurred when performing
> > make check were changed to ok.

Reviewing the patch, I see these two problems in the current version
(File: psql-slashDplus-partition-indentation.diff, Date: 2023-09-19 00:19:34) 

* There are changes in the regression tests that do not concern this
feature and should not be there.

For instance this hunk:

--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -742,8 +742,6 @@ COMMENT ON COLUMN ft1.c1 IS 'ft1.c1';
 Check constraints:
 "ft1_c2_check" CHECK (c2 <> ''::text)
 "ft1_c3_check" CHECK (c3 >= '01-01-1994'::date AND c3 <=
'01-31-1994'::date)
-Not-null constraints:
-"ft1_c1_not_null" NOT NULL "c1"
 Server: s0
 FDW options: (delimiter ',', quote '"', "be quoted" 'value')

It seems to undo a test for a recent feature adding "Not-null
constraints" to \d, which suggests that you've been running tests
against an older version than the source tree you're diffing
against. These should be the same version, and also the latest
one (git HEAD) or as close as possible to the latest when the
patch is submitted.

* The new query with \d on partitioned tables does not work with
  Postgres servers 12 or 13:


postgres=# CREATE TABLE measurement (
    city_id     int not null,
    logdate     date not null,
    peaktemp    int,
    unitsales   int
) PARTITION BY RANGE (logdate);

postgres=# \d measurement 
ERROR:  syntax error at or near "."
LINE 2: ...  0 AS level,c.relkind, false AS i.inhdetach...


Setting the CommitFest status to WoA.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-11-20 Thread Daniel Verite
 Hi,

Here's a new version to improve the performance of FETCH_COUNT
and extend the cases when it can be used.

Patch 0001 adds a new mode in libpq to allow the app to retrieve
larger chunks of results than the single row of the row-by-row mode.
The maximum number of rows per PGresult is set by the user.

Patch 0002 uses that mode in psql and gets rid of the cursor
implementation as suggested upthread.

The performance numbers look good.
For a query retrieving 50M rows of about 200 bytes:
  select repeat('abc', 200) from generate_series(1,50000000)
/usr/bin/time -v psql -At -c $query reports these metrics
(medians of 5 runs):

  version  | fetch_count | clock_time | user_time | sys_time | max_rss_size (kB)
-----------+-------------+------------+-----------+----------+-------------------
 16-stable |           0 |       6.58 |      3.98 |     2.09 |           3446276
 16-stable |         100 |       9.25 |      4.10 |     1.90 |              8768
 16-stable |        1000 |      11.13 |      5.17 |     1.66 |              8904
 17-patch  |           0 |        6.5 |      3.94 |     2.09 |           3442696
 17-patch  |         100 |          5 |      3.56 |     0.93 |              4096
 17-patch  |        1000 |       6.48 |      4.00 |     1.55 |              4344

Interestingly, retrieving in chunks of 100 rows appears to be a bit faster
than the default one big chunk. It means that independently
of using less memory, FETCH_COUNT implemented that way
would be a performance enhancement compared to both
not using it and using it in v16 with the cursor implementation.
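
For reference, FETCH_COUNT being a psql variable, the runs above would
have set it with something like (a sketch):

  psql -v FETCH_COUNT=100 -At -c "$query"

or \set FETCH_COUNT 100 in an interactive session.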


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 766bbe84def2db494f646caeaf29eefeba893c1a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Mon, 20 Nov 2023 17:24:55 +0100
Subject: [PATCH v4 1/2] Implement retrieval of results in chunks with libpq.

This mode is similar to the single-row mode except that chunks
of results contain up to N rows instead of a single row.
It is meant to reduce the overhead of the row-by-row allocations
for large result sets.
The mode is selected with PQsetChunkedRowsMode(int maxRows) and results
have the new status code PGRES_TUPLES_CHUNK.
---
 doc/src/sgml/libpq.sgml  |  96 +++--
 src/bin/pg_amcheck/pg_amcheck.c  |   1 +
 src/interfaces/libpq/exports.txt |   1 +
 src/interfaces/libpq/fe-exec.c   | 118 +--
 src/interfaces/libpq/libpq-fe.h  |   4 +-
 src/interfaces/libpq/libpq-int.h |   7 +-
 6 files changed, 183 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index ed88ac001a..8007bf67d8 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3537,7 +3537,20 @@ ExecStatusType PQresultStatus(const PGresult *res);
 The PGresult contains a single result tuple
 from the current command.  This status occurs only when
 single-row mode has been selected for the query
-(see ).
+(see ).
+   
+  
+ 
+
+ 
+  PGRES_TUPLES_CHUNK
+  
+   
+The PGresult contains several tuples
+from the current command. The count of tuples cannot exceed
+the maximum passed to .
+This status occurs only when the chunked mode has been selected
+for the query (see ).

   
  
@@ -5187,8 +5200,8 @@ PGresult *PQgetResult(PGconn *conn);
   
Another frequently-desired feature that can be obtained with
 and 
-   is retrieving large query results a row at a time.  This is discussed
-   in .
+   is retrieving large query results a limited number of rows at a time.  This is discussed
+   in .
   
 
   
@@ -5551,12 +5564,13 @@ int PQflush(PGconn *conn);
 
 
 
- To enter single-row mode, call PQsetSingleRowMode
- before retrieving results with PQgetResult.
- This mode selection is effective only for the query currently
- being processed. For more information on the use of
- PQsetSingleRowMode,
- refer to .
+ To enter single-row or chunked modes, call
+ respectively PQsetSingleRowMode
+ or PQsetChunkedRowsMode before retrieving results
+ with PQgetResult.  This mode selection is effective
+ only for the query currently being processed. For more information on the
+ use of these functions refer
+ to .
 
 
 
@@ -5895,10 +5909,10 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
   
  
 
- 
-  Retrieving Query Results Row-by-Row
+ 
+  Retrieving Query Results by chunks
 
-  
+  
libpq
single-row mode
   
@@ -5909,13 +5923,15 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
PGresult.  This can be unworkable for commands
that return a large number of rows.  For such cases, applications can use
 and  
in
-   single-row mode.  In this mode, the result row(s) are
-   returned to the application 

Re: Does UCS_BASIC have the right CTYPE?

2023-10-26 Thread Daniel Verite
Peter Eisentraut wrote:

> > That seems to suggest the standard answer should be 'Á' regardless of
> > any COLLATE clause (though I could be misreading). I'm a bit confused
> > by that... what's the standard-compatible way to specify the locale for
> > UPPER()/LOWER()? If there is none, then it makes sense that Postgres
> > overloads the COLLATE clause for that purpose so that users can use a
> > different locale if they want.
> 
> The standard doesn't have the notion of locale-dependent case conversion.

Neither does Unicode, which is why the ICU functions like u_isupper()
or u_toupper() don't take a locale argument.

With libc, isupper_l() and the other ctype functions need a locale
argument, but given a locale's value of
"language[_territory][.codeset]", in theory only the codeset part is
actually useful.

To me the question of what we should put in pg_collation.collctype
for the "ucs_basic" collation leads to another question which is:
why do we even consider collctype in the first place?

Within a database, there's only one "codeset", which corresponds
to pg_database.encoding, and there's a value in pg_database.lc_ctype
that is normally compatible with that encoding.
ISTM that UPPER(string COLLATE "whatever") should always give
the same result as UPPER(string COLLATE pg_catalog.default). And
likewise all functions that depend on character categories could
basically ignore the COLLATE specification, given that our
database-wide properties are sufficient to characterize the strings
within.
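
In other words, the invariant being argued for is (a sketch of the
proposal, not of the current behavior):

  -- under this proposal both sides should be guaranteed equal,
  -- ctype semantics coming from database-wide properties only
  SELECT upper('été' COLLATE "ucs_basic") = upper('été');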


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Pre-proposal: unicode normalized text

2023-10-17 Thread Daniel Verite
Jeff Davis wrote:

> I believe the patch has utility as-is, but I've been brainstorming a
> few more ideas that could build on it:
> 
> * Add a per-database option to enforce only storing assigned unicode
> code points.

There's a problem in the fact that the set of assigned code points is
expanding with every Unicode release, which happens about every year.

If we had this option in Postgres 11, released in 2018, it would use
Unicode 11, and in 2023 this feature would reject thousands of code
points that have been assigned since then.

Aside from that, aborting a transaction because there's an
unassigned code point in a string feels like doing too much,
too late.
The programs that want to filter out unwanted code points
do it before they hit the database, client-side.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: EBCDIC sorting as a use case for ICU rules

2023-08-30 Thread Daniel Verite
Peter Eisentraut wrote:

> Committed with some editing.  I moved the existing rules example from 
> the CREATE COLLATION page into the new section you created, so we have a 
> simple example followed by the complex example.

OK, thanks for pushing this!


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-07-07 Thread Daniel Verite
Tom Lane wrote:

> This gives me several "-Wincompatible-pointer-types" warnings
> [...]
> I think what you probably ought to do to avoid all that is to change
> the arguments of PrintQueryResult and nearby routines to be "const
> PGresult *result" not just "PGresult *result".

The const-ness issue that I ignored in the previous patch is that
while C is fine with passing T* to a function expecting const T*, it's
not okay with passing T** to a function expecting const T**,
or more generally converting T** to const T**.

When callers need to pass arrays of PGresult* instead of const
PGresult*, I've opted to remove the const qualifiers for the functions
that are concerned by this change.


PFA an updated patch.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 5973df2e39..476a9770f0 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -372,6 +372,7 @@ AcceptResult(const PGresult *result, bool show_error)
 		{
 			case PGRES_COMMAND_OK:
 			case PGRES_TUPLES_OK:
+			case PGRES_SINGLE_TUPLE:
 			case PGRES_EMPTY_QUERY:
 			case PGRES_COPY_IN:
 			case PGRES_COPY_OUT:
@@ -695,7 +696,7 @@ PrintNotifications(void)
  * Returns true if successful, false otherwise.
  */
 static bool
-PrintQueryTuples(const PGresult *result, const printQueryOpt *opt,
+PrintQueryTuples(PGresult *result, const printQueryOpt *opt,
  FILE *printQueryFout)
 {
 	bool		ok = true;
@@ -1391,6 +1392,47 @@ DescribeQuery(const char *query, double *elapsed_msec)
 	return OK;
 }
 
+/*
+ * Check if an output stream for \g needs to be opened, and if
+ * yes, open it.
+ * Return false if an error occurred, true otherwise.
+ */
+static bool
+SetupGOutput(PGresult *result, FILE **gfile_fout, bool *is_pipe)
+{
+	ExecStatusType status = PQresultStatus(result);
+	if (pset.gfname != NULL &&			/* there is a \g file or program */
+		*gfile_fout == NULL &&   /* and it's not already opened */
+		(status == PGRES_TUPLES_OK ||
+		 status == PGRES_SINGLE_TUPLE ||
+		 status == PGRES_COPY_OUT))
+	{
+		if (openQueryOutputFile(pset.gfname, gfile_fout, is_pipe))
+		{
+			if (is_pipe)
+			if (*is_pipe)
+		}
+		else
+			return false;
+	}
+	return true;
+}
+
+static void
+CloseGOutput(FILE *gfile_fout, bool is_pipe)
+{
+	/* close \g file if we opened it */
+	if (gfile_fout)
+	{
+		if (is_pipe)
+		{
+			SetShellResultVariables(pclose(gfile_fout));
+			restore_sigpipe_trap();
+		}
+		else
+			fclose(gfile_fout);
+	}
+}
 
 /*
  * ExecQueryAndProcessResults: utility function for use by SendQuery()
@@ -1422,10 +1464,18 @@ ExecQueryAndProcessResults(const char *query,
 	bool		success;
 	instr_time	before,
 after;
+	int fetch_count = pset.fetch_count;
 	PGresult   *result;
+
 	FILE	   *gfile_fout = NULL;
 	bool		gfile_is_pipe = false;
 
+	PGresult   **result_array = NULL; /* to collect results in single row mode */
+	int64		total_tuples = 0;
+	int			ntuples;
+	int			flush_error = 0;
+	bool		is_pager = false;
+
 	if (timing)
 		INSTR_TIME_SET_CURRENT(before);
 	else
@@ -1448,6 +1498,33 @@ ExecQueryAndProcessResults(const char *query,
 		return -1;
 	}
 
+	/*
+	 * If FETCH_COUNT is set and the context allows it, use the single row
+	 * mode to fetch results and have no more than FETCH_COUNT rows in
+	 * memory.
+	 */
+	if (fetch_count > 0 && !pset.crosstab_flag && !pset.gexec_flag && !is_watch
+		&& !pset.gset_prefix && pset.show_all_results)
+	{
+		/*
+		 * The row-by-row fetch is not enabled when SHOW_ALL_RESULTS is false,
+		 * since we would need to accumulate all rows before knowing
+		 * whether they need to be discarded or displayed, which contradicts
+		 * FETCH_COUNT.
+		 */
+		if (!PQsetSingleRowMode(pset.db))
+		{
+			pg_log_warning("fetching results in single row mode is unavailable");
+			fetch_count = 0;
+		}
+		else
+		{
+			result_array = (PGresult**) pg_malloc(fetch_count * sizeof(PGresult*));
+		}
+	}
+	else
+		fetch_count = 0;		/* disable single-row mode */
+
 	/*
 	 * If SIGINT is sent while the query is processing, the interrupt will be
 	 * consumed.  The user's intention, though, is to cancel the entire watch
@@ -1467,6 +1544,8 @@ ExecQueryAndProcessResults(const char *query,
 		ExecStatusType result_status;
 		PGresult   *next_result;
 		bool		last;
+		/* whether the output starts before results are fully fetched */
+		bool		partial_display = false;
 
 		if (!AcceptResult(result, false))
 		{
@@ -1593,6 +1672,94 @@ ExecQueryAndProcessResults(const char *query,
			success &= HandleCopyResult(&result, copy_stream);
 		}
 
+		if (fetch_count > 0 && result_status == PGRES_SINGLE_TUPLE)
+		{
+			FILE	   *tuples_fout = printQueryFout ? printQueryFout : stdout;
+			printQueryOpt my_popt = pset.popt;
+
+			ntuples = 0;
+			total_tuples = 0;
+			partial_display = true;
+
+			success = SetupGOutput(result, &gfile_fout, &gfile_is_pipe);
+			if (gfile_fout)
+tuples_fout = gfile_fout;
+
+			/* initialize print options for partial 

Re: pg_collation.collversion for C.UTF-8

2023-06-21 Thread Daniel Verite
Thomas Munro wrote:

> What could we do that would be helpful here, without affecting users
> of the "true" C.UTF-8 for the rest of time?  This is a Debian (+
> downstream distro) only problem as far as we know so far, and only
> for Debian 11 and older.

It seems to include RedHat-based distros as well.

According to https://bugzilla.redhat.com/show_bug.cgi?id=902094
C.utf8 was added in 2015 and backported down to Fedora 22.
RHEL8 / CentOS 8 / Rocky8 provide glibc 2.28 with a C.utf8
locale. We can reasonably suspect that they've been using the same
kind of patches as Debian before version 12, with not all codepoints
being sorted bytewise.

RHEL9 comes with glibc 2.34 according to distrowatch [1] and the
announcement [2], so presumably it also lacks the "real" C.utf8
with bytewise sorting that glibc 2.35 upstream added.


[1] https://distrowatch.com/table.php?distribution=redhat
[2]
https://developers.redhat.com/articles/2022/05/18/whats-new-red-hat-enterprise-linux-9


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




EBCDIC sorting as a use case for ICU rules

2023-06-21 Thread Daniel Verite
 Hi,

In the "Order changes in PG16 since ICU introduction" discussion, one
sub-thread [1] was about having a credible use case for tailoring collations
with custom rules, a new feature in v16.

At a conference this week I was asked if ICU could be able to
sort like EBCDIC [2]. It turns out it was already asked on
-general a few years ago [3], with no satisfactory answer at the time,
and that it can be implemented with rules in v16.

A collation like the following seems to work (the rule simply enumerates
US-ASCII letters in the EBCDIC alphabet order, with adequate quoting):

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$&'
'<'.'<'<'<'('<'+'<\|<'&'<'!'<'$'<'*'<')'<';'<'-'<'/'<','<'%'<'_'<'>'<'?'<'`'<':'<'#'<'@'<\'<'='<'"'?`:#@'="abcdefghijklmnopqr~stuvwxyz[^]{ABCDEFGHI}JKLMNOPQR\ST
UVWXYZ0123456789
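
Assuming the collation above is created successfully, a quick sanity
check could be (in EBCDIC, lowercase letters sort before uppercase
letters, which sort before digits):

  SELECT c FROM (VALUES ('a'), ('A'), ('0')) AS t(c)
    ORDER BY c COLLATE ebcdic;
  -- expected order: a, A, 0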

Maybe this example could be added to the documentation except for
the problem that the rule is very long and dollar-quoting cannot be split
into several lines. Literals enclosed by single quotes can be split that
way, but would require escaping the single quotes in the rule, which
would lead to scary-looking over-quoted contents.

I'm open to suggestions on whether this EBCDIC example is worth being in the
doc in some form or putting this in the wiki would be good enough.



[1]
https://www.postgresql.org/message-id/flat/a28aba5fa6bf1abfff96e40b6d6acff8412edb15.camel%40j-davis.com

[2] https://en.wikipedia.org/wiki/EBCDIC

[3]
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F84A7AD%40G01JPEXMBYT05


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Order changes in PG16 since ICU introduction

2023-06-12 Thread Daniel Verite
Jeff Davis wrote:

> I guess where I'm confused is: why would a user actually want their
> database collation to be C.UTF-8? It's slower than C, our
> implementation doesn't properly version it (as you pointed out), and
> the semantics don't seem great ('Z' < 'a').

Because when LC_CTYPE=C, characters outside of US ASCII are not
categorized properly. upper/lower/regexp matching/... produce
incorrect results.

> But if they don't specify the provider, isn't it much more likely they
> just don't care much about the locale, and would be happier with C? 

Consider a pre-existing script doing initdb --locale=C.UTF-8.
Surely it does care about the locale, otherwise it would not specify
it.
Assuming that it would be better off with C is assuming that a
non-Unicode aware locale is better than the Unicode-aware locale
they're asking for. I don't think it's reasonable.

> The user can easily get libc behavior by specifying --locale-
> provider=libc, so I don't see how you reached this conclusion.

What would be user hostile is forcing users that don't need an ICU
locale to change their invocations of initdb/createdb to avoid
regressions with v16. Most people would discover this after
it breaks their apps.

> It looks like you are fine with 0003 applying LOCALE to whatever
> provider is chosen, but you'd like to be smarter about choosing the
> provider and to choose libc in at least some cases.
> 
> That is actually very much like option #2 in the list I presented
> here[2], and has the same problems. How should the following behave?
> 
>   initdb --locale=C --lc-collate=fr_FR.utf8
>   initdb --locale=en --lc-collate=fr_FR.utf8

The same as v15.

> If we switch to libc in the first case, then --locale will be ignored
> and the collation will be fr_FR.utf8.

$ initdb --locale=C --lc-collate=fr_FR.utf8
v15 does that:

  The database cluster will be initialized with this locale configuration:
provider:libc
LC_COLLATE:  fr_FR.utf8
LC_CTYPE:C
LC_MESSAGES: C
LC_MONETARY: C
LC_NUMERIC:  C
LC_TIME: C
  The default database encoding has accordingly been set to "SQL_ASCII".

--locale is not ignored, it's overridden for LC_COLLATE only.

> But we will leave the second case as ICU and the collation will be
> "en".

Yes. To me the rule for "ICU is the default" in v16 should be: if the
--locale argument points to a locale that we know ICU does not provide,
we fall back to the v15 behavior down to every detail, otherwise we let
ICU be the provider.

> You also suggested that we consider switching the provider to libc any
> time ICU doesn't support something. I'm not sure whether you meant a
> static list (C, C.UTF-8, POSIX, ...?) or some kind of dynamic test.

C, C.*, POSIX. I'm not sure if there are other cases.

> I'm also not clear whether you think we should abandon the built-in
> provider, or still select it for C/POSIX.

I see it as going in v17, because it came after feature freeze and
is not strictly necessary in v16.



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Order changes in PG16 since ICU introduction

2023-06-09 Thread Daniel Verite
Jeff Davis wrote:

>  I implemented a compromise where initdb will 
>  change C.UTF-8 to the built-in provider

This handling of C.UTF-8 would be felt by users as simply broken.

With the v10 patches:

$ initdb --locale=C.UTF-8

initdb: using locale provider "builtin" for ICU locale "C.UTF-8"
The database cluster will be initialized with this locale configuration:
  default collation provider:  builtin
  default collation locale:C
  LC_COLLATE:  C.UTF-8
  LC_CTYPE:C.UTF-8

This setup is not what the user has asked for and leads to this kind of
wrong result:

$ psql -c "select upper('é')"
 ?column? 
----------
 é

whereas in v15 we would get the correct result 'É'.


Then once inside that cluster, trying to create a database:

  postgres=# create database test locale='C.UTF-8';
  ERROR:  locale provider "builtin" does not support locale "C.UTF-8"
  HINT:  The built-in locale provider only supports the "C" and "POSIX"
locales.


That hardly makes sense considering that initdb stated the opposite,
that the "built-in provider" was adequate for C.UTF-8


In general about the evolution of the patchset, your interpretation
of "defaulting to ICU" seems to be "avoid libc at any cost", which IMV
is unreasonably user-hostile.



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Inconsistent results with libc sorting on Windows

2023-06-09 Thread Daniel Verite
Juan José Santamaría Flecha wrote:

> Just to make sure we are all seeing the same problem, does the attached
> patch fix your test?

The problem of the random changes in sorting disappears for all libc
locales in pg_collation, so this is very promising.

However it persists for the default collation.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Order changes in PG16 since ICU introduction

2023-06-08 Thread Daniel Verite
Jeff Davis wrote:

> As I replied in that subthread, that creates a worse problem: if you
> only change the provider when the locale is C, then what about when the
> locale is *not* C?
> 
>  export LANG=en_US.UTF-8
>  initdb -D data --locale=fr_FR.UTF-8
>  ...
>provider:icu
>ICU locale:  en-US

What you're proposing with the 0003 patch still applies.

In the above case I think we would end up with:

provider=icu
ICU locale=fr-FR
lc_collate=fr_FR.UTF-8
lc_ctype=fr_FR.UTF-8

which is reasonable.


In the following cases we would initialize a libc cluster instead of an
ICU cluster:

- initdb --locale=C
- initdb --locale=POSIX
- LANG=C initdb
- LANG=C.UTF-8 initdb
- LANG=POSIX initdb
- ... possibly other locales that we find are unsuitable for ICU

That is, the rule "ICU by default" really means "ICU unless the locale
that we're being passed or getting from the environment
has semantics that ICU does not provide but we know libc provides,
in which case we fall back to libc".

The user who wants ICU imperatively should invoke
--icu-locale=something or --locale=something --locale-provider=icu
in which case we should not fall back to libc.
We still have to determine lc_collate and lc_ctype either from the
environment or from the locale argument (I think we should
favor the environment), except if the user specifies
--lc-collate=... --lc-ctype=...


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Order changes in PG16 since ICU introduction

2023-06-08 Thread Daniel Verite
Tatsuo Ishii wrote:

> >> Yes it's a special case but when doing initdb --locale=C, a user does
> >> not need or want an ICU locale. They want the same thing than what v15
> >> does with the same arguments: a template0 database with
> >> datlocprovider='c', datcollate='C', datctype='C', dateiculocale=NULL.
> 
> So in this case the only way to keep the same behavior in v16 with "initdb
> --locale=C" (--no-locale) in v15 is, bulding PostgreSQL --without-icu?

AFAIK the --no-locale case in v16 is fixed since:

commit 5cd1a5af4d17496a58678c8eb7ab792119c2d723
Author: Jeff Davis 
Date:   Fri Apr 21 13:11:18 2023 -0700

Fix initdb --no-locale.

Discussion: https://postgr.es/m/878relf7cb@news-spur.riddles.org.uk
Reported-by: Andrew Gierth


The --locale=C case is still being discussed. To me it should
produce the same result as --no-locale and --locale=C in v15, that is,
"ICU is the default" does not apply to that case, but currently
it initializes the cluster with an ICU locale.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Order changes in PG16 since ICU introduction

2023-06-07 Thread Daniel Verite
Jeff Davis wrote:

> The locale "C" is a special case, documented as a non-locale. So, if
> LOCALE/--locale apply to ICU, then either ICU needs to handle locale
> "C" in the expected way (v8 patch series); or when we see locale "C" we
> need to somehow change the provider into something that can handle it
> (v6 patch series changes it to the "none" provider).

Yes it's a special case but when doing initdb --locale=C, a user does
not need or want an ICU locale. They want the same thing as what v15
does with the same arguments: a template0 database with
datlocprovider='c', datcollate='C', datctype='C', dateiculocale=NULL.

The simplest way to obtain that in v16 is to teach initdb that
--locale=C without the --locale-provider option implies that
--locale-provider=libc ([1])

OTOH what you're proposing with the 0001 patch is much more involved
in terms of tweaking the ICU code so that dateiculocale='C' and
datlocprovider='i' becomes a combination that provides the C semantics
that ICU doesn't have natively.

I don't agree with the reasoning that to make progress with
the other uses of --locale, we need to start by either making ICU
support C/POSIX (0001/0002), or add a new "none/builtin" provider
(previous set of patches).
v16 should not need it any more than v15 did, if v16 does the same as
v15 on locale='C', that is, not involve ICU at all.

> Then that enables us to change LOCALE/--locale to apply to ICU, which
> means that a simple command like "initdb --locale=en_US" does a
> sensible thing regardless of the default provider.
>
> I understand you are skeptical of trying to apply an arbitrary locale
> name to ICU, but if they don't specify the provider, what do you expect
> to happen?

It's a hard question because it depends on what people have in their 
locale environment combined with what they try to do.
I think that initdb without any locale option should work well in
the majority of environments, but specifying a locale alone will not work
well in a number of cases, so users might end up concluding that they
need to specify not only the provider but lc_collate/lc_ctype.


[1]
https://www.postgresql.org/message-id/360c90b9-7c20-4cec-aade-38e6e3351...@manitou-mail.org


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: pg_collation.collversion for C.UTF-8

2023-06-07 Thread Daniel Verite
I wrote:

> Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we
> only match ASCII characters 0-9, or 10 codepoints.  With
> "en-US-u-va-posix-x-icu" we match 660 codepoints comprising all the
> digit characters in all languages, plus a bunch of variants for
> mathematical symbols.

BTW this is not specifically a C.UTF-8 versus "en-US-u-va-posix-x-icu"
difference.
I think that any glibc-based locale will consider that \d
in a regexp means [0-9], and that any ICU locale
will make \d match a much larger variety of characters.

While moving to ICU by default, we should expect that 
differences like that will affect apps in a way that might be
more or less disruptive.

Another known difference is that upper() with ICU does not do a
character-by-character conversion, for instance:

WITH words(w) as (values('muß'),('ﬁnal'))
 SELECT
  w,
  length(w),
  upper(w collate "C.utf8") as "upper (libc)",
  length(upper(w collate "C.utf8")),
  upper(w collate "en-x-icu") as "upper (ICU)",
  length(upper(w collate "en-x-icu"))
FROM words;

   w   | length | upper (libc) | length | upper (ICU) | length 
-------+--------+--------------+--------+-------------+--------
 muß   |      3 | MUß          |      3 | MUSS        |      4
 ﬁnal  |      4 | ﬁNAL         |      4 | FINAL       |      5


The fact that the resulting string is larger than the original
might cause problems.
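
For instance, a length-constrained column makes the difference
visible (a sketch, assuming an ICU build):

  CREATE TABLE t (w varchar(3));
  INSERT INTO t VALUES ('muß');       -- fits: 3 characters
  UPDATE t SET w = upper(w COLLATE "en-x-icu");
  -- fails: 'MUSS' is 4 characters, too long for varchar(3)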

In general, we can't abstract from the fact that ICU semantics
are different.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: pg_collation.collversion for C.UTF-8

2023-06-07 Thread Daniel Verite
Jeff Davis wrote:

> What about ICU? How should provider=icu locale=C.UTF-8 behave? We
> could:
> 
> a. Just pass it to the provider and see what happens (older versions of
> ICU would interpret it as en-US-u-va-posix; newer versions would give
> the root locale).
> 
> b. Consistently interpret it as en-US-u-va-posix.
> 
> c. Don't pass it to the provider at all and treat it with memcmp
> semantics.


I think b) and c) are quite problematic.


First, en-US-u-va-posix does not sort like C.UTF-8 in glibc.
For one thing it seems that en-US-u-va-posix assigns zero weights to
some codepoints, which makes it semantically definitely different.
For instance consider ZERO WIDTH SPACE (U+200B):

postgres=# select 'ab' < E'a\u200Ba' COLLATE "C.utf8";
 ?column? 
----------
 t


postgres=# select 'ab' < E'a\u200Ba' COLLATE "en-US-u-va-posix-x-icu";
 ?column? 
----------
 f

Even if ICU folks refer to u-va-posix as approximating POSIX (as in [1]),
for our purpose, either it sorts by codepoints or it does not,
and it clearly does not. One consequence is that 
en-US-u-va-posix-x-icu needs to be versioned and indexes
depending on it need to be rebuilt on upgrades.
OTOH the goal with C.UTF-8, that is achieved in glibc>=2.35,
is to not need that.

Also it's not just about sorting. The semantics for the ctype-kind
functions are also different.

Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we only match
ASCII characters 0-9, or 10 codepoints.
With "en-US-u-va-posix-x-icu" we match 660 codepoints comprising
all the digit characters in all languages, plus a bunch of variants
for mathematical symbols.

For instance consider U+FF10 (Fullwidth Digit Zero):

postgres=# select E'\uff10' collate "C.utf8" ~ '\d';
 ?column? 
----------
 f

postgres=# select E'\uff10' collate "en-US-u-va-posix-x-icu" ~ '\d';
 ?column? 
----------
 t

If someone dumps their C.UTF-8 database to reload into an
ICU/en-US-u-va-posix database, there is no guarantee that it
even reloads, because of semantic differences occurring
in constraints. In general it will surely reload, but the apps
might not behave the same with the new database
in a way that might be problematic.
It's fine if that's what they want and they explicitly ask for this
conversion, but it's not fine if it's postgres that has quietly
decided that for them.


About c) "don't pass it to the operators", it would be doable for
sorting (ignoring the "glibc before 2.35 does not sort like that" issue)
but not for the ctype-kind functions, where postgres' own code
doesn't have the Unicode knowledge.


About a) "just pass it to the provider", that seems better than b) or
c), but still, when a user asks for provider=icu locale=C.UTF-8, 
it's very probably a pilot error.

To me the user would be best served by a warning, if not an error,
informing them that it's quite probably not the combination they want.




[1]
https://sourceforge.net/p/icu/mailman/icu-support/thread/CAN49p6pvQKP93j8LMn3zBWhpk-T0qYD0TCuiHMv6Z3UPGFh3QQ%40mail.gmail.com/#msg35638356

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Inconsistent results with libc sorting on Windows

2023-06-07 Thread Daniel Verite
Thomas Munro wrote:

> > > Also, it does not occur at all if parallel scan is disabled.
> >
> > Could this be a clue that it is failing to be transitive?
> 
> That vaguely rang a bell for me...  and then I remembered this thread:
> 
> https://www.postgresql.org/message-id/flat/20191206063401.GB1629883%40rfd.leadboat.com

Thanks for the pointer, non-transitive comparisons seem a likely cause
indeed.

The parallel scan appears to imply some randomness in the sequence of
comparisons, which makes the problem more visible.
After changing the test to shuffle the rows before each sort,
non-parallel scans also produce outputs that differ, proving that
parallelism is not a root cause.

Running the test with all the libc collations with collencoding in
(-1,6) shows that the only ones not affected are C/POSIX/ucs_basic.
Otherwise the 569 other pre-created libc collations that can be used
with UTF-8 are affected, plus the default collation
(French_France.1252 in my case).
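
For reference, the affected set was enumerated with a query along
these lines (collencoding=-1 meaning "any encoding", 6 being UTF-8):

  SELECT collname FROM pg_collation
   WHERE collprovider = 'c' AND collencoding IN (-1, 6);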


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Order changes in PG16 since ICU introduction

2023-06-06 Thread Daniel Verite
Jeff Davis wrote:

> New patch series attached. I plan to commit 0001 and 0002 soon, unless
> there are objections.
> 
> 0001 causes the "C" and "POSIX" locales to be treated with
> memcmp/pg_ascii semantics in ICU, just like in libc. We also
> considered a new "none" provider, but it's more invasive, and we can
> always reconsider that in the v17 cycle.

FWIW I don't quite see how 0001 improves things or what problem it's
trying to solve.

0001 creates exceptions throughout the code so that when an ICU
collation has a locale name "C" or "POSIX", it does not behave
like an ICU collation, even though pg_collation.collprovider='i'.
To me it's neither desirable nor necessary that a collation that
has collprovider='i' is diverted to non-ICU semantics.

Also in the current state, this diversion does not apply to initdb.

"initdb --icu-locale=C" with 0001 applied reports this:

   Using language tag "en-US-u-va-posix" for ICU locale "C".
   The database cluster will be initialized with this locale configuration:
 provider:icu
 ICU locale:  en-US-u-va-posix
 LC_COLLATE:  fr_FR.UTF-8
 [...]

and "initdb --locale=C" reports this:

   Using default ICU locale "fr_FR".
   Using language tag "fr-FR" for ICU locale "fr_FR".
   The database cluster will be initialized with this locale configuration:
 provider:icu
 ICU locale:  fr-FR
 LC_COLLATE:  C
 [...]

Could you elaborate a bit more on what 0001 is meant to achieve, from
the point of view of the user?


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Inconsistent results with libc sorting on Windows

2023-06-05 Thread Daniel Verite
Hi,

While trying pg16beta1 libc collations on Windows, I noticed that UTF-8
text sometimes sorts differently across invocations with the same
locales, which is wrong since these collations are deterministic.

The OS is Windows 10 Home, version 10.0.19045 Build 19045,
self-built 16beta1 with VS Community 2022, without ICU, default
configuration in postgresql.conf.

It seems to occur more or less randomly with all libc locales except
C/POSIX, with the probability of getting differences being seemingly
much higher when the data gets larger in number of rows and uses
higher codepoints (e.g. if all characters are in [U+0001,U+0400] the
sorts never differ with 40k rows, but they do if there are many more
rows or if the range is [U+0001,U+2000]).

Also, it does not occur at all if parallel scan is disabled.

I've come up with a self-contained script that generates random words
and repeatedly sorts them and feeds them to md5sum. It takes the number of
rows and the highest Unicode codepoint as arguments, and shows when the
checksums differ across consecutive invocations.

Here's a typical run showing how it goes wrong after the 14th sort:

$ bash repro-coll-windows.sh 4 16383
NOTICE:  relation "random_words" already exists, skipping
CREATE TABLE
TRUNCATE TABLE
CREATE FUNCTION
DROP COLLATION
CREATE COLLATION
INSERT 0 4
ANALYZE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
35050d858f4c590788132627e74f62c8 -> e746b626fcc848cbbc67570a7dde03bb
(iter=15)
16 
e746b626fcc848cbbc67570a7dde03bb -> 35050d858f4c590788132627e74f62c8
(iter=16)
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 
35050d858f4c590788132627e74f62c8 -> 6bf38563d1267339122154bd7d4fbfce
(iter=38)
39 
6bf38563d1267339122154bd7d4fbfce -> 35050d858f4c590788132627e74f62c8
(iter=39)
40 41 42 43 44 45 46 47 48 49 50 51 
35050d858f4c590788132627e74f62c8 -> 3d2072698054d0bd57beefea0248b7e6
(iter=51)
52 
3d2072698054d0bd57beefea0248b7e6 -> 35050d858f4c590788132627e74f62c8
(iter=52)
53 54 55 56 57 58 59 ^C

Would anyone be able to reproduce this? That might be a local problem
although there's nothing special installed AFAICS.
Initially I saw this with a larger dataset that I can't share, and the diffs
between outputs showed that only a few lines out of 2 million lines
were getting displaced across sorts.
It also happens on the same OS with Pg15.3 (EDB build) and the default
libc collation, so I would not immediately suspect new code in Pg16.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite


repro-coll-windows.sh
Description: Binary data


Re: pg_collation.collversion for C.UTF-8

2023-06-05 Thread Daniel Verite
Jeff Davis wrote:

> >   For libc: this change may affect any user who happened to have
> > LANG=C.UTF-8 in their environment at initdb time, which is probably a
> > lot of users, and some buildfarm members. However, the average risk
> > seems to be much lower, because we've gone a long time with the
> > assumption that C.UTF-8 has the same behavior as C, and this only
> > recently came up.

Currently, neither lc_collate_is_c() nor lookup_collation_cache()
think that C.UTF-8 is a C collation, since they do that kind of test:

	if (strcmp(localeptr, "C") == 0)
		result = true;
	else if (strcmp(localeptr, "POSIX") == 0)
		result = true;
	else
		result = false;

What is relatively new (v15) is that we compute a version for libc
collations in get_collation_actual_version(), with code that assumes
that C.* does not need a version, implying that it's immune to
Unicode changes. What came up in this thread is that this assumption
is not true for at least one major platform: Debian/Ubuntu for
releases occurring before 2022 (glibc < 2.35).


> We can avoid this risk by converting C.anything or POSIX.anything to
> plain "C" or "POSIX", respectively, for new collations before storing
> the string in the catalog. For upgraded collations, we can preserve the
> existing locale name. When opening the locale, we would still only
> recognize plain "C" and "POSIX" as the C locale.


Then Postgres would not sort the same as the operating system with the
same locale, at least on some OSes. Concerning glibc, after waiting a
few years, glibc<2.35 will be obsolete, and C.UTF-8 sorting like C
will happen by itself.
But in the meantime, personally I don't quite see why Postgres should
start forcing C.UTF-8 to sort differently in the database than in the
OS.
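
For reference, the stored version and the one currently reported by
the OS can be compared with (a sketch; collation names vary across
platforms):

  SELECT collname, collversion,
         pg_collation_actual_version(oid) AS current_version
    FROM pg_collation
   WHERE collname IN ('C', 'POSIX', 'C.utf8', 'C.UTF-8');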


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Simplify pg_collation.collversion for Windows libc

2023-06-05 Thread Daniel Verite
Hi,

Currently the libc collation version for Windows has two components
coming from the NLSVERSIONINFOEX structure [1]:
dwNLSVersion and dwDefinedVersion.

So we get version numbers looking like this (with 16 beta1):

postgres=# select collversion,count(*) from pg_collation group by
collversion;
  collversion  | count 
---------------+-------
               |     5
 1539.5,1539.5 |  1457
(2 rows)

According to [1] the second number is obsolete, and AFAICS we should
expose only the first.


dwDefinedVersion

Defined version. This value is used to track changes in the repertoire
of Unicode code points. The value increments when the Unicode
repertoire is extended, for example, if more characters are defined.

Starting with Windows 8: Deprecated. Use dwNLSVersion instead.


PFA a patch implementing that suggestion.
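
With it applied, the same query should report a single component per
collation, i.e. something like:

 collversion | count 
-------------+-------
             |     5
 1539.5      |  1457
(2 rows)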


[1]
https://learn.microsoft.com/en-us/windows/win32/api/winnls/ns-winnls-nlsversioninfoex


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 31e3b16ae0..3a6b00ab14 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1717,11 +1717,9 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 									collcollate,
 									GetLastError())));
 		}
-		collversion = psprintf("%lu.%lu,%lu.%lu",
+		collversion = psprintf("%lu.%lu",
							   (version.dwNLSVersion >> 8) & 0xFFFF,
-							   version.dwNLSVersion & 0xFF,
-							   (version.dwDefinedVersion >> 8) & 0xFFFF,
-							   version.dwDefinedVersion & 0xFF);
+							   version.dwNLSVersion & 0xFF);
 #endif
 	}
 


Re: Order changes in PG16 since ICU introduction

2023-05-26 Thread Daniel Verite
Jeff Davis wrote:

> > #1
> > 
> > postgres=# create database test1 locale='fr_FR.UTF-8';
> > NOTICE:  using standard form "fr-FR" for ICU locale "fr_FR.UTF-8"
> > ERROR:  new ICU locale (fr-FR) is incompatible with the ICU locale of
> 
> I don't see a problem here. If you specify LOCALE to CREATE DATABASE,
> you should either be using "TEMPLATE template0", or you should be
> expecting an error if the LOCALE doesn't match exactly.
> 
> What would you like to see happen here?

What's odd is that initdb starting in an fr_FR.UTF-8 environment
found that "fr" was the default ICU locale to use, whereas
"create database" reports that "fr" and "fr_FR.UTF-8" refer to
incompatible locales.

To me initdb is wrong when coming up with the less precise "fr"
instead of "fr-FR".

I suggest the attached patch to call uloc_getDefault() instead of the
current code that somehow leaves out the country/region component.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 31156e863b..09a5c98cc0 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2354,42 +2354,13 @@ icu_validate_locale(const char *loc_str)
 }
 
 /*
- * Determine default ICU locale by opening the default collator and reading
- * its locale.
- *
- * NB: The default collator (opened using NULL) is different from the collator
- * for the root locale (opened with "", "und", or "root"). The former depends
- * on the environment (useful at initdb time) and the latter does not.
+ * Determine the default ICU locale
  */
 static char *
 default_icu_locale(void)
 {
 #ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
-	const char *valid_locale;
-	char	   *default_locale;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(NULL, &status);
-	if (U_FAILURE(status))
-		pg_fatal("could not open collator for default locale: %s",
-				 u_errorName(status));
-
-	status = U_ZERO_ERROR;
-	valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
-										&status);
-	if (U_FAILURE(status))
-	{
-		ucol_close(collator);
-		pg_fatal("could not determine default ICU locale");
-	}
-
-	default_locale = pg_strdup(valid_locale);
-
-	ucol_close(collator);
-
-	return default_locale;
+	return pg_strdup(uloc_getDefault());
 #else
 	pg_fatal("ICU is not supported in this build");
 #endif


Re: Should CSV parsing be stricter about mid-field quotes?

2023-05-22 Thread Daniel Verite
Kirk Wolak wrote:

> We do NOT do "CSV", we mimic pg_dump.

pg_dump uses the text format (as opposed to csv), where
\. on a line by itself cannot appear in the data, so there's
no problem. The problem is limited to the csv format.
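
For instance, a backslash in the data is itself escaped on output in
the text format, so a field containing backslash-dot can never produce
\. on a line by itself:

  => COPY (SELECT E'\\.') TO STDOUT;
  \\.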


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Order changes in PG16 since ICU introduction

2023-05-22 Thread Daniel Verite
Jeff Davis wrote:

> If we special case locale=C, but do nothing for locale=fr_FR, then I'm
> not sure we've solved the problem. Andrew Gierth raised the issue here,
> which he called "maximally confusing":
> 
> https://postgr.es/m/874jp9f5jo@news-spur.riddles.org.uk
> 
> That's why I feel that we need to make locale apply to whatever the
> provider is, not just when it happens to be C.

While I agree that the LOCALE option in CREATE DATABASE is
counter-intuitive, I find it questionable that blending ICU
and libc locales into it helps that much with the user experience.

Trying the latest v6-* patches applied on top of 722541ead1
(before the pgindent run), here are a few examples where I
don't think it goes well.

The OS is Ubuntu 22.04 (glibc 2.35, ICU 70.1)

initdb:

  Using default ICU locale "fr".
  Using language tag "fr" for ICU locale "fr".
  The database cluster will be initialized with this locale configuration:
provider:icu
ICU locale:  fr
LC_COLLATE:  fr_FR.UTF-8
LC_CTYPE:fr_FR.UTF-8
LC_MESSAGES: fr_FR.UTF-8
LC_MONETARY: fr_FR.UTF-8
LC_NUMERIC:  fr_FR.UTF-8
LC_TIME: fr_FR.UTF-8
  The default database encoding has accordingly been set to "UTF8".


#1

postgres=# create database test1 locale='fr_FR.UTF-8';
NOTICE:  using standard form "fr-FR" for ICU locale "fr_FR.UTF-8"
ERROR:  new ICU locale (fr-FR) is incompatible with the ICU locale of the
template database (fr)
HINT:  Use the same ICU locale as in the template database, or use template0
as template.


That looks like a fairly generic case that doesn't work seamlessly.


#2

postgres=# create database test2 locale='C.UTF-8' template='template0';
NOTICE:  using standard form "en-US-u-va-posix" for ICU locale "C.UTF-8"
CREATE DATABASE


en-US-u-va-posix does not sort like C.UTF-8 in glibc 2.35, so
this interpretation is arguably not what a user would expect.

I would expect the ICU warning or error (icu_validation_level) to kick
in instead of that transliteration.


#3

$ grep french /etc/locale.alias
french  fr_FR.ISO-8859-1

postgres=# create database test3 locale='french' template='template0'
encoding='LATIN1';
WARNING:  ICU locale "french" has unknown language "french"
HINT:  To disable ICU locale validation, set parameter icu_validation_level
to DISABLED.
CREATE DATABASE


In practice we're probably getting the "und" ICU locale whereas "fr" would
be appropriate.


I assume that we would find more cases like that if testing on many
operating systems.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Should CSV parsing be stricter about mid-field quotes?

2023-05-22 Thread Daniel Verite
Joel Jacobson wrote:

> Is there a valid reason why \. is needed for COPY FROM filename?
> It seems to me it would only be necessary for the COPY FROM STDIN case,
> since files have a natural end-of-file and a known file size.

Looking at CopyReadLineText() over at [1], I don't see a reason why
the unquoted \. could not be handled with COPY FROM file.
Even COPY FROM STDIN looks like it could benefit, so that
\copy from file csv would hopefully not choke or truncate the data.
There's still the case when the CSV data is embedded in a psql script
(psql is unable to know where it ends), but for that, "don't do that"
might be a reasonable answer.


[1]
https://doxygen.postgresql.org/copyfromparse_8c.html#a90201f711221dd82d0c08deedd91e1b3



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Order changes in PG16 since ICU introduction

2023-05-19 Thread Daniel Verite
Jeff Davis wrote:

> 2) Automatically change the provider to libc when locale=C.
> 
> Almost works, but it's not clear how we handle the case "provider=icu
> lc_collate='fr_FR.utf8' locale=C".
> 
> If we change it to "provider=libc lc_collate=C", we've overridden the
> specified lc_collate. If we ignore the locale=C, that would be
> surprising to users. If we throw an error, that would be a backwards
> compatibility issue.

This thread started with a report illustrating that when users mention
the locale "C", they implicitly mean "C" from the libc provider, as
when libc was the default. The problem is that as soon as ICU is the
default, any reference to a libc collation should mention explicitly
that the provider is libc.

It seems we're settled on the idea of creating an exception for "C"
(and I assume also "POSIX") to avoid too much confusion, and because
"C" is quite special anyway, and has no equivalent in ICU (the switch
in v16 to ICU as the default provider is based on the premise that
locales with the same name will behave pretty much the same with ICU
as they did with libc, but that's absolutely not the case with "C").

ISTM that if we want to go that route, we need to make the minimal
changes at the user interface level and not any deeper, so that when
(locale="C" OR locale="POSIX") AND the provider has not been specified,
the commands (initdb and create database) act as if the user had
specified provider=libc.
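
In other words (a sketch of the proposed behavior, with a placeholder
database name), these two commands would become equivalent:

  CREATE DATABASE mydb LOCALE 'C' TEMPLATE template0;
  CREATE DATABASE mydb LOCALE 'C' LOCALE_PROVIDER libc TEMPLATE template0;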


> (3) Support iculocale=C in the ICU provider using the memcmp() path.

> In other words, if provider=icu and iculocale=C, lc_collate_is_c() and
> lc_ctpye_is_c() would both return true.

ICU does not provide a locale that behaves like that, and it doesn't
feel right to pretend it does. It feels like attacking the problem
at the wrong level.

> (4) Create a new "none" provider (which has no locale and always memcmp
> semantics), and automatically change the provider to "none" if
> provider=icu and iculocale=C.

It still uses libc/C for character classification and case changing,
so "no locale" is technically not true. Personally I don't see
the benefit of adding a "none" provider. C is a libc locale
and libc is not disappearing. I also think that when users explicitly
indicate provider=icu, they should get icu.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Should CSV parsing be stricter about mid-field quotes?

2023-05-19 Thread Daniel Verite
Joel Jacobson wrote:

> I understand its necessity for STDIN, given that the end of input needs to
> be explicitly defined.
> However, for files, we have a known file size and the end-of-file can be
> detected without the need for special markers.
> 
> Also, is the difference in how server-side COPY CSV is capable of dealing
> with \. but apparently not the client-side \COPY CSV documented somewhere?

psql implements the client-side "\copy table from file..." with
COPY table FROM STDIN ...

COPY FROM file CSV somewhat differs, as your example shows,
but it still mishandles \. when unquoted. For instance, consider this
file to load with COPY t FROM '/tmp/t.csv' WITH CSV
$ cat /tmp/t.csv
line 1
\.
line 3
line 4

It results in only "line 1" being imported.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Should CSV parsing be stricter about mid-field quotes?

2023-05-18 Thread Daniel Verite
Joel Jacobson wrote:

> I've been using that trick myself many times in the past, but thanks to this
> deep-dive into this topic, it looks to me like TEXT would be a better format
> fit when dealing with unquoted TSV files, or?
> 
> OTOH, one would then need to inspect the TSV file doesn't contain \. on an
> empty line...

Note that this is the case for valid CSV contents, since backslash-dot
on a line by itself is both an end-of-data marker for COPY FROM and a
valid CSV line.
Having this line in the data results in either an error or having the
rest of the data silently discarded, depending on the context. There
is some previous discussion about this in [1].
Since the TEXT format doesn't have this kind of problem, one solution
is to filter the data through PROGRAM with an [untrusted CSV]->TEXT
filter. This is to be preferred over direct CSV loading when
strictness or robustness are more important than convenience.
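
As a sketch, with "csv2text" standing for any external CSV-to-text
filter (a hypothetical command, not something shipped with Postgres):

  COPY t FROM PROGRAM 'csv2text /path/to/data.csv' WITH (FORMAT text);

The filter can then apply whatever strictness is wanted on the CSV
side, and the text format guarantees unambiguous row boundaries on the
Postgres side.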


[1]
https://www.postgresql.org/message-id/10e3eff6-eb04-4b3f-aeb4-b920192b9...@manitou-mail.org

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Order changes in PG16 since ICU introduction

2023-04-27 Thread Daniel Verite
Jeff Davis wrote:

> Attached are a few small patches:
> 
>   0001: don't convert C to en-US-u-va-posix
>   0002: handle locale C the same regardless of the provider, as you
> suggest above
>   0003: make LOCALE (or --locale) apply to everything including ICU

Testing this briefly I noticed two regressions

1) all pg_collation.collversion are empty due to a trivial bug in 0002:

@@ -1650,6 +1686,10 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 {
 	char	   *collversion = NULL;
 
+	if (pg_strcasecmp("C", collcollate) ||
+		pg_strcasecmp("POSIX", collcollate))
+		return NULL;
+

This should be pg_strcasecmp(...) == 0

2) The following works with HEAD (default provider=icu) but errors out with
the patches:

postgres=# create database lat9 locale 'fr_FR@euro' encoding LATIN9 template 'template0';
ERROR:  could not convert locale name "fr_FR@euro" to language tag: U_ILLEGAL_ARGUMENT_ERROR

fr_FR@euro is a libc locale name 

$ locale -a|grep fr_FR
fr_FR
fr_FR@euro
fr_FR.iso88591
fr_FR.iso885915@euro
fr_FR.utf8

I understand that fr_FR@euro is taken as an ICU locale name, the idea
being that since the locale syntax is more or less compatible between
both providers, this should work smoothly.  0003 seems to go further
in the interpretation and fails on it.
TBH the assumption that it's OK to feed libc locale names to ICU feels
quite uncomfortable.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Add standard collation UNICODE

2023-04-27 Thread Daniel Verite
Peter Eisentraut wrote:

>  COLLATE UNICODE
> 
> instead of
> 
>  COLLATE "und-x-icu"
> 
> or whatever it is, is pretty useful.
> 
> So, attached is a small patch to add this.

This collation has an empty pg_collation.collversion column, instead
of being set to the same value as "und-x-icu" to track its version.


postgres=# select * from pg_collation where collname='unicode' \gx
-[ RECORD 1 ]-------+--------
oid                 | 963
collname            | unicode
collnamespace       | 11
collowner           | 10
collprovider        | i
collisdeterministic | t
collencoding        | -1
collcollate         | 
collctype           | 
colliculocale       | und
collicurules        | 
collversion         | 

The original patch implemented this as an INSERT, where this would be
easy to fix I guess, but in current HEAD it comes as an entry in
include/catalog/pg_collation.dat:

{ oid => '963',
  descr => 'sorts using the Unicode Collation Algorithm with default settings',
  collname => 'unicode', collprovider => 'i', collencoding => '-1',
  colliculocale => 'und' },

Should it be converted back into an INSERT or better left
in this file and collversion being updated afterwards?
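
In the latter case, one conceivable way to fill it afterwards (just a
sketch of the idea, not a tested patch) would be to copy the version
from the equivalent ICU collation:

  UPDATE pg_catalog.pg_collation
     SET collversion = (SELECT collversion FROM pg_catalog.pg_collation
                         WHERE collname = 'und-x-icu')
   WHERE collname = 'unicode';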


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Order changes in PG16 since ICU introduction

2023-04-25 Thread Daniel Verite
Jeff Davis wrote:

> > (I'm not sure whether those operations can get redirected to ICU
> > today
> > or whether they still always go to libc, but we'll surely want to fix
> > it eventually if the latter is still true.)
> 
> Those operations do get redirected to ICU today. 

FTR the full text search parser still uses the libc functions
is[w]space/alpha/digit... that depend on lc_ctype, whether the db
collation provider is ICU or not.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: pg_collation.collversion for C.UTF-8

2023-04-22 Thread Daniel Verite
Thomas Munro wrote:

> It looks like for technical reasons
> inside glibc, that couldn't be done before 2.35:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=17318
> 
> That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied
> by the glibc project) isn't supposed to be versioned, but it's
> extremely unfortunate that a bunch of OSes (Debian and maybe more)
> have been sorting text in some other order under that name for
> years.

Yes. This is consistent with Debian/Ubuntu patches in 
glibc/localedata/locales/C

glibc-2.35 is not patched, and upstream has this:
  LC_COLLATE
  % The keyword 'codepoint_collation' in any part of any LC_COLLATE
  % immediately discards all collation information and causes the
  % locale to use strcmp/wcscmp for collation comparison.  This is
  % exactly what is needed for C (ASCII) or C.UTF-8.
  codepoint_collation
  END LC_COLLATE

But in older versions, glibc doesn't have the locales/C data file.
Debian adds it in debian/patches/localedata/C with that kind of
content:

* glibc 2.31, Debian 11
  LC_COLLATE
  order_start forward
  <U0000>
  ..
  <UFFFF>
  <U10000>
  ..
  <U10FFFF>
  etc...

But as explained in the above-linked bugzilla entry, that did not
result in true byte-comparison semantics, for several reasons
that got fixed in 2.35.

So this looks like a solved problem for anyone starting to use these
collations with glibc 2.35 or newer (or other OSes that don't have a
compatibility issue with them in the first place).
But Debian/Ubuntu users upgrading from the older C.* to 2.35+ will not
get the normal warning about the need to reindex.

I understand that my proposal to version C.* like any other collation
might be erring on the side of caution, but ignoring these collation
changes on at least one major OS does not feel right either.
Maybe we should consider doing platform-dependent checks?



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




pg_collation.collversion for C.UTF-8

2023-04-18 Thread Daniel Verite
 Hi,

get_collation_actual_version() in pg_locale.c currently 
excludes C.UTF-8 (and more generally C.*) from versioning,
which makes pg_collation.collversion being empty for these
collations.

char *
get_collation_actual_version(char collprovider, const char *collcollate)
{

	if (collprovider == COLLPROVIDER_LIBC &&
		pg_strcasecmp("C", collcollate) != 0 &&
		pg_strncasecmp("C.", collcollate, 2) != 0 &&
		pg_strcasecmp("POSIX", collcollate) != 0)

This seems to be based on the idea that C.* collations provide an
immutable sort like "C", but it appears that it's not the case.

For instance, consider how these C.UTF-8 comparisons differ between
recent linux systems:

U+1D400 = Mathematical Bold Capital A

Debian 9.13 (glibc 2.24)
=> select 'A' < E'\U0001D400' collate "C.UTF-8";
 ?column? 
----------
 t

Debian 10.13 (glibc 2.28)
=> select 'A' < E'\U0001D400' collate "C.UTF-8";
 ?column? 
----------
 f

Debian 11.6 (glibc 2.31)
=> select 'A' < E'\U0001D400' collate "C.UTF-8";
 ?column? 
----------
 f

Ubuntu 22.04 (glibc 2.35)
=> select 'A' < E'\U0001D400' collate "C.UTF-8";
 ?column? 
----------
 t

So I suggest the attached patch to no longer exclude these collations
from the generic versioning.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 092b620673..dd9c3a0c10 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1727,7 +1727,6 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 #endif
 	if (collprovider == COLLPROVIDER_LIBC &&
 		pg_strcasecmp("C", collcollate) != 0 &&
-		pg_strncasecmp("C.", collcollate, 2) != 0 &&
 		pg_strcasecmp("POSIX", collcollate) != 0)
 	{
 #if defined(__GLIBC__)


Re: TAP tests for psql \g piped into program

2023-03-29 Thread Daniel Verite
Peter Eisentraut wrote:

> So for your patch, I would just do the path adjustment ad hoc in-line. 
> It's just one additional line.

Here's the patch updated that way.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/bin/psql/t/001_basic.pl b/src/bin/psql/t/001_basic.pl
index 64ce012062..c8780d0a52 100644
--- a/src/bin/psql/t/001_basic.pl
+++ b/src/bin/psql/t/001_basic.pl
@@ -367,4 +367,33 @@ psql_fails_like(
qr/incorrect interval value '10e400'/,
'\watch out-of-range interval');
 
+# Test \g output piped into a program.
+# The program is perl -pe '' to simply copy the input to the output.
+my $g_file = "$tempdir/g_file_1.out";
+my $perlbin = $^X;
+$perlbin =~ s!\\!/!g if $PostgreSQL::Test::Utils::windows_os;
+my $pipe_cmd = "$perlbin -pe '' >$g_file";
+
+psql_like($node, "SELECT 'one' \\g | $pipe_cmd", qr//, "one command \\g");
+my $c1 = slurp_file($g_file);
+like($c1, qr/one/);
+
+psql_like($node, "SELECT 'two' \\; SELECT 'three' \\g | $pipe_cmd", qr//, "two commands \\g");
+my $c2 = slurp_file($g_file);
+like($c2, qr/two.*three/s);
+
+
+psql_like($node, "\\set SHOW_ALL_RESULTS 0\nSELECT 'four' \\; SELECT 'five' \\g | $pipe_cmd", qr//,
+	"two commands \\g with only last result");
+my $c3 = slurp_file($g_file);
+like($c3, qr/five/);
+unlike($c3, qr/four/);
+
+psql_like($node, "copy (values ('foo'),('bar')) to stdout \\g | $pipe_cmd",
+ qr//,
+ "copy output passed to \\g pipe");
+my $c4 = slurp_file($g_file);
+like($c4, qr/foo.*bar/s);
+
+
 done_testing();


Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-03-01 Thread Daniel Verite
  I wrote:

> Here's a POC patch implementing row-by-row fetching.

PFA an updated patch.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index f907f5d4e8..ad5e8a5de9 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -372,6 +372,7 @@ AcceptResult(const PGresult *result, bool show_error)
 		{
 			case PGRES_COMMAND_OK:
 			case PGRES_TUPLES_OK:
+			case PGRES_SINGLE_TUPLE:
 			case PGRES_EMPTY_QUERY:
 			case PGRES_COPY_IN:
 			case PGRES_COPY_OUT:
@@ -675,13 +676,13 @@ PrintNotifications(void)
  * Returns true if successful, false otherwise.
  */
 static bool
-PrintQueryTuples(const PGresult *result, const printQueryOpt *opt,
+PrintQueryTuples(const PGresult **result, int nresults, const printQueryOpt *opt,
  FILE *printQueryFout)
 {
 	bool		ok = true;
 	FILE	   *fout = printQueryFout ? printQueryFout : pset.queryFout;
 
-	printQuery(result, opt ? opt : &pset.popt, fout, false, pset.logfile);
+	printQueryChunks(result, nresults, opt ? opt : &pset.popt, fout, false, pset.logfile);
 	fflush(fout);
 	if (ferror(fout))
 	{
@@ -958,7 +959,7 @@ PrintQueryResult(PGresult *result, bool last,
 			else if (last && pset.crosstab_flag)
 success = PrintResultInCrosstab(result);
 			else if (last || pset.show_all_results)
-				success = PrintQueryTuples(result, opt, printQueryFout);
+				success = PrintQueryTuples((const PGresult **) &result, 1, opt, printQueryFout);
 			else
 success = true;
 
@@ -1371,6 +1372,47 @@ DescribeQuery(const char *query, double *elapsed_msec)
 	return OK;
 }
 
+/*
+ * Check if an output stream for \g needs to be opened, and if
+ * yes, open it.
+ * Return false if an error occurred, true otherwise.
+ */
+static bool
+SetupGOutput(PGresult *result, FILE **gfile_fout, bool *is_pipe)
+{
+	ExecStatusType status = PQresultStatus(result);
+	if (pset.gfname != NULL &&			/* there is a \g file or program */
+		*gfile_fout == NULL &&   /* and it's not already opened */
+		(status == PGRES_TUPLES_OK ||
+		 status == PGRES_SINGLE_TUPLE ||
+		 status == PGRES_COPY_OUT))
+	{
+		if (openQueryOutputFile(pset.gfname, gfile_fout, is_pipe))
+		{
+			if (*is_pipe)
+				disable_sigpipe_trap();
+		}
+		else
+			return false;
+	}
+	return true;
+}
+
+static void
+CloseGOutput(FILE *gfile_fout, bool is_pipe)
+{
+	/* close \g file if we opened it */
+	if (gfile_fout)
+	{
+		if (is_pipe)
+		{
+			pclose(gfile_fout);
+			restore_sigpipe_trap();
+		}
+		else
+			fclose(gfile_fout);
+	}
+}
 
 /*
  * ExecQueryAndProcessResults: utility function for use by SendQuery()
@@ -1402,10 +1444,16 @@ ExecQueryAndProcessResults(const char *query,
 	bool		success;
 	instr_time	before,
 after;
+	int fetch_count = pset.fetch_count;
 	PGresult   *result;
+
 	FILE	   *gfile_fout = NULL;
 	bool		gfile_is_pipe = false;
 
+	PGresult   **result_array = NULL; /* to collect results in single row mode */
+	int64		total_tuples = 0;
+	int			ntuples;
+
 	if (timing)
 		INSTR_TIME_SET_CURRENT(before);
 	else
@@ -1428,6 +1476,33 @@ ExecQueryAndProcessResults(const char *query,
 		return -1;
 	}
 
+	/*
+	 * If FETCH_COUNT is set and the context allows it, use the single row
+	 * mode to fetch results and have no more than FETCH_COUNT rows in
+	 * memory.
+	 */
+	if (fetch_count > 0 && !pset.crosstab_flag && !pset.gexec_flag && !is_watch
+		&& !pset.gset_prefix && pset.show_all_results)
+	{
+		/*
+		 * The row-by-row fetch is not enabled when SHOW_ALL_RESULTS is false,
+		 * since we would need to accumulate all rows before knowing
+		 * whether they need to be discarded or displayed, which contradicts
+		 * FETCH_COUNT.
+		 */
+		if (!PQsetSingleRowMode(pset.db))
+		{
+			pg_log_warning("fetching results in single row mode is unavailable");
+			fetch_count = 0;
+		}
+		else
+		{
+			result_array = (PGresult**) pg_malloc(fetch_count * sizeof(PGresult*));
+		}
+	}
+	else
+		fetch_count = 0;		/* disable single-row mode */
+
 	/*
 	 * If SIGINT is sent while the query is processing, the interrupt will be
 	 * consumed.  The user's intention, though, is to cancel the entire watch
@@ -1447,6 +1522,7 @@ ExecQueryAndProcessResults(const char *query,
 		ExecStatusType result_status;
 		PGresult   *next_result;
 		bool		last;
+		bool		partial_display = false;
 
 		if (!AcceptResult(result, false))
 		{
@@ -1573,6 +1649,96 @@ ExecQueryAndProcessResults(const char *query,
 			success &= HandleCopyResult(, copy_stream);
 		}
 
+		if (fetch_count > 0 && result_status == PGRES_SINGLE_TUPLE)
+		{
+			FILE	   *tuples_fout = printQueryFout;
+			printQueryOpt my_popt = pset.popt;
+			bool		is_pager = false;
+			int			flush_error = 0;
+
+			ntuples = 0;
+			total_tuples = 0;
+			partial_display = true;
+
+			success = SetupGOutput(result, &gfile_fout, &gfile_is_pipe);
+			if (gfile_fout)
+				tuples_fout = gfile_fout;
+
+			/* initialize print options for partial table output */
+			my_popt.topt.start_table = true;
+			my_popt.topt.stop_table = false;
+			

Re: Allow tailoring of ICU locales with custom rules

2023-02-20 Thread Daniel Verite
Peter Eisentraut wrote:

[patch v5]

Two quick comments:

- pg_dump support need to be added for CREATE COLLATION / DATABASE

- there doesn't seem to be a way to add rules to template1.
If someone wants to have icu rules and initial contents to their new
databases, I think they need to create a custom template database
(from template0) for that purpose, in addition to template1.
From a usability standpoint, this is a bit cumbersome, as it's
normally the role of template1.
To improve on that, shouldn't initdb be able to create template0 with
rules too?


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: proposal: psql: psql variable BACKEND_PID

2023-02-06 Thread Daniel Verite
I wrote:

> > In the varlistentry, I suggest we add "This variable is unset when the
> > connection is lost." after "but can be changed or unset.
> 
> Personally I'd much rather have BACKEND_PID set to 0 rather than being unset
> when not connected. For one thing it allows safely using \if :BACKEND_PID.

Oops it turns out that was wishful thinking from me.
\if does not interpret a non-zero integer as true, except for the
value "1". 
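
For instance (quick session sketch, the value being arbitrary):

  postgres=# \set pid 42
  postgres=# \if :pid
  unrecognized value "42" for "\if expression": Boolean expected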
I'd still prefer BACKEND_PID being 0 when not connected, though.

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: proposal: psql: psql variable BACKEND_PID

2023-02-06 Thread Daniel Verite
Corey Huinker wrote:

> Manually testing confirms that it works, at least for the connected state. I
> don't actually know how get psql to invoke DISCONNECT, so I killed the dev
> server and can confirm

Maybe something like this could be used, with no external action:

  postgres=# \echo :BACKEND_PID 
  10805
  postgres=# create user tester superuser;
  CREATE ROLE
  postgres=# \c postgres tester
  You are now connected to database "postgres" as user "tester".
  postgres=# alter user tester nosuperuser connection limit 0;
  ALTER ROLE
  postgres=# select pg_terminate_backend(pg_backend_pid());
  FATAL:  terminating connection due to administrator command
  server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.
  The connection to the server was lost. Attempting reset: Failed.
  The connection to the server was lost. Attempting reset: Failed.

  !?> \echo :BACKEND_PID
  :BACKEND_PID


> In the varlistentry, I suggest we add "This variable is unset when the
> connection is lost." after "but can be changed or unset.

Personally I'd much rather have BACKEND_PID set to 0 rather than being unset
when not connected. For one thing it allows safely using \if :BACKEND_PID.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Allow tailoring of ICU locales with custom rules

2023-02-04 Thread Daniel Verite
Laurenz Albe wrote:

> Cool so far.  Now I created a database with that locale:
> 
>  CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone
> LOCALE "de_AT.utf8" TEMPLATE template0;
> 
> Now the rules are not in "pg_database":

The parameter after ICU_LOCALE is passed directly to ICU as a locale
ID, as opposed to referring to a collation name in the current database.
This CREATE DATABASE doesn't fail because ICU accepts pretty much
anything as a locale ID, ignoring what it can't parse instead of
erroring out.

I think the way to express what you want should be:

CREATE DATABASE teutsch 
 LOCALE_PROVIDER icu
 ICU_LOCALE 'de_AT'
 LOCALE 'de_AT.utf8'
 ICU_RULES ' < g';

However it still leaves "daticurules" empty in the destination db,
because of an actual bug in the current patch.

Looking at createdb() in commands.c, it creates this variable:

@@ -711,6 +714,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char	   *dbcollate = NULL;
 	char	   *dbctype = NULL;
 	char	   *dbiculocale = NULL;
+	char	   *dbicurules = NULL;
 	char		dblocprovider = '\0';
 	char	   *canonname;
 	int			encoding = -1;

and then reads it later

@@ -1007,6 +1017,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 		dblocprovider = src_locprovider;
 	if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
 		dbiculocale = src_iculocale;
+	if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU)
+		dbicurules = src_icurules;
 
 	/* Some encodings are client only */
 	if (!PG_VALID_BE_ENCODING(encoding))

but it forgets to assign it in between, so it stays NULL and src_icurules
is taken instead.

> I guess that it is not the fault of this patch that the collation
> isn't there, but I think it is surprising.  What good is a database
> collation that does not exist in the database?

Even if the above invocation of CREATE DATABASE worked as you
intuitively expected, by getting the characteristics from the
user-defined collation for the destination db, it still wouldn't work
to refer to COLLATE "german_phone" in the destination database.
That's because there would be no "german_phone" entry in the
pg_collation of the destination db, as it's cloned from the template
db, which has no reason to have this collation.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-01-12 Thread Daniel Verite
Tom Lane wrote:

> I agree that it seems like a good idea to try.
> There will be more per-row overhead, but the increase in flexibility
> is likely to justify that.

Here's a POC patch implementing row-by-row fetching.

If it wasn't for the per-row overhead, we could probably get rid of
ExecQueryUsingCursor() and use row-by-row fetches whenever
FETCH_COUNT is set, independently of the form of the query.

However the difference in processing time seems to be substantial: on
some quick tests with FETCH_COUNT=1, I'm seeing almost a 1.5x
increase on large datasets. I assume it's the cost of more allocations.
I would have hoped that avoiding the FETCH queries and associated
round-trips with the cursor method would compensate for that, but it
doesn't appear to be the case, at least with a fast local connection.

So in this patch, psql still uses the cursor method if the
query starts with "select", and falls back to the row-by-row in
the main code (ExecQueryAndProcessResults) otherwise.
Anyway it solves the main issue of the over-consumption of memory
for CTE and update/insert queries returning large resultsets.


Best regards,

-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 00627830c4..d3de9d8336 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -372,6 +372,7 @@ AcceptResult(const PGresult *result, bool show_error)
 		{
 			case PGRES_COMMAND_OK:
 			case PGRES_TUPLES_OK:
+			case PGRES_SINGLE_TUPLE:
 			case PGRES_EMPTY_QUERY:
 			case PGRES_COPY_IN:
 			case PGRES_COPY_OUT:
@@ -675,13 +676,13 @@ PrintNotifications(void)
  * Returns true if successful, false otherwise.
  */
 static bool
-PrintQueryTuples(const PGresult *result, const printQueryOpt *opt,
+PrintQueryTuples(const PGresult **result, int nresults, const printQueryOpt *opt,
  FILE *printQueryFout)
 {
 	bool		ok = true;
 	FILE	   *fout = printQueryFout ? printQueryFout : pset.queryFout;
 
-	printQuery(result, opt ? opt : &pset.popt, fout, false, pset.logfile);
+	printQueryChunks(result, nresults, opt ? opt : &pset.popt, fout, false, pset.logfile);
 	fflush(fout);
 	if (ferror(fout))
 	{
@@ -958,7 +959,7 @@ PrintQueryResult(PGresult *result, bool last,
 			else if (last && pset.crosstab_flag)
 success = PrintResultInCrosstab(result);
 			else if (last || pset.show_all_results)
-				success = PrintQueryTuples(result, opt, printQueryFout);
+				success = PrintQueryTuples((const PGresult **) &result, 1, opt, printQueryFout);
 			else
 success = true;
 
@@ -1369,6 +1370,47 @@ DescribeQuery(const char *query, double *elapsed_msec)
 	return OK;
 }
 
+/*
+ * Check if an output stream for \g needs to be opened, and if
+ * yes, open it.
+ * Return false if an error occurred, true otherwise.
+ */
+static bool
+SetupGOutput(PGresult *result, FILE **gfile_fout, bool *is_pipe)
+{
+	ExecStatusType status = PQresultStatus(result);
+	if (pset.gfname != NULL &&			/* there is a \g file or program */
+		*gfile_fout == NULL &&   /* and it's not already opened */
+		(status == PGRES_TUPLES_OK ||
+		 status == PGRES_SINGLE_TUPLE ||
+		 status == PGRES_COPY_OUT))
+	{
+		if (openQueryOutputFile(pset.gfname, gfile_fout, is_pipe))
+		{
+			if (*is_pipe)
+				disable_sigpipe_trap();
+		}
+		else
+			return false;
+	}
+	return true;
+}
+
+static void
+CloseGOutput(FILE *gfile_fout, bool is_pipe)
+{
+	/* close \g file if we opened it */
+	if (gfile_fout)
+	{
+		if (is_pipe)
+		{
+			pclose(gfile_fout);
+			restore_sigpipe_trap();
+		}
+		else
+			fclose(gfile_fout);
+	}
+}
 
 /*
  * ExecQueryAndProcessResults: utility function for use by SendQuery()
@@ -1400,10 +1442,16 @@ ExecQueryAndProcessResults(const char *query,
 	bool		success;
 	instr_time	before,
 after;
+	int fetch_count = pset.fetch_count;
 	PGresult   *result;
+
 	FILE	   *gfile_fout = NULL;
 	bool		gfile_is_pipe = false;
 
+	PGresult   **result_array = NULL; /* to collect results in single row mode */
+	int64		total_tuples = 0;
+	int			ntuples;
+
 	if (timing)
 		INSTR_TIME_SET_CURRENT(before);
 
@@ -1424,6 +1472,33 @@ ExecQueryAndProcessResults(const char *query,
 		return -1;
 	}
 
+	/*
+	 * If FETCH_COUNT is set and the context allows it, use the single row
+	 * mode to fetch results and have no more than FETCH_COUNT rows in
+	 * memory.
+	 */
+	if (fetch_count > 0 && !pset.crosstab_flag && !pset.gexec_flag && !is_watch
+		&& !pset.gset_prefix && pset.show_all_results)
+	{
+		/*
+		 * The row-by-row fetch is not enabled when SHOW_ALL_RESULTS is false,
+		 * since we would need to accumulate all rows before knowing
+		 * whether they need to be discarded or displayed, which contradicts
+		 * FETCH_COUNT.
+		 */
+		if (!PQsetSingleRowMode(pset.db))
+		{
+			pg_log_warning("fetching results in single row mode is unavailable");
+			fetch_count = 0;
+		}
+		else
+		{
+			result_array = (PGresult**) pg_malloc(fetch_count * sizeof(PGresult*));
+		}
+	}
+	else
+		fetch_count = 0;		/* disable single-row mode */

Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-01-04 Thread Daniel Verite
Jakub Wartak wrote:

> It might be a not so well known fact (?) that CTEs are not executed
> with cursor when asked to do so, but instead silently executed with
> potential huge memory allocation going on. Patch is attached. My one
> doubt is that not every statement starting with "WITH" is WITH(..)
> SELECT of course.

Yes, that's why WITH queries are currently filtered out by the
FETCH_COUNT feature.

Case in point:

test=> begin;
BEGIN

test=> create table tbl(i int);
CREATE TABLE

test=> declare psql_cursor cursor for
 with r(i) as (values (1))
 insert into tbl(i) select i from r;
ERROR:  syntax error at or near "insert"
LINE 3: insert into tbl(i) select i from r;


So the fix you're proposing would fail on that kind of queries.

A solution would be for psql to use PQsetSingleRowMode() to retrieve
results row-by-row, as opposed to using a cursor, and then allocate
memory for only FETCH_COUNT rows at a time. Incidentally it solves
other problems, like queries containing multiple statements, which
also fail to work properly with cursors, or UPDATE/INSERT ...
RETURNING on a large number of rows, which could also benefit from
pagination in memory.
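
As an illustration (sketch only, with a made-up table and sizes), each
of these would then need at most FETCH_COUNT rows of client-side
memory at a time:

  \set FETCH_COUNT 100
  WITH s AS (SELECT generate_series(1, 10000000) AS i)
    SELECT i FROM s;
  UPDATE big_table SET x = x RETURNING x;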


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




TAP tests for psql \g piped into program

2023-01-02 Thread Daniel Verite
Hi,

This is a follow-up to commit d2a44904 from the 2022-11 CF [1].
The TAP tests were left out with the suggestion to use Perl instead of
cat (Unix) / findstr (Windows) as the program to pipe into.

PFA a patch implementing that suggestion.


[1] https://commitfest.postgresql.org/40/4000/

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/bin/psql/t/001_basic.pl b/src/bin/psql/t/001_basic.pl
index f447845717..9b908171e0 100644
--- a/src/bin/psql/t/001_basic.pl
+++ b/src/bin/psql/t/001_basic.pl
@@ -325,4 +325,31 @@ is($row_count, '10',
 	'client-side error commits transaction, no ON_ERROR_STOP and multiple -c switches'
 );
 
+# Test \g output piped into a program.
+# The program is perl -pe '' to simply copy the input to the output.
+my $g_file = "$tempdir/g_file_1.out";
+my $perlbin = PostgreSQL::Test::Utils::perl_binary();
+my $pipe_cmd = "$perlbin -pe '' >$g_file";
+
+psql_like($node, "SELECT 'one' \\g | $pipe_cmd", qr//, "one command \\g");
+my $c1 = slurp_file($g_file);
+like($c1, qr/one/);
+
+psql_like($node, "SELECT 'two' \\; SELECT 'three' \\g | $pipe_cmd", qr//, "two commands \\g");
+my $c2 = slurp_file($g_file);
+like($c2, qr/two.*three/s);
+
+
+psql_like($node, "\\set SHOW_ALL_RESULTS 0\nSELECT 'four' \\; SELECT 'five' \\g | $pipe_cmd", qr//,
+	"two commands \\g with only last result");
+my $c3 = slurp_file($g_file);
+like($c3, qr/five/);
+unlike($c3, qr/four/);
+
+psql_like($node, "copy (values ('foo'),('bar')) to stdout \\g | $pipe_cmd",
+ qr//,
+ "copy output passed to \\g pipe");
+my $c4 = slurp_file($g_file);
+like($c4, qr/foo.*bar/s);
+
 done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index b139190cc8..4e97c49dca 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -74,6 +74,7 @@ our @EXPORT = qw(
   run_log
   run_command
   pump_until
+  perl_binary
 
   command_ok
   command_fails
@@ -448,6 +449,23 @@ sub pump_until
 
 =pod
 
+=item perl_binary()
+
+Return the location of the currently running Perl interpreter.
+
+=cut
+
+sub perl_binary
+{
+   # Note: use of forward slashes here avoids any escaping problems
+   # that arise from use of backslashes.
+   my $perlbin = $^X;
+   $perlbin =~ s!\\!/!g if $windows_os;
+   return $perlbin;
+}
+
+=pod
+
 =item generate_ascii_string(from_char, to_char)
 
 Generate a string made of the given range of ASCII characters.
diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index fd821242e8..9b04175ef2 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -25,12 +25,11 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 # get there.
 $node_primary->init(allows_streaming => 1, has_archiving => 1);
 
+my $perlbin = PostgreSQL::Test::Utils::perl_binary();
+my $archivedir_primary = $node_primary->archive_dir;
 # Note: consistent use of forward slashes here avoids any escaping problems
 # that arise from use of backslashes. That means we need to double-quote all
 # the paths in the archive_command
-my $perlbin = $^X;
-$perlbin =~ s!\\!/!g if $PostgreSQL::Test::Utils::windows_os;
-my $archivedir_primary = $node_primary->archive_dir;
 $archivedir_primary =~ s!\\!/!g if $PostgreSQL::Test::Utils::windows_os;
 $node_primary->append_conf(
'postgresql.conf', qq(


Re: Tests for psql \g and \o

2022-11-30 Thread Daniel Verite
Michael Paquier wrote:

> Thanks, the tests part of the main regression test suite look good to
> me, so I have applied them after fixing a few typos and tweaking the
> style of the test.

Thanks!

> Regarding the tests with pipes, I had cold feet with the
> dependencies on cat for non-WIN32 or findstr for WIN32.

OK. If the issue is that these programs might be missing, I guess
we could check that beforehand with IPC::Run and skip the
corresponding psql tests if they're not available or not working
as expected.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Tests for psql \g and \o

2022-11-23 Thread Daniel Verite
Michael Paquier wrote:

> +psql_like($node, "SELECT 'one' \\g | cat >$g_file", qr//, "one command
> \\g");
> +my $c1 = slurp_file($g_file);
> +like($c1, qr/one/);
> 
> Windows may not have an equivalent for "cat", no?  Note that psql's
> 001_basic.pl has no restriction in place for Windows.  Perhaps you
> could use the same trick as basebackup_to_shell, where GZIP is used to
> write some arbitrary data..  Anyway, this has some quoting issues
> especially if the file's path has whitespaces?  This is located in
> File::Temp::tempdir, still it does not sound like a good thing to rely
> on this assumption on portability grounds.

PFA a new patch addressing these issues.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/bin/psql/t/001_basic.pl b/src/bin/psql/t/001_basic.pl
index f447845717..6fcbdeff7d 100644
--- a/src/bin/psql/t/001_basic.pl
+++ b/src/bin/psql/t/001_basic.pl
@@ -325,4 +325,24 @@ is($row_count, '10',
 	'client-side error commits transaction, no ON_ERROR_STOP and multiple -c switches'
 );
 
+# Test \g output piped into a program
+my $g_file = "$tempdir/g_file_1.out";
+# on Windows, findstr "^" with any text on stdin will copy it to stdout
+my $pipe_cmd = $windows_os ? "findstr \"^\" > \"$g_file\"" : "cat > \"$g_file\"";
+
+psql_like($node, "SELECT 'one' \\g | $pipe_cmd", qr//, "one command \\g");
+my $c1 = slurp_file($g_file);
+like($c1, qr/one/);
+
+psql_like($node, "SELECT 'two' \\; SELECT 'three' \\g | $pipe_cmd", qr//, "two commands \\g");
+my $c2 = slurp_file($g_file);
+like($c2, qr/two.*three/s);
+
+
+psql_like($node, "\\set SHOW_ALL_RESULTS 0\nSELECT 'four' \\; SELECT 'five' \\g | $pipe_cmd", qr//,
+	"two commands \\g with only last result");
+my $c3 = slurp_file($g_file);
+like($c3, qr/five/);
+unlike($c3, qr/four/);
+
 done_testing();
diff --git a/src/test/regress/expected/psql.out 
b/src/test/regress/expected/psql.out
index 5bdae290dc..29bc07c1b2 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5433,6 +5433,136 @@ CONTEXT:  PL/pgSQL function warn(text) line 2 at RAISE
 \set SHOW_ALL_RESULTS on
 DROP FUNCTION warn(TEXT);
 --
+-- \g file
+--
+\getenv abs_builddir PG_ABS_BUILDDIR
+\set outfile1 :abs_builddir '/results/psql-output1'
+-- this table is used to load back the output data from files
+CREATE TEMPORARY TABLE reload_output(
+ lineno int not null generated always as identity,
+ line text
+);
+SELECT 1 AS a \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+SELECT 2 AS b\; SELECT 3 AS c\; SELECT 4 AS d \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+COPY (select 'foo') to stdout \; COPY (select 'bar') to stdout \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+SELECT line FROM reload_output ORDER BY lineno;
+  line   
+---------
+  a 
+ ---
+  1
+ (1 row)
+ 
+  b 
+ ---
+  2
+ (1 row)
+ 
+  c 
+ ---
+  3
+ (1 row)
+ 
+  d 
+ ---
+  4
+ (1 row)
+ 
+ foo
+ bar
+(22 rows)
+
+TRUNCATE TABLE reload_output;
+--
+-- \o file
+--
+\set outfile2 :abs_builddir '/results/psql-output2'
+\o :outfile2
+select max(unique1) from onek;
+SELECT 1 AS a\; SELECT 2 AS b\; SELECT 3 AS c;
+-- COPY TO file
+-- the data goes into :outfile1 and the command status into :outfile2
+\set QUIET false
+COPY (select unique1 from onek order by unique1 limit 10) TO :'outfile1';
+-- DML command status
+UPDATE onek SET unique1=unique1 WHERE false;
+\set QUIET true
+\o
+COPY reload_output(line) FROM :'outfile1';
+SELECT line FROM reload_output ORDER BY lineno;
+ line 
+------
+ 0
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+(10 rows)
+
+TRUNCATE TABLE reload_output;
+COPY reload_output(line) FROM :'outfile2';
+SELECT line FROM reload_output ORDER BY lineno;
+   line   
+----------
+  max 
+ -
+  999
+ (1 row)
+ 
+  a 
+ ---
+  1
+ (1 row)
+ 
+  b 
+ ---
+  2
+ (1 row)
+ 
+  c 
+ ---
+  3
+ (1 row)
+ 
+ COPY 10
+ UPDATE 0
+(22 rows)
+
+TRUNCATE TABLE reload_output;
+\o :outfile2
+-- multiple COPY TO stdout
+-- the data go into :outfile2 and the status is not output
+COPY (select 'foo1') to stdout \; COPY (select 'bar1') to stdout;
+-- combine \o and \g file with multiple COPY queries
+COPY (select 'foo2') to stdout \; COPY (select 'bar2') to stdout \g :outfile1
+\o
+COPY reload_output(line) FROM :'outfile1';
+SELECT line FROM reload_output ORDER BY lineno;
+ line 
+------
+ foo2
+ bar2
+(2 rows)
+
+TRUNCATE TABLE reload_output;
+COPY reload_output(line) FROM :'outfile2';
+SELECT line FROM reload_output ORDER BY lineno;
+ line 
+------
+ foo1
+ bar1
+(2 rows)
+
+TRUNCATE TABLE reload_output;
+--
 -- AUTOCOMMIT and combined queries
 --
 \set AUTOCOMMIT off
diff --git a/src/test/regress/sql/psql.sql b/src/test/regress/sql/psql.sql
index 8732017e51..ab1f71aa77 100644
--- a/src/test/regress/sql/psql.sql
+++ b/src/test/regress/sql/psql.sql
@@ -1379,6 +1379,72 @@ SELECT 1 AS one \; SELECT warn('1.5') \; SELECT 2 AS two 
;
 \set SHOW_ALL_RESULTS on
 DROP FUNCTION warn(TEXT);
 
+--
+-- \g file

Re: psql: Add command to use extended query protocol

2022-11-14 Thread Daniel Verite
Peter Eisentraut wrote:

> > I assume that we may sometimes want to use the
> > extended protocol on all queries of a script, like
> > pgbench does with --protocol=extended.
> 
> But is there an actual use case for this in psql?  In pgbench, there are 
> scenarios where you want to test aspects of prepared statements, plan 
> caching, and so on.  Is there something like that for psql?

If we set aside "exercising the protocol" as not an interesting use case
for psql, then no, I can't think of any benefit.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql: Add command to use extended query protocol

2022-11-09 Thread Daniel Verite
Peter Eisentraut wrote:

> Is there a use case for a global setting?

I assume that we may sometimes want to use the
extended protocol on all queries of a script, like
pgbench does with --protocol=extended.
Outside of psql, it's too complicated to parse a SQL script to
replace the end-of-query semicolons with \gp, whereas
a psql setting solves this effortlessly.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql: Add command to use extended query protocol

2022-11-08 Thread Daniel Verite
David G. Johnston wrote:

> I would keep the \gp meta-command to force extended mode regardless
> of whether the query itself requires it.

+1

> A pset variable to control the default seems reasonable as well.
> The implication would be that if you set that pset variable there is
> no way to have individual commands use simple query mode directly.

+1 except that it would be a \set variable for consistency with the
other execution-controlling variables. \pset variables control only
the display.

BTW if we wanted to auto-detect that a query requires binding or the
extended query protocol, we need to keep in mind that for instance
"PREPARE stmt AS SELECT $1" must pass without binding, with both the
simple and the extended query protocol.
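
That is, a sequence like this must keep working unchanged over either
protocol, with $1 resolved at EXECUTE time rather than bound by the
client (typed form shown to avoid the parameter-type ambiguity):

  PREPARE stmt(int) AS SELECT $1;
  EXECUTE stmt(42);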


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: psql: Add command to use extended query protocol

2022-11-02 Thread Daniel Verite
Jehan-Guillaume de Rorthais wrote:

> As I wrote in my TCE review, would it be possible to use psql vars to set
> some
> named parameters for the prepared query? This would looks like:
> 
>  \set p1 foo
>  \set p2 bar
>  SELECT :'p1', :'p2' \gp

As I understand the feature, variables would be passed like this:

\set var1 'foo bar'
\set var2 'baz''qux'

select $1, $2 \gp :var1 :var2

 ?column? | ?column? 
----------+----------
 foo bar  | baz'qux

It appears to work fine with the current patch.

This is consistent with the fact that PQexecParams passes $N
parameters out of the SQL query (versus injecting them in the text of
the query), which is also why no quoting is needed.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: [patch] \g with multiple result sets and \watch with copy queries

2022-11-01 Thread Daniel Verite
Corey Huinker wrote:

> I think that because it is more complicated than usual psql, we may want to
> comment on the intention of the tests and some of the less-than-common psql
> elements (\set concatenation, resetting \o, etc). If you see value in that
> I can amend the patch.

If the intentions of some tests appear to be unclear, then yes, sure.
I don't feel the need to explain the "how" though. The other
comments in these files say why we're testing such or such case, but
don't go beyond that.

> Are there any options on COPY (header, formats) that we think we should
> test as well?

There are COPY tests already in src/test/regress/sql/copy*.sql, which
hopefully cover the many combinations of options.

For \g and \o the intention behind the tests is to check that the
query output goes where it should in all cases. The options that can't
affect where the results go are not really in scope.

FTR I started a followup thread on this at [1], to be associated to a
new CF entry [2]

[1]
https://www.postgresql.org/message-id/flat/25c2bb5b-9012-40f8-8088-774cb764046d%40manitou-mail.org

[2] https://commitfest.postgresql.org/40/4000/


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Tests for psql \g and \o

2022-11-01 Thread Daniel Verite
Hi,

Here's a patch adding regression tests for \g and \o, and TAP tests
for \g | program,

It's a follow-up to the discussion at [1]. Since this discussion
already has a slot in the CF [2] with a committed patch, let's start a
new separate thread.

[1]
https://www.postgresql.org/message-id/4333844c-2244-4d6e-a49a-1d483fbe3...@manitou-mail.org

[2]  https://commitfest.postgresql.org/40/3923/


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/bin/psql/t/001_basic.pl b/src/bin/psql/t/001_basic.pl
index f447845717..2621f55a17 100644
--- a/src/bin/psql/t/001_basic.pl
+++ b/src/bin/psql/t/001_basic.pl
@@ -325,4 +325,24 @@ is($row_count, '10',
 	'client-side error commits transaction, no ON_ERROR_STOP and multiple -c switches'
 );
 
+# Test \g output piped into a program
+my $g_file = "$tempdir/g_file_1.out";
+# on Windows, findstr "^" with any text on stdin will copy it to stdout
+my $pipe_cmd = $windows_os ? "findstr \"^\" > $g_file" : "cat > $g_file";
+
+psql_like($node, "SELECT 'one' \\g | cat >$g_file", qr//, "one command \\g");
+my $c1 = slurp_file($g_file);
+like($c1, qr/one/);
+
+psql_like($node, "SELECT 'two' \\; SELECT 'three' \\g | cat >$g_file", qr//, "two commands \\g");
+my $c2 = slurp_file($g_file);
+like($c2, qr/two.*three/s);
+
+
+psql_like($node, "\\set SHOW_ALL_RESULTS 0\nSELECT 'four' \\; SELECT 'five' \\g | cat >$g_file", qr//,
+	"two commands \\g with only last result");
+my $c3 = slurp_file($g_file);
+like($c3, qr/five/);
+unlike($c3, qr/four/);
+
 done_testing();
diff --git a/src/test/regress/expected/psql.out 
b/src/test/regress/expected/psql.out
index a7f5700edc..90beedde58 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5402,6 +5402,136 @@ CONTEXT:  PL/pgSQL function warn(text) line 2 at RAISE
 \set SHOW_ALL_RESULTS on
 DROP FUNCTION warn(TEXT);
 --
+-- \g file
+--
+\getenv abs_builddir PG_ABS_BUILDDIR
+\set outfile1 :abs_builddir '/results/psql-output1'
+-- this table is used to load back the output data from files
+CREATE TEMPORARY TABLE reload_output(
+ lineno int not null generated always as identity,
+ line text
+);
+SELECT 1 AS a \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+SELECT 2 AS b\; SELECT 3 AS c\; SELECT 4 AS d \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+COPY (select 'foo') to stdout \; COPY (select 'bar') to stdout \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+SELECT line FROM reload_output ORDER BY lineno;
+  line   
+---------
+  a 
+ ---
+  1
+ (1 row)
+ 
+  b 
+ ---
+  2
+ (1 row)
+ 
+  c 
+ ---
+  3
+ (1 row)
+ 
+  d 
+ ---
+  4
+ (1 row)
+ 
+ foo
+ bar
+(22 rows)
+
+TRUNCATE TABLE reload_output;
+--
+-- \o file
+--
+\set outfile2 :abs_builddir '/results/psql-output2'
+\o :outfile2
+select max(unique1) from onek;
+SELECT 1 AS a\; SELECT 2 AS b\; SELECT 3 AS c;
+-- COPY TO file
+-- the data goes into :outfile1 and the command status into :outfile2
+\set QUIET false
+COPY (select unique1 from onek order by unique1 limit 10) TO :'outfile1';
+-- DML command status
+UPDATE onek SET unique1=unique1 WHERE false;
+\set QUIET true
+\o
+COPY reload_output(line) FROM :'outfile1';
+SELECT line FROM reload_output ORDER BY lineno;
+ line 
+------
+ 0
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+(10 rows)
+
+TRUNCATE TABLE reload_output;
+COPY reload_output(line) FROM :'outfile2';
+SELECT line FROM reload_output ORDER BY lineno;
+   line   
+----------
+  max 
+ -
+  999
+ (1 row)
+ 
+  a 
+ ---
+  1
+ (1 row)
+ 
+  b 
+ ---
+  2
+ (1 row)
+ 
+  c 
+ ---
+  3
+ (1 row)
+ 
+ COPY 10
+ UPDATE 0
+(22 rows)
+
+TRUNCATE TABLE reload_output;
+\o :outfile2
+-- multiple COPY TO stdout
+-- the data go into :outfile2 and the status is not output
+COPY (select 'foo1') to stdout \; COPY (select 'bar1') to stdout;
+-- combine \o and \g file with multiple COPY queries
+COPY (select 'foo2') to stdout \; COPY (select 'bar2') to stdout \g :outfile1
+\o
+COPY reload_output(line) FROM :'outfile1';
+SELECT line FROM reload_output ORDER BY lineno;
+ line 
+------
+ foo2
+ bar2
+(2 rows)
+
+TRUNCATE TABLE reload_output;
+COPY reload_output(line) FROM :'outfile2';
+SELECT line FROM reload_output ORDER BY lineno;
+ line 
+------
+ foo1
+ bar1
+(2 rows)
+
+TRUNCATE TABLE reload_output;
+--
 -- AUTOCOMMIT and combined queries
 --
 \set AUTOCOMMIT off
diff --git a/src/test/regress/sql/psql.sql b/src/test/regress/sql/psql.sql
index 1149c6a839..c3ae21633d 100644
--- a/src/test/regress/sql/psql.sql
+++ b/src/test/regress/sql/psql.sql
@@ -1365,6 +1365,72 @@ SELECT 1 AS one \; SELECT warn('1.5') \; SELECT 2 AS two 
;
 \set SHOW_ALL_RESULTS on
 DROP FUNCTION warn(TEXT);
 
+--
+-- \g file
+--
+\getenv abs_builddir PG_ABS_BUILDDIR
+\set outfile1 :abs_builddir '/results/psql-output1'
+
+-- this table is used to load back the output data from files
+CREATE TEMPORARY TABLE reload_output(
+ lineno int not null generated always as identity,
+ line text
+);
+
+SELECT 1 AS a \g 

Re: [patch] \g with multiple result sets and \watch with copy queries

2022-10-07 Thread Daniel Verite
Tom Lane wrote:

> > Currently, test/regress/sql/psql.sql doesn't AFAICS write anything
> > outside of stdout, but \g, \o, \copy need  to write to external
> > files to be tested properly.
> 
> Yeah, I don't think we can usefully test these in psql.sql, because
> file-system side effects are bad in that context.  But maybe a TAP
> test could cope?

I've come up with the attached using psql.sql only, at least for
\g and \o writing to files.
This is a bit more complicated than the usual tests, but not
that much.
Any opinions on this?


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/test/regress/expected/psql.out 
b/src/test/regress/expected/psql.out
index a7f5700edc..90beedde58 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5402,6 +5402,136 @@ CONTEXT:  PL/pgSQL function warn(text) line 2 at RAISE
 \set SHOW_ALL_RESULTS on
 DROP FUNCTION warn(TEXT);
 --
+-- \g file
+--
+\getenv abs_builddir PG_ABS_BUILDDIR
+\set outfile1 :abs_builddir '/results/psql-output1'
+-- this table is used to load back the output data from files
+CREATE TEMPORARY TABLE reload_output(
+ lineno int not null generated always as identity,
+ line text
+);
+SELECT 1 AS a \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+SELECT 2 AS b\; SELECT 3 AS c\; SELECT 4 AS d \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+COPY (select 'foo') to stdout \; COPY (select 'bar') to stdout \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+SELECT line FROM reload_output ORDER BY lineno;
+  line   
+-
+  a 
+ ---
+  1
+ (1 row)
+ 
+  b 
+ ---
+  2
+ (1 row)
+ 
+  c 
+ ---
+  3
+ (1 row)
+ 
+  d 
+ ---
+  4
+ (1 row)
+ 
+ foo
+ bar
+(22 rows)
+
+TRUNCATE TABLE reload_output;
+--
+-- \o file
+--
+\set outfile2 :abs_builddir '/results/psql-output2'
+\o :outfile2
+select max(unique1) from onek;
+SELECT 1 AS a\; SELECT 2 AS b\; SELECT 3 AS c;
+-- COPY TO file
+-- the data goes into :outfile1 and the command status into :outfile2
+\set QUIET false
+COPY (select unique1 from onek order by unique1 limit 10) TO :'outfile1';
+-- DML command status
+UPDATE onek SET unique1=unique1 WHERE false;
+\set QUIET true
+\o
+COPY reload_output(line) FROM :'outfile1';
+SELECT line FROM reload_output ORDER BY lineno;
+ line 
+--
+ 0
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+(10 rows)
+
+TRUNCATE TABLE reload_output;
+COPY reload_output(line) FROM :'outfile2';
+SELECT line FROM reload_output ORDER BY lineno;
+   line   
+--
+  max 
+ -
+  999
+ (1 row)
+ 
+  a 
+ ---
+  1
+ (1 row)
+ 
+  b 
+ ---
+  2
+ (1 row)
+ 
+  c 
+ ---
+  3
+ (1 row)
+ 
+ COPY 10
+ UPDATE 0
+(22 rows)
+
+TRUNCATE TABLE reload_output;
+\o :outfile2
+-- multiple COPY TO stdout
+-- the data go into :outfile2 and the status is not output
+COPY (select 'foo1') to stdout \; COPY (select 'bar1') to stdout;
+-- combine \o and \g file with multiple COPY queries
+COPY (select 'foo2') to stdout \; COPY (select 'bar2') to stdout \g :outfile1
+\o
+COPY reload_output(line) FROM :'outfile1';
+SELECT line FROM reload_output ORDER BY lineno;
+ line 
+--
+ foo2
+ bar2
+(2 rows)
+
+TRUNCATE TABLE reload_output;
+COPY reload_output(line) FROM :'outfile2';
+SELECT line FROM reload_output ORDER BY lineno;
+ line 
+--
+ foo1
+ bar1
+(2 rows)
+
+TRUNCATE TABLE reload_output;
+--
 -- AUTOCOMMIT and combined queries
 --
 \set AUTOCOMMIT off
diff --git a/src/test/regress/sql/psql.sql b/src/test/regress/sql/psql.sql
index 1149c6a839..c3ae21633d 100644
--- a/src/test/regress/sql/psql.sql
+++ b/src/test/regress/sql/psql.sql
@@ -1365,6 +1365,72 @@ SELECT 1 AS one \; SELECT warn('1.5') \; SELECT 2 AS two ;
 \set SHOW_ALL_RESULTS on
 DROP FUNCTION warn(TEXT);
 
+--
+-- \g file
+--
+\getenv abs_builddir PG_ABS_BUILDDIR
+\set outfile1 :abs_builddir '/results/psql-output1'
+
+-- this table is used to load back the output data from files
+CREATE TEMPORARY TABLE reload_output(
+ lineno int not null generated always as identity,
+ line text
+);
+
+SELECT 1 AS a \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+
+SELECT 2 AS b\; SELECT 3 AS c\; SELECT 4 AS d \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+
+COPY (select 'foo') to stdout \; COPY (select 'bar') to stdout \g :outfile1
+COPY reload_output(line) FROM :'outfile1';
+
+SELECT line FROM reload_output ORDER BY lineno;
+TRUNCATE TABLE reload_output;
+
+--
+-- \o file
+--
+\set outfile2 :abs_builddir '/results/psql-output2'
+
+\o :outfile2
+select max(unique1) from onek;
+SELECT 1 AS a\; SELECT 2 AS b\; SELECT 3 AS c;
+
+-- COPY TO file
+-- the data goes into :outfile1 and the command status into :outfile2
+\set QUIET false
+COPY (select unique1 from onek order by unique1 limit 10) TO :'outfile1';
+-- DML command status
+UPDATE onek SET unique1=unique1 WHERE false;
+\set QUIET true
+\o
+
+COPY reload_output(line) FROM :'outfile1';
+SELECT line FROM reload_output ORDER BY lineno;
+TRUNCATE TABLE reload_output;
+

Re: [patch] \g with multiple result sets and \watch with copy queries

2022-10-04 Thread Daniel Verite
Tom Lane wrote:

> Pushed after making some corrections.

Thanks!

> Given the time pressure, I did not worry about installing regression
> test coverage for this stuff, but I wonder if we shouldn't add some.

Currently, test/regress/sql/psql.sql doesn't AFAICS write anything
outside of stdout, but \g, \o, \copy need to write to external
files to be tested properly.

Looking at nearby tests, I see that commit d1029bb5a26 brings
interesting additions in test/regress/sql/misc.sql that could be used
as a model to handle output files. psql.sql could write
into PG_ABS_BUILDDIR, then read the files back with \copy I guess,
then output that again to stdout for comparison. I'll see if I can get
that to work.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




[patch] \g with multiple result sets and \watch with copy queries

2022-09-29 Thread Daniel Verite
Hi,

The psql improvement in v15 to output multiple result sets does not
behave as one might expect with \g: the output file or program
to pipe into is opened/closed on each result set, overwriting the
previous ones in the case of \g file.

Example:

psql -At 

Re: Question about user/database-level parameters

2022-08-04 Thread Daniel Verite
Japin Li wrote:

> However, if the database is in production, we cannot go into single-user
> mode, should we provide an option to change this behavior on the fly?

It already exists, through PGOPTIONS, which appears to work
for local_preload_libraries, in a quick test.

That is, you can log in by invoking psql with:
PGOPTIONS="-c local_preload_libraries="
and issue the ALTER USER to reset things back to normal
without stopping the instance.
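
For instance, assuming the offending setting was applied to a role
named alice on database mydb (hypothetical names), the whole round
trip could look like:

  PGOPTIONS="-c local_preload_libraries=" psql -U alice mydb
  mydb=> ALTER ROLE alice RESET local_preload_libraries;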


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Add header support to text format and matching feature

2022-06-15 Thread Daniel Verite
Julien Rouhaud wrote:

> Maybe that's just me but I understand "not supported" as "this makes
> sense, but this is currently a limitation that might be lifted
> later".
 
Looking at ProcessCopyOptions(), there are quite a few invalid
combinations of options that produce
ERRCODE_FEATURE_NOT_SUPPORTED currently:

- HEADER in binary mode
- FORCE_QUOTE outside of csv
- FORCE_QUOTE outside of COPY TO
- FORCE_NOT_NULL outside of csv
- FORCE_NOT_NULL outside of COPY FROM
- ESCAPE outside of csv
- delimiter appearing in the NULL specification
- csv quote appearing in the NULL specification

FORCE_QUOTE and FORCE_NOT_NULL are options that only make sense in one
direction, so the errors when using these in the wrong direction are
comparable to the "HEADER MATCH outside of COPY FROM" error that we
want to add. In that sense, ERRCODE_FEATURE_NOT_SUPPORTED would be
consistent.

The other errors in the list above are more about the format itself,
with options that make sense only for one format. So the way we're
supposed to understand ERRCODE_FEATURE_NOT_SUPPORTED in these
other cases is that such format does not support such feature,
but without implying that it's a limitation.
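
For instance, the first combination in the list above is rejected
like this (a sketch with a hypothetical table tbl; the error wording
is from memory and may vary across versions):

  =# COPY tbl FROM STDIN (FORMAT binary, HEADER);
  ERROR:  cannot specify HEADER in BINARY mode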


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Collation version tracking for macOS

2022-06-08 Thread Daniel Verite
Tom Lane wrote:

> Yeah, and it's exactly at the level of quirks that things are likely
> to change.  Nobody's going to suddenly start sorting B before A.
> They might, say, change their minds about where the digram "cz"
> sorts relative to single letters, in languages where special rules
> for that are a thing.

Independently of these rules, all Unicode collations change frequently
because each release of Unicode adds new characters. Any string
that contains a code point that was previously unassigned is going
to be sorted differently by all collations when that code point gets
assigned to a character.
Therefore the versions of all collations need to be bumped at every
Unicode release. This is what ICU does.
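
Those bumps can be observed from SQL after an ICU upgrade (a sketch,
assuming an ICU-enabled build):

  SELECT collname, collversion AS recorded,
         pg_collation_actual_version(oid) AS runtime
  FROM pg_collation
  WHERE collprovider = 'i'
    AND collversion IS DISTINCT FROM pg_collation_actual_version(oid);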

If the libc in macOS doesn't follow Unicode updates, this concern does
not apply to macOS, but let's assume an OS that tries to be up-to-date.
If major OS upgrades happen every year or less frequently,
each OS upgrade is likely to imply an upgrade of all the collations,
since the interval between Unicode releases tends to be a year
or less:
https://www.unicode.org/history/publicationdates.html



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Unicode Variation Selector and Combining character

2022-06-01 Thread Daniel Verite
Thomas Munro wrote:

> Looking around a bit, it might be interesting to check if the
> icu_character_boundaries() function in Daniel Vérité's icu_ext treats
> IVSs as single grapheme clusters.

It does.

with strings(s) as (
 values (U&'\+0066FE' || U&'\+0E0103'),
(U&'\+00304B' || U&'\+00309A')
)
select s,
  octet_length(s),
  char_length(s),
  (select count(*) from icu_character_boundaries(s,'en')) as graphemes
from strings;


    s    | octet_length | char_length | graphemes 
---------+--------------+-------------+-----------
 曾      |            7 |           2 |         1
 か゚     |            6 |           2 |         1



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: ICU_LOCALE set database default icu collation but not working as intended.

2022-05-28 Thread Daniel Verite
jian he wrote:

>   - dbicu3, ICU_LOCALE 'en-u-kr-latn-digit-kf-upper-kn-true'  seems
>   'kf-upper' not grouped strings beginning with character 'A' together?

You seem to expect that the sort algorithm compares characters
from left to right, and that when it compares 'A' and 'a', it will
sort the string with the 'A' first, no matter what other
characters appear in the rest of the string.

I don't think that's what kf-upper does. I think kf-upper kicks in
only for strings that are identical at the secondary level.
In your example, its effect is to make 'A 70' sort before
'a 70'. The other strings are unaffected.
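
A minimal way to check that in isolation (a sketch, assuming an
ICU-enabled build):

  CREATE COLLATION upper_first (provider = icu, locale = 'en-u-kf-upper');
  SELECT x FROM (VALUES ('a 70'), ('A 70'), ('A 11')) AS v(x)
  ORDER BY x COLLATE upper_first;
  -- 'A 11' comes first (its digits differ at the primary level),
  -- then 'A 70' before 'a 70' (kf-upper breaks the tie)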

>   - dbicu4, ICU_LOCALE 'en-u-kr-latn-digit-kn-true'  since upper/lower not
>   explicitly mentioned, and since the collation is deterministic, so
>   character  'A' should be grouped together first then do the numeric value

The deterministic property is only relevant when strings are compared equal
by ICU. Since your collations use the default strength setting (tertiary)
and the strings in your example are all different at this level,
the fact that the collation is deterministic does not play a role in
the results.

Besides, the TR35 doc says for "kn" (numeric ordering)

  "If set to on, any sequence of Decimal Digits (General_Category = Nd in
   the [UAX44]) is sorted at a primary level with its numeric value"

which means that the order of numbers (7, 11, 19, 70, 117) is "stronger"
(primary level) than the relative order of the 'a' and 'A'
(a case difference, i.e. tertiary level) that precede them.
That's why these numbers drive the sort for these strings that are
otherwise identical at the primary level.
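
The same can be verified with a standalone collation (a sketch):

  CREATE COLLATION numeric_sort (provider = icu, locale = 'en-u-kn-true');
  SELECT x FROM (VALUES ('a 117'), ('A 11'), ('a 7')) AS v(x)
  ORDER BY x COLLATE numeric_sort;
  -- yields 'a 7', 'A 11', 'a 117': the numeric values 7 < 11 < 117
  -- decide at the primary level, before case is even considered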


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: ICU_LOCALE set database default icu collation but not working as intended.

2022-05-26 Thread Daniel Verite
jian he wrote:

> CREATE
> DATABASE dbicu1 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8' ICU_LOCALE
> 'en-u-kf-upper' TEMPLATE 'template0';
> CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8' ICU_LOCALE
> 'en-u-kr-latn-digit' TEMPLATE 'template0';
> [...]
> I am not sure this is my personal misunderstanding.
> In the above examples, the first character of column *test_kr*
> is so different that the comparison is based on the first letter.
> If the first letter is the same then compute the second letter..
> So for whatever collation, I should expect 'A 19' to be adjacent with 'A
> 11'?

The query "SELECT test_kr FROM icu  ORDER BY def;"
does not order by test_kr, so the contents of test_kr have no bearing
on the order of the results.

If you order by test_kr, the results look like what you're expecting:

dbicu1=# SELECT test_kr,def FROM icu  ORDER BY test_kr;
 test_kr | def 
---------+-----
 1 a     | a
 8 p     | B
 A 11    | b
 A 19    | A
 a 7     | a
 Π1      | a

dbicu2=# SELECT test_kr,def FROM icu  ORDER BY test_kr ;
 test_kr | def 
---------+-----
 A 11    | b
 A 19    | A
 a 7     | a
 Π1      | a
 1 a     | a
 8 p     | B



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: variable filename for psql \copy

2022-04-25 Thread Daniel Verite
Jiří Fejfar wrote:

> I have found maybe buggy behaviour (of psql parser?) when using psql \copy
> with psql variable used for filename.

While it's annoying that it doesn't work as you tried it, this behavior is 
documented, so in that sense it's not a bug.
The doc also suggests a workaround in a tip section:

From psql manpage:

  The syntax of this command is similar to that of the SQL COPY
   command. All options other than the data source/destination are as
   specified for COPY. Because of this, special parsing rules apply to
   the \copy meta-command. Unlike most other meta-commands, the entire
   remainder of the line is always taken to be the arguments of \copy,
   and neither variable interpolation nor backquote expansion are
   performed in the arguments.

   Tip
   Another way to obtain the same result as \copy ... to is to use
   the SQL COPY ... TO STDOUT command and terminate it with \g
   filename or \g |program. Unlike \copy, this method allows the
   command to span multiple lines; also, variable interpolation
   and backquote expansion can be used.
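
Concretely, the tip above amounts to something like this (a sketch
with hypothetical names):

  \set outfile '/tmp/mytable.csv'
  COPY mytable TO STDOUT (FORMAT csv) \g :outfile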


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Add header support to text format and matching feature

2022-03-25 Thread Daniel Verite
Peter Eisentraut wrote:

> - The DefGetCopyHeader() function seems very bulky and might not be 
> necessary.  I think you can just check for the string "match" first and 
> then use defGetBoolean() as before if it didn't match.

The problem is that defGetBoolean() ends like this in the non-matching
case:

ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
 errmsg("%s requires a Boolean value",
def->defname)));

We don't want this error message when the string does not match
since it's really a tri-state option, not a boolean.

To avoid duplicating the code that recognizes true/false/on/off/0/1,
probably defGetBoolean()'s guts should be moved into another function
that does the matching and report to the caller instead of throwing an
error. Then DefGetCopyHeader() could call that non-throwing function.
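
Something like this, for instance (a sketch with a hypothetical
function name; the real helper would also have to cover a missing
argument and integer nodes, which defGetBoolean() accepts too):

  /*
   * Recognize the boolean spellings accepted by defGetBoolean(),
   * reporting failure to the caller instead of throwing.
   */
  static bool
  parse_boolean_spelling(const char *sval, bool *result)
  {
      if (pg_strcasecmp(sval, "true") == 0 ||
          pg_strcasecmp(sval, "on") == 0 ||
          strcmp(sval, "1") == 0)
      {
          *result = true;
          return true;
      }
      if (pg_strcasecmp(sval, "false") == 0 ||
          pg_strcasecmp(sval, "off") == 0 ||
          strcmp(sval, "0") == 0)
      {
          *result = false;
          return true;
      }
      return false;
  }

DefGetCopyHeader() would then check for "match" first, call this
helper for the boolean spellings, and raise its own tri-state-aware
error when neither applies.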


> - If you define COPY_HEADER_ABSENT = 0 in the enum, then most of the 
> existing truth checks don't need to be changed.

It's already 0 by default. Not changing some truth checks does work, but
then we get some code that treats CopyFromState.header_line like
a boolean and other code that treats it like an enum, which I found not
ideal wrt clarity in an earlier round of review [1]

It's not a major issue though, as it concerns only 3 lines of code in the v12 patch:
-   if (opts_out->binary && opts_out->header_line)
+   if (opts_out->binary && opts_out->header_line != COPY_HEADER_ABSENT)


+   /* on input check that the header line is correct if needed */
+   if (cstate->cur_lineno == 0 && cstate->opts.header_line != COPY_HEADER_ABSENT)

-   if (cstate->opts.header_line)
+   if (cstate->opts.header_line != COPY_HEADER_ABSENT)




> - In NextCopyFromRawFields(), it looks like you have code that replaces 
> the null_print value if the supplied column name is null.  I don't think 
> we should allow null column values.  Somehow, this should be an error.

+1



[1]
https://www.postgresql.org/message-id/80a9b594-01d6-4ee4-a612-9ae9fd3c1194%40manitou-mail.org


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: ICU for global collation

2022-03-15 Thread Daniel Verite
Finnerty, Jim wrote:

> In ICU, the "locale" is just the first part of what we can pass to the
> "locale" parameter in CREATE COLLATION - the part before the optional '@'
> delimiter.  The ICU locale does not include the secondary or tertiary
> properties,

Why not? Please see
https://unicode-org.github.io/icu/userguide/locale

It says that "a locale consists of one or more pieces of ordered
information", the pieces being a language code, a script code, a
country code, a variant code, and keywords.
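
For instance, a single ICU locale ID can carry several of these
pieces at once (a sketch):

  CREATE COLLATION sr_latin_numeric
    (provider = icu, locale = 'sr-Latn-RS-u-kn-true');
  -- language sr, script Latn, country RS, plus the kn keyword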


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: ICU for global collation

2022-03-15 Thread Daniel Verite
Julien Rouhaud wrote:

> > > While on that topic, the doc should probably mention that default ICU
> > > collations can only be deterministic.
> >
> > Well, there is no option to do otherwise, so I'm not sure where/how to
> > mention that.  We usually don't document options that don't exist. ;-)
> 
> Sure, but I'm afraid that users may still be tempted to use ICU locales like
> und-u-ks-level2 from the case_insensitive example in the doc and hope that
> it will work accordingly.

+1.

The CREATE DATABASE doc says this currently:

icu_locale

Specifies the ICU locale ID if the ICU locale provider is used.


ISTM that we need to say explicitly that this locale will be used by
default to compare all collatable strings, except that it's overruled
by a bytewise comparison to break ties in case of equality.

The idea is to describe what the backend will do with the setting
rather than saying that we don't have a nondeterministic option.
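
To illustrate the pitfall (a sketch with a hypothetical database
name):

  CREATE DATABASE db_ci LOCALE_PROVIDER icu
    ICU_LOCALE 'und-u-ks-level2' LOCALE 'en_US.UTF-8'
    TEMPLATE 'template0';
  -- accepted, but since the db-wide collation stays deterministic,
  -- 'abc' = 'ABC' is still false in db_ci: the bytewise tie-breaker
  -- overrules the level-2 comparison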


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: ICU for global collation

2022-01-11 Thread Daniel Verite
Julien Rouhaud wrote:

> > I guess there's still the possibility of requiring that the ICU db-wide
> > collation of the new database does exist in the template database,
> > and then the CREATE DATABASE would refer to that collation instead of
> > an independent locale string.
> 
> That could work.  However if having the collation in the template database is a
> strict requirement then something should also be done for initdb, and it will
> probably be a bigger problem.

If CREATE DATABASE referred to a collation in the template db,
either that collation already exists, or the user would have to add it
to the template db with CREATE COLLATION.
initdb already populates the template databases with a full set of
ICU collations through pg_import_system_collations().
I don't quite see what change you're seeing that would be needed in
initdb.
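
For reference, the initdb-imported set is directly visible from SQL
in any database created from those templates (a sketch):

  SELECT count(*) FROM pg_collation WHERE collprovider = 'i';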


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite




Re: ICU for global collation

2022-01-10 Thread Daniel Verite
Julien Rouhaud wrote:

> > The lack of these fields overall suggest the idea that when CREATE
> > DATABASE is called with a global ICU collation, what if it somehow
> > inserted the collation into pg_collation in the new database?
> > Then pg_database would just store the collation oid and no other
> > collation-related field would need to be added into it, now
> > or in the future.
> 
> I don't think it would be doable given the single-database-per-backend
> restriction.

By that I understand that CREATE DATABASE is limited to copying a template
database and cannot write anything into it beyond that, as it's
not even connected to it.
I guess there's still the possibility of requiring that the ICU db-wide
collation of the new database does exist in the template database,
and then the CREATE DATABASE would refer to that collation instead of
an independent locale string.
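
In other words (purely hypothetical syntax, just to picture the
idea):

  -- in the template database:
  CREATE COLLATION ci (provider = icu,
                       locale = 'und-u-ks-level2',
                       deterministic = false);
  -- then:
  CREATE DATABASE appdb TEMPLATE template1 COLLATION ci;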


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite




Re: ICU for global collation

2022-01-10 Thread Daniel Verite
Peter Eisentraut wrote:

> Unlike in the previous patch, where the ICU 
> collation name was written in datcollate, there is now a third column 
> (daticucoll), so we can store all three values.

I think some users would want their db-wide ICU collation to be
case/accent-insensitive. Postgres users are trained to expect
case-sensitive comparisons, but some apps initially made for
e.g. MySQL or MS-SQL that use such collations natively would be easier
to port to Postgres.
IIRC, that was the context for some questions where people were
enquiring about db-wide ICU collations.

With the current patch, it's not possible, AFAICS, because the user
can't tell that the collation is non-deterministic. Presumably this
would require another option to CREATE DATABASE and another
column to store that bit of information.

The "daticucol" column also suggests that we don't expect to add
other collation providers in the future. Maybe a pair of columns like
(datcollprovider, datcolllocale) would be more future-proof,
or a (datcollprovider, datcolllocale, datcollisdeterministic)
triplet if non-deterministic collations are allowed.

Also, pg_collation has "collversion" to detect a mismatch between
the ICU runtime and existing indexes. I don't see that field
for the db-wide ICU collation, so maybe we currently miss the capability
to detect the mismatch for the db-wide collation?

The lack of these fields overall suggest the idea that when CREATE
DATABASE is called with a global ICU collation, what if it somehow
inserted the collation into pg_collation in the new database?
Then pg_database would just store the collation oid and no other
collation-related field would need to be added into it, now
or in the future.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite




Re: ICU for global collation

2022-01-07 Thread Daniel Verite
Julien Rouhaud wrote:

> If you want a database with an ICU default collation the lc_collate
> and lc_ctype should inherit what's in the template database and not
> what was provided in the LOCALE I think.  You could still probably
> overload them in some scenario, but without a list of what isn't
> ICU-aware I can't really be sure of how often one might have to do
> it.

I guess we'd need that when creating a database with a different
encoding than the template databases, at least.

About what's not ICU-aware, I believe the most significant part in
core is the Full Text Search parser.
It doesn't care about sorting strings, but it relies on the functions
from POSIX ctype.h, which depend on LC_CTYPE
(it seems however that this could be improved by following
what has been done in backend/regex/regc_pg_locale.c, which has
comparable needs and calls ICU functions when applicable).

Also, any extension is potentially concerned. Surely many
extensions call functions from ctype.h assuming that
the current value of LC_CTYPE works with the data they handle.
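
The typical pattern looks like this (illustration only):

  #include <ctype.h>

  /* follows the libc LC_CTYPE setting, not any ICU locale */
  static int
  is_word_start(unsigned char c)
  {
      return isalpha(c) || c == '_';
  }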


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



