Re: EBCDIC sorting as a use case for ICU rules

2023-08-30 Thread Daniel Verite
Peter Eisentraut wrote:

> Committed with some editing.  I moved the existing rules example from 
> the CREATE COLLATION page into the new section you created, so we have a 
> simple example followed by the complex example.

OK, thanks for pushing this!


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: EBCDIC sorting as a use case for ICU rules

2023-08-24 Thread Jehan-Guillaume de Rorthais
Hi,

Sorry to chime in so lately, I was waiting for some customer feedback.

On Wed, 21 Jun 2023 15:28:38 +0200
"Daniel Verite"  wrote:

> At a conference this week I was asked if ICU could be able to
> sort like EBCDIC [2].
> It turns out it has been already  asked on
> -general a few years ago [3] with no satisfactory answer at the time ,
> and that it can be implemented with rules in v16.

We worked with a customer few months ago about this question and end up with a
procedure to build new locale/collation for glibc and load them in PostgreSQL
[1].

Our customer built the fr_ebcdic locale file themselves, based on the EBCDIC
IBM500 codepage (including about the same characters than iso 8859-1) and share
it under the BY-CC licence. See in attachment.

The procedure is quite simple:

1. copy this file under "/usr/share/i18n/locales/fr_ebcdic"
2. build it using "localedef -c -i fr_ebcdic -f UTF-8 fr_ebcdic.UTF-8"
3. restart your PostgreSQL instance (because of localeset weird behavior)
4. "pg_import_system_collations('schema')" or create the collation, eg.:
   CREATE COLLATION fr_ebcdic (
 PROVIDER = libc,
 LC_COLLATE = fr_ebcdic.utf8,
 LC_CTYPE = fr_ebcdic.utf8
   );

Now, same question than for the ICU: do we want to provide documentation about
this? Online documentation about such feature are quite arid. In fact, this
could be useful in various other way than just EBCDIC.

Regards,

[1] https://www.postgresql.org/message-id/20230209144947.1dfad6c0%40karst


fr_ebcdic
Description: Binary data


Re: EBCDIC sorting as a use case for ICU rules

2023-08-23 Thread Peter Eisentraut

On 17.07.23 10:10, Daniel Verite wrote:

Peter Eisentraut wrote:


You can use whitespace in the rules.  For example,

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$


Nice, it's clearly better that the piece of code I had in the
previous patch.
It can also be made more compact by grouping consecutive
code points, for instance <*a-r for 'a' to 'r'
I changed it that way, and also moved '^' before '[' and ']',
since according to [1], '^' is at location 0xB0 and '[' and ']'
at 0xBA and 0xBB.

Updated patch attached.


Committed with some editing.  I moved the existing rules example from 
the CREATE COLLATION page into the new section you created, so we have a 
simple example followed by the complex example.






Re: EBCDIC sorting as a use case for ICU rules

2023-07-06 Thread Jeff Davis
On Thu, 2023-07-06 at 11:32 +0200, Peter Eisentraut wrote:
> CREATE COLLATION ebcdic (provider='icu', locale='und',
> rules=$$
> & ' ' < '.' < '<' < '(' < '+' < \|
> < '&' < '!' < '$' < '*' < ')' < ';'
> < '-' < '/' < ',' < '%' < '_' < '>' < '?'
> < '`' < ':' < '#' < '@' < \' < '=' < '"'
> < a < b < c < d < e < f < g < h < i
> < j < k < l < m < n < o < p < q < r
> < '~' < s < t < u < v < w < x < y < z
> < '[' < '^' < ']'
> < '{' < A < B < C < D < E < F < G < H < I
> < '}' < J < K < L < M < N < O < P < Q < R
> < '\'  < S < T < U < V < W < X < Y < Z
> < 0 < 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9
> $$);

That looks much nicer and would go nicely in the documentation along
with some explanation.

Regards,
Jeff Davis





Re: EBCDIC sorting as a use case for ICU rules

2023-07-06 Thread Peter Eisentraut

On 30.06.23 13:08, Daniel Verite wrote:

About making a doc patch from this, I've came up with the attached,
which generates a CREATE COLLATION statement with rules from an
arbitrary strings that just lists characters in whichever order is desired.


I like adding more documentation and links around this.  But I'm not 
sure how this code you are including is supposed to help users 
understand the rules language.  Effectively, this would be adding 
another rules mechanism on top of the existing one, but doesn't explain 
either one.






Re: EBCDIC sorting as a use case for ICU rules

2023-07-06 Thread Peter Eisentraut

On 21.06.23 15:28, Daniel Verite wrote:

A collation like the following this seems to work (the rule simply enumerates
US-ASCII letters in the EBCDIC alphabet order, with adequate quoting)

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$&'
'<'.'<'<'<'('<'+'<\|<'&'<'!'<'$'<'*'<')'<';'<'-'<'/'<','<'%'<'_'<'>'<'?'<'`'<':'<'#'<'@'<\'<'='<'"'?`:#@'="abcdefghijklmnopqr~stuvwxyz[^]{ABCDEFGHI}JKLMNOPQR\ST
UVWXYZ0123456789

Maybe this example could be added to the documentation except for
the problem that the rule is very long and dollar-quoting cannot be split
into several lines. Literals enclosed by single quotes can be split that
way, but would require escaping the single quotes in the rule, which
would lead to scary-looking over-quoted contents.


You can use whitespace in the rules.  For example,

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$
& ' ' < '.' < '<' < '(' < '+' < \|
< '&' < '!' < '$' < '*' < ')' < ';'
< '-' < '/' < ',' < '%' < '_' < '>' < '?'
< '`' < ':' < '#' < '@' < \' < '=' < '"'
< a < b < c < d < e < f < g < h < i
< j < k < l < m < n < o < p < q < r
< '~' < s < t < u < v < w < x < y < z
< '[' < '^' < ']'
< '{' < A < B < C < D < E < F < G < H < I
< '}' < J < K < L < M < N < O < P < Q < R
< '\'  < S < T < U < V < W < X < Y < Z
< 0 < 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9
$$);

(This particular layout is meant to match the rows in
https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout.)




Re: EBCDIC sorting as a use case for ICU rules

2023-06-21 Thread Jonathan S. Katz

On 6/21/23 12:14 PM, Jeff Davis wrote:

On Wed, 2023-06-21 at 15:28 +0200, Daniel Verite wrote:

At a conference this week I was asked if ICU could be able to
sort like EBCDIC [2]. It turns out it has been already  asked on
-general a few years ago [3] with no satisfactory answer at the time
,
and that it can be implemented with rules in v16.


Interesting, thank you!


+1 -- this is very helpful framing the problem, thank you!


This can be useful for people who migrate from mainframes to Postgres
and need their migration tests to produce the same sorted results as
the
original system.
Since rules can be defined at the database level with the icu_rules
option,
they don't even need to tweak their queries to add COLLATE clauses,
which surely is appreciable in that kind of project.


I still had some technical concerns about the ICU rules feature,
unfortunately, and one option is to only allow it for the collation
objects and not the database level collation. How much would that hurt
this use case?



I'm open to suggestions on whether this EBCDIC example is worth being
in the
doc in some form or putting this in the wiki would be good enough.


I like the idea of having a real example. Ideally, we could add some
explanation along the way about how the rule is constructed to match
EBCDIC, which would reduce the shock of a long rule like that.

I wonder why the rule syntax is such that it cannot be broken up? Would
it be incorrect for us to allow some whitespace in there?


I'll give the unhelpful comment of "yes, I agree we should have a real 
world example", especially one that seems relevant to helping more 
people adopt PostgreSQL.




OpenPGP_signature
Description: OpenPGP digital signature


Re: EBCDIC sorting as a use case for ICU rules

2023-06-21 Thread Jeff Davis
On Wed, 2023-06-21 at 15:28 +0200, Daniel Verite wrote:
> At a conference this week I was asked if ICU could be able to
> sort like EBCDIC [2]. It turns out it has been already  asked on
> -general a few years ago [3] with no satisfactory answer at the time
> ,
> and that it can be implemented with rules in v16.

Interesting, thank you!

> This can be useful for people who migrate from mainframes to Postgres
> and need their migration tests to produce the same sorted results as
> the
> original system.
> Since rules can be defined at the database level with the icu_rules
> option,
> they don't even need to tweak their queries to add COLLATE clauses,
> which surely is appreciable in that kind of project.

I still had some technical concerns about the ICU rules feature,
unfortunately, and one option is to only allow it for the collation
objects and not the database level collation. How much would that hurt
this use case?


> I'm open to suggestions on whether this EBCDIC example is worth being
> in the
> doc in some form or putting this in the wiki would be good enough.

I like the idea of having a real example. Ideally, we could add some
explanation along the way about how the rule is constructed to match
EBCDIC, which would reduce the shock of a long rule like that.

I wonder why the rule syntax is such that it cannot be broken up? Would
it be incorrect for us to allow some whitespace in there?

Regards,
Jeff Davis





Re: EBCDIC sorting as a use case for ICU rules

2023-06-21 Thread Joe Conway

On 6/21/23 09:28, Daniel Verite wrote:

In the "Order changes in PG16 since ICU introduction" discussion, one
sub-thread [1] was about having a credible use case for tailoring collations
with custom rules, a new feature in v16.

At a conference this week I was asked if ICU could be able to
sort like EBCDIC [2]. It turns out it has been already  asked on
-general a few years ago [3] with no satisfactory answer at the time ,
and that it can be implemented with rules in v16.


Oh, very cool! I have seen the requirement for EBCDIC come up multiple 
times over the years.





Maybe this example could be added to the documentation except for
the problem that the rule is very long and dollar-quoting cannot be split
into several lines. Literals enclosed by single quotes can be split that
way, but would require escaping the single quotes in the rule, which
would lead to scary-looking over-quoted contents.

I'm open to suggestions on whether this EBCDIC example is worth being in the
doc in some form or putting this in the wiki would be good enough.


I would definitely favor adding to the docs, but no idea how to deal 
with the length issue.



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





EBCDIC sorting as a use case for ICU rules

2023-06-21 Thread Daniel Verite
 Hi,

In the "Order changes in PG16 since ICU introduction" discussion, one
sub-thread [1] was about having a credible use case for tailoring collations
with custom rules, a new feature in v16.

At a conference this week I was asked if ICU could be able to
sort like EBCDIC [2]. It turns out it has been already  asked on
-general a few years ago [3] with no satisfactory answer at the time ,
and that it can be implemented with rules in v16.

A collation like the following this seems to work (the rule simply enumerates
US-ASCII letters in the EBCDIC alphabet order, with adequate quoting)

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$&'
'<'.'<'<'<'('<'+'<\|<'&'<'!'<'$'<'*'<')'<';'<'-'<'/'<','<'%'<'_'<'>'<'?'<'`'<':'<'#'<'@'<\'<'='<'"'?`:#@'="abcdefghijklmnopqr~stuvwxyz[^]{ABCDEFGHI}JKLMNOPQR\ST
UVWXYZ0123456789

Maybe this example could be added to the documentation except for
the problem that the rule is very long and dollar-quoting cannot be split
into several lines. Literals enclosed by single quotes can be split that
way, but would require escaping the single quotes in the rule, which
would lead to scary-looking over-quoted contents.

I'm open to suggestions on whether this EBCDIC example is worth being in the
doc in some form or putting this in the wiki would be good enough.



[1]
https://www.postgresql.org/message-id/flat/a28aba5fa6bf1abfff96e40b6d6acff8412edb15.camel%40j-davis.com

[2] https://en.wikipedia.org/wiki/EBCDIC

[3]
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F84A7AD%40G01JPEXMBYT05


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite