Re: Pre-proposal: unicode normalized text

2024-03-14 Thread Jeff Davis
On Thu, 2024-02-29 at 17:02 -0800, Jeff Davis wrote: > Attached is an implementation of a per-database option STRICT_UNICODE > which enforces the use of assigned code points only. The CF app doesn't seem to point at the latest patch:

Re: Pre-proposal: unicode normalized text

2024-02-29 Thread Jeff Davis
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote: > It seems to me that this overlooks one of the major points of Jeff's > proposal, which is that we don't reject text input that contains > unassigned code points. That decision turns out to be really painful. Attached is an implementation of

Re: Pre-proposal: unicode normalized text

2023-11-03 Thread David Rowley
On Sat, 4 Nov 2023 at 10:57, Thomas Munro wrote: > > On Fri, Nov 3, 2023 at 9:01 PM David Rowley wrote: > > On Fri, 3 Nov 2023 at 20:49, Jeff Davis wrote: > > > I think I just need to add unicode_category.c to @pgcommonallfiles in > > > Mkvcbuild.pm. I'll do a trial commit tomorrow and see if

Re: Pre-proposal: unicode normalized text

2023-11-03 Thread Thomas Munro
On Fri, Nov 3, 2023 at 9:01 PM David Rowley wrote: > On Fri, 3 Nov 2023 at 20:49, Jeff Davis wrote: > > On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote: > > > bowerbird and hammerkop didn't like commit a02b37fc. They're still > > > using the old 3rd build system that is not tested by CI.

Re: Pre-proposal: unicode normalized text

2023-11-03 Thread Phil Krylov
On 2023-10-04 23:32, Chapman Flack wrote: Well, for what reason does anybody run PG now with the encoding set to anything besides UTF-8? I don't really have my finger on that pulse. Could it be that it bloats common strings in their local script, and with enough of those to store, it could

Re: Pre-proposal: unicode normalized text

2023-11-03 Thread Jeff Davis
On Fri, 2023-11-03 at 17:11 +0700, John Naylor wrote: > On Sat, Oct 28, 2023 at 4:15 AM Jeff Davis wrote: > > > > I plan to commit something like v3 early next week unless someone > > else > > has additional comments or I missed a concern. > > Hi Jeff, is the CF entry titled "Unicode character

Re: Pre-proposal: unicode normalized text

2023-11-03 Thread Jeff Davis
On Fri, 2023-11-03 at 21:01 +1300, David Rowley wrote: > Thomas mentioned this to me earlier today. After looking I also > concluded that unicode_category.c needed to be added to > @pgcommonallfiles. After looking at the time, I didn't expect you to > be around so opted just to push that to fix

Re: Pre-proposal: unicode normalized text

2023-11-03 Thread John Naylor
On Sat, Oct 28, 2023 at 4:15 AM Jeff Davis wrote: > > I plan to commit something like v3 early next week unless someone else > has additional comments or I missed a concern. Hi Jeff, is the CF entry titled "Unicode character general category functions" ready to be marked committed?

Re: Pre-proposal: unicode normalized text

2023-11-03 Thread David Rowley
On Fri, 3 Nov 2023 at 20:49, Jeff Davis wrote: > > On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote: > > bowerbird and hammerkop didn't like commit a02b37fc. They're still > > using the old 3rd build system that is not tested by CI. It's due > > for > > removal in the 17 cycle IIUC but in

Re: Pre-proposal: unicode normalized text

2023-11-03 Thread Jeff Davis
On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote: > bowerbird and hammerkop didn't like commit a02b37fc.  They're still > using the old 3rd build system that is not tested by CI.  It's due > for > removal in the 17 cycle IIUC but in the meantime I guess the new > codegen script needs to be

Re: Pre-proposal: unicode normalized text

2023-11-02 Thread Nico Williams
On Wed, Oct 04, 2023 at 01:15:03PM -0700, Jeff Davis wrote: > > The fact that there are multiple types of normalization and multiple > > notions of equality doesn't make this easier. And then there's text that isn't normalized to any of them. > NFC is really the only one that makes sense. Yes.

Re: Pre-proposal: unicode normalized text

2023-11-02 Thread Nico Williams
On Tue, Oct 17, 2023 at 05:07:40PM +0200, Daniel Verite wrote: > > * Add a per-database option to enforce only storing assigned unicode > > code points. > > There's a problem in the fact that the set of assigned code points is > expanding with every Unicode release, which happens about every

Re: Pre-proposal: unicode normalized text

2023-11-02 Thread Nico Williams
On Wed, Oct 04, 2023 at 01:16:22PM -0400, Robert Haas wrote: > There's a very popular commercial database where, or so I have been > led to believe, any byte sequence at all is accepted when you try to > put values into the database. [...] In other circles we call this "just-use-8". ZFS, for

Re: Pre-proposal: unicode normalized text

2023-11-02 Thread Nico Williams
On Fri, Oct 06, 2023 at 02:37:06PM -0400, Robert Haas wrote: > > Sure, because TEXT in PG doesn't have codeset+encoding as part of it -- > > it's whatever the database's encoding is. Collation can and should be a > > property of a column, since for Unicode it wouldn't be reasonable to > > make

Re: Pre-proposal: unicode normalized text

2023-11-02 Thread Thomas Munro
bowerbird and hammerkop didn't like commit a02b37fc. They're still using the old 3rd build system that is not tested by CI. It's due for removal in the 17 cycle IIUC but in the meantime I guess the new codegen script needs to be invoked by something under src/tools/msvc? varlena.obj : error

Re: Pre-proposal: unicode normalized text

2023-10-27 Thread Jeff Davis
On Mon, 2023-10-16 at 20:32 -0700, Jeff Davis wrote: > On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote: > > We need to be careful about precise terminology. "Valid" has a > > defined > > meaning for Unicode. A byte sequence can be valid or not as UTF-8. > > But > > a string

Re: Pre-proposal: unicode normalized text

2023-10-17 Thread Jeff Davis
On Tue, 2023-10-17 at 17:07 +0200, Daniel Verite wrote: > There's a problem in the fact that the set of assigned code points is > expanding with every Unicode release, which happens about every year. > > If we had this option in Postgres 11 released in 2018 it would use > Unicode 11, and in 2023

Re: Pre-proposal: unicode normalized text

2023-10-17 Thread Robert Haas
On Tue, Oct 17, 2023 at 11:38 AM Isaac Morland wrote: > On Tue, 17 Oct 2023 at 11:15, Robert Haas wrote: >> Are code points assigned from a gapless sequence? That is, is the >> implementation of codepoint_is_assigned(char) just 'codepoint < >> SOME_VALUE' and SOME_VALUE increases over time? > >

Re: Pre-proposal: unicode normalized text

2023-10-17 Thread Isaac Morland
On Tue, 17 Oct 2023 at 11:15, Robert Haas wrote: > Are code points assigned from a gapless sequence? That is, is the > implementation of codepoint_is_assigned(char) just 'codepoint < > SOME_VALUE' and SOME_VALUE increases over time? > Not even close. Code points are organized in blocks, e.g.
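To illustrate why a single threshold comparison cannot work, here is a minimal Python sketch. It assumes that "assigned" means "general category is not Cn", which is only one possible reading and not necessarily the definition used by the patch under discussion. The function name `codepoint_is_assigned` is borrowed from the question above. The result depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

def codepoint_is_assigned(cp: int) -> bool:
    # Assumption: treat general category "Cn" (unassigned/reserved,
    # including noncharacters) as "not assigned"; everything else counts
    # as assigned.
    return unicodedata.category(chr(cp)) != "Cn"

# Four consecutive code points in the Greek and Coptic block. The two in
# the middle are reserved, so assignment is not a gapless sequence.
samples = [0x0377, 0x0378, 0x0379, 0x037A]
results = {f"U+{cp:04X}": codepoint_is_assigned(cp) for cp in samples}

for name, assigned in results.items():
    print(name, "assigned" if assigned else "unassigned")
```

A real implementation would more likely use a range table generated from the Unicode data files than a per-character lookup like this one.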

Re: Pre-proposal: unicode normalized text

2023-10-17 Thread Robert Haas
On Tue, Oct 17, 2023 at 11:07 AM Daniel Verite wrote: > There's a problem in the fact that the set of assigned code points is > expanding with every Unicode release, which happens about every year. > > If we had this option in Postgres 11 released in 2018 it would use > Unicode 11, and in 2023

Re: Pre-proposal: unicode normalized text

2023-10-17 Thread Daniel Verite
Jeff Davis wrote: > I believe the patch has utility as-is, but I've been brainstorming a > few more ideas that could build on it: > > * Add a per-database option to enforce only storing assigned unicode > code points. There's a problem in the fact that the set of assigned code points is

Re: Pre-proposal: unicode normalized text

2023-10-11 Thread Jeff Davis
On Wed, 2023-10-11 at 08:51 +0200, Peter Eisentraut wrote: > I don't see how this would really work in practice.  Whether your > data > has unassigned code points or not, when the collations are updated to > the next Unicode version, the collations will have a new version > number, > and so you

Re: Pre-proposal: unicode normalized text

2023-10-11 Thread Jeff Davis
On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote: > On 11.10.23 03:08, Jeff Davis wrote: > >    * unicode_is_valid(text): returns true if all codepoints are > > assigned, false otherwise > > We need to be careful about precise terminology.  "Valid" has a > defined > meaning for Unicode. 

Re: Pre-proposal: unicode normalized text

2023-10-11 Thread Peter Eisentraut
On 11.10.23 03:08, Jeff Davis wrote: * unicode_is_valid(text): returns true if all codepoints are assigned, false otherwise We need to be careful about precise terminology. "Valid" has a defined meaning for Unicode. A byte sequence can be valid or not as UTF-8. But a string containing
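The terminology point can be made concrete with a small Python sketch. The helper names `is_valid_utf8` and `all_codepoints_assigned` are hypothetical and do not correspond to any function in the patch. "Assigned" is again assumed to mean "general category is not Cn" according to the Unicode version shipped with Python.

```python
import unicodedata

def is_valid_utf8(data: bytes) -> bool:
    # Validity is a property of the byte sequence: does it decode as UTF-8?
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def all_codepoints_assigned(text: str) -> bool:
    # Assignment is a property of the code points, relative to one
    # specific Unicode version (here, whatever Python's unicodedata uses).
    return all(unicodedata.category(ch) != "Cn" for ch in text)

reserved = "\u0378"                  # reserved code point in the Greek block
well_formed = reserved.encode("utf-8")
malformed = b"\xc3\x28"              # lead byte followed by a non-continuation byte

print(is_valid_utf8(well_formed), all_codepoints_assigned(reserved))
print(is_valid_utf8(malformed))
```

The first string is valid UTF-8 yet contains an unassigned code point, so the two checks answer different questions and arguably deserve different names.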

Re: Pre-proposal: unicode normalized text

2023-10-11 Thread Peter Eisentraut
On 10.10.23 16:02, Robert Haas wrote: On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut wrote: Can you restate what this is supposed to be for? This thread appears to have morphed from "let's normalize everything" to "let's check for unassigned code points", but I'm not sure what we are aiming

Re: Pre-proposal: unicode normalized text

2023-10-10 Thread Robert Haas
On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut wrote: > Can you restate what this is supposed to be for? This thread appears to > have morphed from "let's normalize everything" to "let's check for > unassigned code points", but I'm not sure what we are aiming for now. Jeff can say what he

Re: Pre-proposal: unicode normalized text

2023-10-10 Thread Peter Eisentraut
On 06.10.23 19:22, Jeff Davis wrote: On Fri, 2023-10-06 at 09:58 +0200, Peter Eisentraut wrote: If you want to be rigid about it, you also need to consider whether the Unicode version used by the ICU library in use matches the one used by the in-core tables. What problem are you concerned

Re: Pre-proposal: unicode normalized text

2023-10-10 Thread Peter Eisentraut
On 07.10.23 03:18, Jeff Davis wrote: On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote: At minimum I think we need to have some internal functions to check for unassigned code points. That belongs in core, because we generate the unicode tables from a specific version. That's a good idea.

Re: Pre-proposal: unicode normalized text

2023-10-09 Thread Robert Haas
On Fri, Oct 6, 2023 at 3:07 PM Jeff Davis wrote: > On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote: > > What I think people really want is a whole column in > > some encoding that isn't the normal one for that database. > > Do people really want that? I'd be curious to know why. Because

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Matthias van de Meent
On Fri, 6 Oct 2023, 21:08 Jeff Davis, wrote: > On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote: > > What I think people really want is a whole column in > > some encoding that isn't the normal one for that database. > > Do people really want that? I'd be curious to know why. > One reason

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Isaac Morland
On Fri, 6 Oct 2023 at 15:07, Jeff Davis wrote: > On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote: > > What I think people really want is a whole column in > > some encoding that isn't the normal one for that database. > > Do people really want that? I'd be curious to know why. > > A lot of

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Jeff Davis
On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote: > What I think people really want is a whole column in > some encoding that isn't the normal one for that database. Do people really want that? I'd be curious to know why. A lot of modern projects are simply declaring UTF-8 to be the "one

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Robert Haas
On Fri, Oct 6, 2023 at 2:25 PM Nico Williams wrote: > > > > Well, that would be making the encoding a per-value property, rather > > > > than a per-column property like collation as I proposed. I can't see > > > > > > On-disk it would be just a property of the type, not part of the value. > > > >

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Nico Williams
On Fri, Oct 06, 2023 at 02:17:32PM -0400, Robert Haas wrote: > On Fri, Oct 6, 2023 at 1:38 PM Nico Williams wrote: > > On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote: > > > On Thu, Oct 5, 2023 at 3:15 PM Nico Williams > > > wrote: > > > > Text+encoding can be just like bytea with a

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Robert Haas
On Fri, Oct 6, 2023 at 1:38 PM Nico Williams wrote: > On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote: > > On Thu, Oct 5, 2023 at 3:15 PM Nico Williams wrote: > > > Text+encoding can be just like bytea with a one- or two-byte prefix > > > indicating what codeset+encoding it's in.

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Jeff Davis
On Thu, 2023-10-05 at 14:52 -0500, Nico Williams wrote: > This is just how you encode the type of the string.  You have any > number > of options.  The point is that already PG can encode binary data, so > if > how to encode text of disparate encodings on the wire, building on > top > of the

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Nico Williams
On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote: > On Thu, Oct 5, 2023 at 3:15 PM Nico Williams wrote: > > Text+encoding can be just like bytea with a one- or two-byte prefix > > indicating what codeset+encoding it's in. That'd be how to encode > > such text values on the wire,

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Robert Haas
On Thu, Oct 5, 2023 at 3:15 PM Nico Williams wrote: > Text+encoding can be just like bytea with a one- or two-byte prefix > indicating what codeset+encoding it's in. That'd be how to encode > such text values on the wire, though on disk the column's type should > indicate the codeset+encoding,

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Jeff Davis
On Fri, 2023-10-06 at 09:58 +0200, Peter Eisentraut wrote: > If you want to be rigid about it, you also need to consider whether > the > Unicode version used by the ICU library in use matches the one used > by > the in-core tables. What problem are you concerned about here? I thought about it

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Peter Eisentraut
On 05.10.23 19:30, Jeff Davis wrote: Agreed, at least until we understand the set of users per-column encoding is important to. I acknowledge that the presence of per-column encoding in the standard is some kind of signal there, but not enough by itself to justify something so invasive. The

Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Peter Eisentraut
On 03.10.23 21:54, Jeff Davis wrote: Here, Jeff mentions normalization, but I think it's a major issue with collation support. If new code points are added, users can put them into the database before they are known to the collation library, and then when they become known to the collation

Re: Pre-proposal: unicode normalized text

2023-10-05 Thread Nico Williams
On Thu, Oct 05, 2023 at 03:49:37PM -0400, Tom Lane wrote: > Nico Williams writes: > > Text+encoding can be just like bytea with a one- or two-byte prefix > > indicating what codeset+encoding it's in. That'd be how to encode > > such text values on the wire, though on disk the column's type

Re: Pre-proposal: unicode normalized text

2023-10-05 Thread Tom Lane
Nico Williams writes: > Text+encoding can be just like bytea with a one- or two-byte prefix > indicating what codeset+encoding it's in. That'd be how to encode > such text values on the wire, though on disk the column's type should > indicate the codeset+encoding, so no need to add a prefix to

Re: Pre-proposal: unicode normalized text

2023-10-05 Thread Jeff Davis
On Thu, 2023-10-05 at 09:10 -0400, Isaac Morland wrote: > In the case you describe, the users don’t have text at all; they have > bytes, and a vague belief about what encoding the bytes might be in > and therefore what characters they are intended to represent. The > correct way to store that in

Re: Pre-proposal: unicode normalized text

2023-10-05 Thread Nico Williams
On Thu, Oct 05, 2023 at 07:31:54AM -0400, Robert Haas wrote: > [...] On the other hand, to do that in PostgreSQL, we'd need to > propagate the character set/encoding information into all of the > places that currently get the typmod and collation, and that is not a > small number of places. It's a

Re: Pre-proposal: unicode normalized text

2023-10-05 Thread Jeff Davis
On Thu, 2023-10-05 at 07:31 -0400, Robert Haas wrote: > It's a lot of infrastructure for the project to carry > around for a feature that's probably only going to continue to become > less relevant. Agreed, at least until we understand the set of users per-column encoding is important to. I

Re: Pre-proposal: unicode normalized text

2023-10-05 Thread Isaac Morland
On Thu, 5 Oct 2023 at 07:32, Robert Haas wrote: > But I do think that sometimes users are reluctant to perform encoding > conversions on the data that they have. Sometimes they're not > completely certain what encoding their data is in, and sometimes > they're worried that the encoding

Re: Pre-proposal: unicode normalized text

2023-10-05 Thread Robert Haas
On Wed, Oct 4, 2023 at 9:02 PM Isaac Morland wrote: >> > What about characters not in UTF-8? >> >> Honestly I'm not clear on this topic. Are the "private use" areas in >> unicode enough to cover use cases for characters not recognized by >> unicode? Which encodings in postgres can represent

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Isaac Morland
On Wed, 4 Oct 2023 at 17:37, Jeff Davis wrote: > On Wed, 2023-10-04 at 14:14 -0400, Isaac Morland wrote: > > Always store only UTF-8 in the database > > What problem does that solve? I don't see our encoding support as a big > source of problems, given that database-wide UTF-8 already works

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Nico Williams
On Wed, Oct 04, 2023 at 04:01:26PM -0700, Jeff Davis wrote: > On Wed, 2023-10-04 at 16:15 -0500, Nico Williams wrote: > > Better that than TEXT blobs w/ the encoding given by the `CREATE > > DATABASE` or `initdb` default! > > From an engineering perspective, yes, per-column encodings would be >

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Jeff Davis
On Wed, 2023-10-04 at 16:15 -0500, Nico Williams wrote: > Better that than TEXT blobs w/ the encoding given by the `CREATE > DATABASE` or `initdb` default! From an engineering perspective, yes, per-column encodings would be more flexible. But I still don't understand who exactly would use that,

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Nico Williams
On Wed, Oct 04, 2023 at 05:32:50PM -0400, Chapman Flack wrote: > Well, for what reason does anybody run PG now with the encoding set > to anything besides UTF-8? I don't really have my finger on that pulse. Because they still have databases that didn't use UTF-8 10 or 20 years ago that they

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Jeff Davis
On Wed, 2023-10-04 at 14:14 -0400, Isaac Morland wrote: > Always store only UTF-8 in the database What problem does that solve? I don't see our encoding support as a big source of problems, given that database-wide UTF-8 already works fine. In fact, some postgres features only work with UTF-8. I

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Chapman Flack
On 2023-10-04 16:38, Jeff Davis wrote: On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote: The SQL standard would have me able to: CREATE TABLE foo (    a CHARACTER VARYING CHARACTER SET UTF8,    b CHARACTER VARYING CHARACTER SET LATIN1 ) and so on Is there a use case for that? UTF-8 is

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Nico Williams
On Wed, Oct 04, 2023 at 01:38:15PM -0700, Jeff Davis wrote: > On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote: > > The SQL standard would have me able to: > > > > [...] > > _UTF8'Hello, world!' and _LATIN1'Hello, world!' > > Is there a use case for that? UTF-8 is able to encode any

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Jeff Davis
On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote: > The SQL standard would have me able to: > > CREATE TABLE foo ( >    a CHARACTER VARYING CHARACTER SET UTF8, >    b CHARACTER VARYING CHARACTER SET LATIN1 > ) > > and so on, and write character literals like > > _UTF8'Hello, world!' and

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Jeff Davis
On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote: > any byte sequence at all is accepted when you try to > put values into the database. We support SQL_ASCII, which allows something similar. > At any rate, if we were to go in the direction of rejecting code > points that aren't yet assigned,

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Isaac Morland
On Wed, 4 Oct 2023 at 14:05, Chapman Flack wrote: > On 2023-10-04 13:47, Robert Haas wrote: > > The SQL standard would have me able to: > > CREATE TABLE foo ( > a CHARACTER VARYING CHARACTER SET UTF8, > b CHARACTER VARYING CHARACTER SET LATIN1 > ) > > and so on, and write character

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Robert Haas
On Wed, Oct 4, 2023 at 2:02 PM Chapman Flack wrote: > Clearly, part of the job would involve making the wire protocol > able to transmit binary values and identify their encodings. Right. Which unfortunately is moving the goal posts into the stratosphere compared to any other work mentioned so

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Chapman Flack
On 2023-10-04 13:47, Robert Haas wrote: On Wed, Oct 4, 2023 at 1:27 PM Nico Williams wrote: A UTEXT type would be helpful for specifying that the text must be Unicode (in which transform?) even if the character data encoding for the database is not UTF-8. That's actually pretty thorny ...

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Robert Haas
On Wed, Oct 4, 2023 at 1:27 PM Nico Williams wrote: > A UTEXT type would be helpful for specifying that the text must be > Unicode (in which transform?) even if the character data encoding for > the database is not UTF-8. That's actually pretty thorny ... because right now client_encoding

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Nico Williams
On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote: > The idea is to have a new data type, say "UTEXT", that normalizes the > input so that it can have an improved notion of equality while still > using memcmp(). A UTEXT type would be helpful for specifying that the text must be Unicode

Re: Pre-proposal: unicode normalized text

2023-10-04 Thread Robert Haas
On Tue, Oct 3, 2023 at 3:54 PM Jeff Davis wrote: > I assume you mean because we reject invalid byte sequences? Yeah, I'm > sure that causes a problem for some (especially migrations), but it's > difficult for me to imagine a database working well with no rules at > all for the basic data

Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Nico Williams
On Tue, Oct 03, 2023 at 03:34:44PM -0700, Jeff Davis wrote: > On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote: > > Ugh, My client is not displaying 'a' correctly > > Ugh. Is that an argument in favor of normalization or against? Heheh, well, it's an argument in favor of more software

Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Jeff Davis
On Mon, 2023-10-02 at 10:47 +0200, Peter Eisentraut wrote: > I think a better direction here would be to work toward making > nondeterministic collations usable on the global/database level and > then > encouraging users to use those. > > It's also not clear which way the performance tradeoffs

Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Jeff Davis
On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote: > Ugh, My client is not displaying 'a' correctly Ugh. Is that an argument in favor of normalization or against? I've also noticed that some fonts render the same character a bit differently depending on the constituent code points. For

Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Nico Williams
On Tue, Oct 03, 2023 at 12:15:10PM -0700, Jeff Davis wrote: > On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote: > > I think you misunderstand Unicode normalization and equivalence.  > > There is no standard Unicode `normalize()` that would cause the > > above equality predicate to be true. 

Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Jeff Davis
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote: > It seems to me that this overlooks one of the major points of Jeff's > proposal, which is that we don't reject text input that contains > unassigned code points. That decision turns out to be really painful. Yeah, because we lose

Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Jeff Davis
On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote: > I think you misunderstand Unicode normalization and equivalence.  > There > is no standard Unicode `normalize()` that would cause the above > equality > predicate to be true.  If you normalize to NFD (normal form > decomposed) > then a

Re: Pre-proposal: unicode normalized text

2023-10-02 Thread Nico Williams
On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote: > One of the frustrations with using the "C" locale (or any deterministic > locale) is that the following returns false: > > SELECT 'á' = 'á'; -- false > > because those are the unicode sequences U&'\0061\0301' and U&'\00E1', >

Re: Pre-proposal: unicode normalized text

2023-10-02 Thread Robert Haas
On Mon, Oct 2, 2023 at 3:42 PM Peter Eisentraut wrote: > I think a better direction here would be to work toward making > nondeterministic collations usable on the global/database level and then > encouraging users to use those. It seems to me that this overlooks one of the major points of

Re: Pre-proposal: unicode normalized text

2023-10-02 Thread Peter Eisentraut
On 13.09.23 00:47, Jeff Davis wrote: The idea is to have a new data type, say "UTEXT", that normalizes the input so that it can have an improved notion of equality while still using memcmp(). I think a new type like this would obviously be suboptimal because it's nonstandard and most people

Pre-proposal: unicode normalized text

2023-09-12 Thread Jeff Davis
One of the frustrations with using the "C" locale (or any deterministic locale) is that the following returns false: SELECT 'á' = 'á'; -- false because those are the unicode sequences U&'\0061\0301' and U&'\00E1', respectively, so memcmp() returns non-zero. But it's really the same
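The comparison described above can be reproduced outside the database. The following Python sketch is an illustration only, not the proposed UTEXT implementation, and the helper name `nfc_equal` is made up for this example. It compares the two sequences from the message bytewise, then again after normalizing both to NFC.

```python
import unicodedata

decomposed = "\u0061\u0301"   # 'a' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e1"        # LATIN SMALL LETTER A WITH ACUTE

def nfc_equal(a: str, b: str) -> bool:
    # Normalize both sides to NFC, then compare code point by code point.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Bytewise comparison of the UTF-8 encodings, analogous to memcmp().
bytewise_equal = decomposed.encode("utf-8") == precomposed.encode("utf-8")

# Comparison after normalization.
normalized_equal = nfc_equal(decomposed, precomposed)

print(bytewise_equal, normalized_equal)
```

If text is normalized once on input, as the proposal suggests, later comparisons can remain plain bytewise comparisons and still treat these two spellings as equal.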