Re: [HACKERS] Unicode Normalization

2009-09-24 Thread pg
In a context using normalization, wouldn't you typically want to store a 
normalized-text type that could perhaps (depending on locale) take advantage of 
simpler, more-efficient comparison functions? Whether you're doing 
INSERT/UPDATE, or importing a flat text file, if you canonicalize characters 
and substrings of identical meaning when trivial distinctions of encoding are 
irrelevant, you're better off later. User-invocable normalization functions by 
themselves don't make much sense. (If Postgres now supports binary- or 
mixed-binary-and-text flat files, perhaps for restore purposes, the same thing 
applies.)
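
To make the normalize-on-input idea concrete, here is a minimal sketch in Python rather than anything PostgreSQL-specific. The helper name `normalize_on_input` is hypothetical, and NFC is only an assumed choice of form. It shows two canonically equivalent spellings of the same word that differ byte-for-byte until they are normalized, after which a plain byte comparison is enough.

```python
import unicodedata


def normalize_on_input(text: str, form: str = "NFC") -> str:
    """Hypothetical input hook: canonicalize text before it is stored."""
    return unicodedata.normalize(form, text)


# Two spellings of "café": precomposed U+00E9 vs. "e" + combining acute U+0301.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# As raw strings they are different code point sequences.
assert composed != decomposed

# After normalizing at input time, the stored values are identical,
# so a simple byte-wise comparison gives the expected answer.
stored_a = normalize_on_input(composed)
stored_b = normalize_on_input(decomposed)
assert stored_a.encode("utf-8") == stored_b.encode("utf-8")
```

The trade-off is that the decomposed input can no longer be returned to the user exactly as it was entered.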

David Hudson




Re: [HACKERS] Unicode Normalization

2009-09-24 Thread David E. Wheeler

On Sep 24, 2009, at 8:59 AM, Andrew Dunstan wrote:

That might be nice, but I'd be wary of a geometric multiplication  
of text types. We already have TEXT and CITEXT; what if we had your  
NTEXT (normalized text) but I wanted it to also be case-insensitive?


Actually, I don't think it's necessarily a good idea at all. If a  
user inputs a perfectly valid piece of UTF8 text, we should be able  
to give it back to them exactly, whether or not it's in normalized  
form. The normalized forms are useful for certain comparison  
purposes, but they don't affect the validity of the text. CITEXT  
doesn't mangle what is stored, just how it's compared.


Right, I don't think there's a need for a normalized TEXT type.

Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unicode Normalization

2009-09-24 Thread Andrew Dunstan



David E. Wheeler wrote:

On Sep 24, 2009, at 6:24 AM, p...@thetdh.com wrote:

In a context using normalization, wouldn't you typically want to 
store a normalized-text type that could perhaps (depending on locale) 
take advantage of simpler, more-efficient comparison functions?


That might be nice, but I'd be wary of a geometric multiplication of 
text types. We already have TEXT and CITEXT; what if we had your NTEXT 
(normalized text) but I wanted it to also be case-insensitive?


Actually, I don't think it's necessarily a good idea at all. If a user 
inputs a perfectly valid piece of UTF8 text, we should be able to give 
it back to them exactly, whether or not it's in normalized form. The 
normalized forms are useful for certain comparison purposes, but they 
don't affect the validity of the text. CITEXT doesn't mangle what is 
stored, just how it's compared.
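
A rough sketch of that distinction, in Python rather than in PostgreSQL terms: the stored value is left untouched and normalization is applied only at comparison time. The function name `equivalent` is hypothetical and NFC is an assumed form.

```python
import unicodedata


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings by their normalized forms without altering either."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The user supplied the decomposed spelling: "e" followed by U+0301.
user_input = "cafe\u0301"
stored = user_input  # stored exactly as given, no normalization applied

# The original text round-trips unchanged.
assert stored == user_input

# A raw comparison against the precomposed spelling fails...
assert stored != "caf\u00e9"

# ...but a normalization-aware comparison treats them as the same text.
assert equivalent(stored, "caf\u00e9")
```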



cheers

andrew



Re: [HACKERS] Unicode Normalization

2009-09-24 Thread David E. Wheeler

On Sep 24, 2009, at 6:24 AM, p...@thetdh.com wrote:

In a context using normalization, wouldn't you typically want to  
store a normalized-text type that could perhaps (depending on  
locale) take advantage of simpler, more-efficient comparison  
functions?


That might be nice, but I'd be wary of a geometric multiplication of  
text types. We already have TEXT and CITEXT; what if we had your NTEXT  
(normalized text) but I wanted it to also be case-insensitive?
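
One way to picture the alternative to a separate type per combination is a single comparison key built from independent options. This is only a sketch in Python; `comparison_key` and its parameters are hypothetical, and folding case before applying NFC is a simplification of full Unicode caseless matching.

```python
import unicodedata


def comparison_key(text: str, *, normalize: bool = True,
                   ignore_case: bool = False) -> str:
    """Build a key for comparison only; the original text is never modified."""
    if ignore_case:
        # str.casefold() is Python's aggressive lowercasing for caseless matching.
        text = text.casefold()
    if normalize:
        # NFC is an assumed choice of normalization form.
        text = unicodedata.normalize("NFC", text)
    return text


upper_decomposed = "CAFE\u0301"   # "CAFE" + combining acute accent
lower_composed = "caf\u00e9"      # precomposed lowercase spelling

# Normalization alone does not ignore case.
assert comparison_key(upper_decomposed) != comparison_key(lower_composed)

# Case folding alone does not reconcile composed vs. decomposed forms.
assert (comparison_key(upper_decomposed, normalize=False, ignore_case=True)
        != comparison_key(lower_composed, normalize=False, ignore_case=True))

# Both options together make the two spellings compare equal.
assert (comparison_key(upper_decomposed, ignore_case=True)
        == comparison_key(lower_composed, ignore_case=True))
```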


Whether you're doing INSERT/UPDATE, or importing a flat text file,  
if you canonicalize characters and substrings of identical meaning  
when trivial distinctions of encoding are irrelevant, you're better  
off later.  User-invocable normalization functions by themselves  
don't make much sense.


Well, they make sense because there's nothing else right now. It's an  
easy way to get some support in, and besides, it's mandated by the SQL  
standard.


(If Postgres now supports binary- or mixed-binary-and-text flat  
files, perhaps for restore purposes, the same thing applies.)


Don't follow this bit.

Best,

David



Re: [HACKERS] Unicode Normalization

2009-09-23 Thread David E. Wheeler

On Sep 23, 2009, at 11:08 AM, David E. Wheeler wrote:

I looked around and found the Public Software Group's utf8proc  
project, which even includes some PostgreSQL support (not, alas, for  
normalization). It has an MIT-licensed C library that offers these  
functions:


Sorry, forgot the link:

  http://www.public-software-group.org/utf8proc

Best,

David



Re: [HACKERS] Unicode Normalization

2009-09-23 Thread David E. Wheeler

On Sep 23, 2009, at 11:08 AM, David E. Wheeler wrote:

I just had a discussion on IRC about Unicode normalization in
PostgreSQL. Apparently there is no support for it currently.


BTW, the only reference I found on the [to do list](http://wiki.postgresql.org/wiki/Todo) was:



More sensible support for Unicode combining characters, normal forms


I think that should probably be reworded to refer to support for the
Unicode standard.


Best,

David
