Re: [GENERAL] Concerning about Unicode-aware string handling

2012-05-21 Thread Craig Ringer
On 05/21/2012 06:59 PM, Andrew Sullivan wrote: > On Mon, May 21, 2012 at 02:44:45AM -0700, John R Pierce wrote: >> support the bastardized UTF-16 'unicode' implemented by Windows NT > To be fair to Microsoft, while the BOM might be an irritant, they do use a perfectly legitimate encoding of Unicode.

Re: [GENERAL] Concerning about Unicode-aware string handling

2012-05-21 Thread Tom Lane
Vincas Dargis writes: > Database created using: > initdb -D ../data -E utf-8 -U postgres That looks fairly dangerous, as it will absorb the database's locale settings (particularly LC_CTYPE, which is what you care about for these operations) from your shell environment. If the environment locale

Re: [GENERAL] Concerning about Unicode-aware string handling

2012-05-21 Thread Vincas Dargis
I forgot to mention that I'm working on Windows XP SP3. Yes, we are using UTF8 encoding, and the regexp gives wrong results. It looks like you replicated that. 2012/5/21 Albe Laurenz: > > I tried it with 9.1.3 on Linux: > > upper() and lower() work fine, no matter what the > database encoding is: > > test=> SE

Re: [GENERAL] Concerning about Unicode-aware string handling

2012-05-21 Thread Albe Laurenz
Vincas Dargis wrote: > We have problems (currently using 8.4, but also in the latest 9.1.3) in > our application with Unicode word symbols in Lithuanian ('ąčęėįšųūž'), > Russian and of course potentially other languages. > > For example, regexp_replace('acząčž', E'\\W', '', 'g') removes ąčž. > > lower

Re: [GENERAL] Concerning about Unicode-aware string handling

2012-05-21 Thread Vincas Dargis
Sorry, I have to write a "manual" reply since I've messed up my mailing list settings (got "Partial Digest"...). John R Pierce wrote: > your database encoding is UTF8 ? the language or environment you're using to > generate those strings such as 'acząčž' is also UTF8 ? Database created using: initdb

Re: [GENERAL] Concerning about Unicode-aware string handling

2012-05-21 Thread Andrew Sullivan
On Mon, May 21, 2012 at 02:44:45AM -0700, John R Pierce wrote: > support the bastardized UTF-16 'unicode' implemented by Windows NT To be fair to Microsoft, while the BOM might be an irritant, they do use a perfectly legitimate encoding of Unicode. There is no Unicode requirement that code points
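As a small illustration of the encoding under discussion (a sketch in Python, not anything posted in the thread): UTF-16 text may start with a byte order mark, and it is a variable-width encoding in which code points outside the Basic Multilingual Plane take two 16-bit units. The sample characters below are arbitrary choices for the example.

```python
import codecs

# Byte order mark for little-endian UTF-16, as commonly written on Windows.
bom = codecs.BOM_UTF16_LE  # the two bytes FF FE

# U+017E (ž) lies in the Basic Multilingual Plane: one 16-bit code unit.
bmp_bytes = "\u017e".encode("utf-16-le")

# U+1F600 lies outside the BMP: encoded as a surrogate pair (two code units).
astral_bytes = "\U0001F600".encode("utf-16-le")

print(len(bom), len(bmp_bytes), len(astral_bytes))
```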

Re: [GENERAL] Concerning about Unicode-aware string handling

2012-05-21 Thread John R Pierce
On 05/21/12 2:09 AM, Vincas Dargis wrote: > We have problems (currently using 8.4, but also in the latest 9.1.3) in our application with Unicode word symbols in Lithuanian ('ąčęėįšųūž'), Russian and of course potentially other languages. > For example, regexp_replace('acząčž', E'\\W', '', 'g') removes ąč

[GENERAL] Concerning about Unicode-aware string handling

2012-05-21 Thread Vincas Dargis
Hello, We have problems (currently using 8.4, but also in the latest 9.1.3) in our application with Unicode word symbols in Lithuanian ('ąčęėįšųūž'), Russian and of course potentially other languages. For example, regexp_replace('acząčž', E'\\W', '', 'g') removes ąčž. lower() and ~* comparison works
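The symptom described above can be reproduced outside PostgreSQL. The following is a minimal sketch using Python's `re` module as a stand-in (an assumption for illustration only, not PostgreSQL's regex engine): when `\W` is evaluated with ASCII-only character classes, the Lithuanian letters count as non-word characters and are stripped, while Unicode-aware classes keep them.

```python
import re

text = "acząčž"

# ASCII-only classes: \w is [a-zA-Z0-9_], so ą, č and ž match \W and are removed.
ascii_only = re.sub(r"\W", "", text, flags=re.ASCII)

# Unicode-aware classes (the Python 3 default for str patterns): letters are kept.
unicode_aware = re.sub(r"\W", "", text)

print(ascii_only)
print(unicode_aware)
```

The first result matches the behaviour reported in the thread; the second is what the poster expected to get.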