Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-25 Thread Lincoln Yeoh
At 09:20 PM 8/24/2004 +0200, Peter Eisentraut wrote: David Wheeler wrote: That's not the trouble so much as that the locales can be badly If we always followed the principle X could be broken, so let's not use X, then we would never get anything done. Instead, X is broken, so fix it. broken,

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-24 Thread Peter Eisentraut
David Wheeler wrote: But given what you've said, Tatsuo, it makes me wonder if it's worth it to use the system locale default when running initdb? Yes, because that is the locale that the user prefers. If a locale is broken then you shouldn't set it as system locale in the first place. --

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-24 Thread David Wheeler
On Aug 23, 2004, at 10:25 PM, Joel wrote: If the locale machinery is functioning correctly (and if I understand correctly), there ought to be a setting that would allow those to collate to the same point. Bleh. There must be some distinction between them. It sounds like querying for synonyms.

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-24 Thread David Wheeler
On Aug 24, 2004, at 12:20 PM, Peter Eisentraut wrote: broken, and that they're useless for multilingual use. I don't agree with that, but perhaps we differ in our interpretation of multilingual use. If you have special requirements, you can always turn the locales off. Well, we're getting beyond

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-24 Thread Tom Lane
David Wheeler [EMAIL PROTECTED] writes: Hmm. I tried putting your string into a UNICODE database and I got ERROR: invalid byte sequence for encoding UNICODE: 0xc7 Really? Curious. Oh, are you sure that you got my UTF-8 data? Because it came back in your reply all mangled. I

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 3:46 PM, Markus Bertheau wrote: The collation rules of your (and my) locale say that these strings are the same: [EMAIL PROTECTED] markus]$ cat t [EMAIL PROTECTED] markus]$ uniq t [EMAIL PROTECTED] markus]$ Interesting. Make sure that you have initdb'd the database under

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Tom Lane
David Wheeler [EMAIL PROTECTED] writes: But is it possible to store non-UTF-8 data in a UNICODE database? In theory not ... but I think there was a discussion earlier that concluded that our check for encoding validity is not airtight ... regards, tom lane

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 3:59 PM, Tom Lane wrote: But is it possible to store non-UTF-8 data in a UNICODE database? In theory not ... but I think there was a discussion earlier that concluded that our check for encoding validity is not airtight ... Well, if it was mostly right, I wouldn't expect it to

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Tom Lane
David Wheeler [EMAIL PROTECTED] writes: Is the encoding check fixed in 8.0beta1? [ looks back at discussion... ] Actually I misremembered --- the discussion was about how we would *reject* legal UTF-8 codes that are more than 2 bytes long. So the code is broken, but not in the direction that

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 4:08 PM, Tom Lane wrote: [ looks back at discussion... ] Actually I misremembered --- the discussion was about how we would *reject* legal UTF-8 codes that are more than 2 bytes long. So the code is broken, but not in the direction that would cause your problem. Time for

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Ian Barwick
On Tue, 24 Aug 2004 00:46:50 +0200, Markus Bertheau [EMAIL PROTECTED] wrote: , 23.08.2004, 23:04, David Wheeler : On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote: er, the characters in name don't seem to match the characters in the query - '' vs. '' - does that have any bearing?

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Tom Lane
David Wheeler [EMAIL PROTECTED] writes: Is the problem query using an index? If so, does REINDEX help? Doesn't look like it: bric=# reindex index udx_keyword__name; REINDEX bric=# select * from keyword where name ='=BA=CF=C7=D1=C0=C7'; id | name | screen_name | sort_name |

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 4:35 PM, Tom Lane wrote: Hmm. I tried putting your string into a UNICODE database and I got ERROR: invalid byte sequence for encoding UNICODE: 0xc7 Really? Curious. So there's something funny happening here. What is your client_encoding setting? It's not set. I've had it

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 4:34 PM, Ian Barwick wrote: wild speculation in need of a Korean speaker, but: [EMAIL PROTECTED]:~/tmp cat j.txt [EMAIL PROTECTED]:~/tmp uniq j.txt All but the first and last lines are random Korean (Hangul) characters. Evidently our respective locales think all

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 4:49 PM, David Wheeler wrote: Hmm. I tried putting your string into a UNICODE database and I got ERROR: invalid byte sequence for encoding UNICODE: 0xc7 Really? Curious. Oh, are you sure that you got my UTF-8 data? Because it came back in your reply all mangled. Cheers,

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Ian Barwick
On Mon, 23 Aug 2004 16:50:04 -0700, David Wheeler [EMAIL PROTECTED] wrote: On Aug 23, 2004, at 4:34 PM, Ian Barwick wrote: wild speculation in need of a Korean speaker, but: [EMAIL PROTECTED]:~/tmp cat j.txt [EMAIL PROTECTED]:~/tmp uniq j.txt All but

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 5:07 PM, Ian Barwick wrote: Does this go away if you change your locale to C? Yes. Hallelujah! I'm running initdb again now. Cheers, David

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Tatsuo Ishii
, 23.08.2004, 23:04, David Wheeler : On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote: er, the characters in name don't seem to match the characters in the query - '' vs. '' - does that have any bearing? Yes, it means that = is doing the wrong thing!! The collation

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 5:22 PM, Tatsuo Ishii wrote: Locales for multibyte encodings are often broken on many platforms. I see identical things with Japanese on Red Hat. This is one of the reasons why I tell Japanese PostgreSQL users not to enable locale while initdb... Yep, and exporting my data,

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Tim Allen
Tom Lane wrote: David Wheeler [EMAIL PROTECTED] writes: bric=# reindex index udx_keyword__name; REINDEX bric=# select * from keyword where name ='=BA=CF=C7=D1=C0=C7'; id | name | screen_name | sort_name | active --++-+---+ 1218 |

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 6:49 PM, Tim Allen wrote: One possible clue: your original post in this thread was using encoding euc-kr, not unicode (utf-8). If your mailer was set to use that encoding, perhaps your other client software is/was also? Bah! Stupid Mail.app was trying to be too smart! Thanks,

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Joel
On Tue, 24 Aug 2004 01:34:46 +0200 Ian Barwick [EMAIL PROTECTED] wrote: ... wild speculation in need of a Korean speaker, but: [EMAIL PROTECTED]:~/tmp cat j.txt