Re: [HACKERS] UPPER()/LOWER() and UTF-8
TL == Tom Lane [EMAIL PROTECTED] writes:

TL> upper/lower aren't going to work desirably in any multi-byte
TL> character set encoding.

>> Can you please point me at their implementation? I do not
>> understand why that's impossible.

TL> Because they use ctype.h's toupper() and tolower() functions,
TL> which only work on single-byte characters.

Aha, that's in src/backend/utils/adt/formatting.c, right? Yes, I see: it
goes byte by byte and uses toupper(). I believe we could look at the
locale and, if it is UTF-8, use (or copy) e.g. g_utf8_strup/strdown,
right?

http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strup

I believe that patch could be written in a matter of hours.

TL> There has been some discussion of using wctype.h where
TL> available, but this has a number of issues, notably figuring
TL> out the correct mapping from the server string encoding (eg
TL> UTF-8) to unpacked wide characters. At minimum we'd need to
TL> know which charset the locale setting is expecting, and there
TL> doesn't seem to be a portable way to find that out.

TL> IIRC, Peter thinks we must abandon use of libc's locale
TL> functionality altogether and write our own locale layer before
TL> we can really have all the locale-specific functionality we
TL> want.

I believe that native Unicode strings (together with human language
handling) should be introduced as an (almost) separate data type (which
has nothing to do with locale), but maybe that's blue-sky.

--alexm
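For readers following the wctype.h idea quoted above, here is a minimal
sketch of the round trip it implies: multibyte string to wide characters,
towupper() on each one, and back. It is not PostgreSQL code, and the
function name upper_via_wctype is made up for illustration. It assumes
that the process's LC_CTYPE locale expects the same encoding the string
is stored in, which is exactly the assumption described above as
impossible to verify portably.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: upper-case a multibyte string by going through
 * wide characters.  Returns a malloc'd string that the caller must
 * free, or NULL if the input is not valid in the current LC_CTYPE
 * locale's encoding or if memory runs out.
 */
char *upper_via_wctype(const char *src)
{
    assert(src != NULL);

    /* A string never has more wide characters than it has bytes. */
    size_t len = strlen(src);
    wchar_t *wide = malloc((len + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;

    /* Decode according to the current locale's charset. */
    size_t nwide = mbstowcs(wide, src, len + 1);
    if (nwide == (size_t) -1) {
        free(wide);
        return NULL;
    }

    /* Case-map one whole character at a time, not one byte. */
    for (size_t i = 0; i < nwide; i++)
        wide[i] = (wchar_t) towupper((wint_t) wide[i]);

    /* Each wide character needs at most MB_CUR_MAX bytes. */
    size_t cap = nwide * MB_CUR_MAX + 1;
    char *dst = malloc(cap);
    if (dst != NULL && wcstombs(dst, wide, cap) == (size_t) -1) {
        free(dst);
        dst = NULL;
    }
    free(wide);
    return dst;
}
```

A caller would first select a locale, for example with
setlocale(LC_CTYPE, ""), and the result for non-ASCII input then depends
entirely on what that locale provides. If the locale's charset and the
stored encoding disagree, the function either fails or silently maps the
wrong characters.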
Re: [HACKERS] UPPER()/LOWER() and UTF-8
TL == Tom Lane [EMAIL PROTECTED] writes:

TL> Alexey Mahotkin [EMAIL PROTECTED] writes:
>> I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
>> database encoding), and all is almost well, except that UPPER() and
>> LOWER() seem to ignore locale.

TL> upper/lower aren't going to work desirably in any multi-byte
TL> character set encoding.

Can you please point me at their implementation? I do not understand why
that's impossible.

TL> I think Peter E. is looking into what
TL> it would take to fix this for 7.5, but at present you are
TL> going to need to use a single-byte encoding within the server.
TL> (Nothing to stop you from using UTF-8 on the client side
TL> though.)

Thanks,

--alexm
Re: [HACKERS] UPPER()/LOWER() and UTF-8
Alexey Mahotkin wrote on Wed, 05.11.2003 at 17:11:

> Aha, that's in src/backend/utils/adt/formatting.c, right? Yes, I see,
> it goes byte by byte and uses toupper(). I believe we could look at
> the locale, and if it is UTF-8, then use (or copy) e.g.
> g_utf8_strup/strdown, right?
>
> http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strup
>
> I belive that patch could be written in a matter of hours.
>
> TL> There has been some discussion of using wctype.h where
> TL> available, but this has a number of issues, notably figuring
> TL> out the correct mapping from the server string encoding (eg
> TL> UTF-8) to unpacked wide characters. At minimum we'd need to
> TL> know which charset the locale setting is expecting, and there
> TL> doesn't seem to be a portable way to find that out.
>
> TL> IIRC, Peter thinks we must abandon use of libc's locale
> TL> functionality altogether and write our own locale layer before
> TL> we can really have all the locale-specific functionality we
> TL> want.
>
> I believe that native Unicode strings (together with human language
> handling) should be introduced as (almost) separate data type (which
> have nothing to do with locale), but that's bluesky maybe.

They should have nothing to do with the _system_ locale, but you can do
neither UPPER()/LOWER() nor ORDER BY unless you know the locale. It is
just that the locale should either be a property of the column or be
given in the SQL statement.

I guess one could write UCHAR, UVARCHAR, UTEXT types based on ICU.

- Hannu
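To make the point about needing to know the locale concrete, here is a
small sketch in which the language is passed explicitly instead of being
read from the system locale. The function name upper_codepoint and the
"tr" language tag are illustrative assumptions, not any ICU or
PostgreSQL API. The table is deliberately tiny (ASCII plus the Turkish
dotted and dotless i), so it shows the shape of the interface and
nothing more.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: upper-case one Unicode code point under an
 * explicitly named language.  Only ASCII letters and the Turkish
 * dotted/dotless i are handled; everything else is returned unchanged.
 */
uint32_t upper_codepoint(uint32_t cp, const char *lang)
{
    assert(lang != NULL);

    /* In Turkish, 'i' upper-cases to U+0130 (capital I with dot above). */
    if (cp == 'i' && strcmp(lang, "tr") == 0)
        return 0x0130;

    /* U+0131 (dotless i) upper-cases to plain 'I' in any language. */
    if (cp == 0x0131)
        return 'I';

    if (cp >= 'a' && cp <= 'z')
        return cp - ('a' - 'A');

    return cp;
}
```

The same input code point gives different answers for "en" and "tr",
which is why the mapping cannot be a property of the encoding alone: the
language has to come from somewhere, whether a column attribute or the
statement itself.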
Re: [HACKERS] UPPER()/LOWER() and UTF-8
On Tue, Nov 04, 2003 at 04:52:33PM -0500, Tom Lane wrote:
> Alexey Mahotkin [EMAIL PROTECTED] writes:
> > I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
> > database encoding), and all is almost well, except that UPPER() and
> > LOWER() seem to ignore locale.
>
> upper/lower aren't going to work desirably in any multi-byte character
> set encoding. I think Peter E. is looking into what it would take to

 It's a PostgreSQL problem, not a UTF problem: the standard PostgreSQL
 text functions know nothing about the encoding of their arguments, so
 they cannot switch to another method (for example, UTF-specific
 lower/upper) for working with strings. Maybe the internal text datatype
 could be extended a little so that, like VARSIZE(), one could use
 VARENCODING(). Maybe Peter already has a better idea.

> fix this for 7.5, but at present you are going to need to use a
> single-byte encoding within the server. (Nothing to stop you from
> using UTF-8 on the client side though.)

 You can use multibyte on the server side too, but then you must use,
 for example, the convert() function on the upper/lower arguments.

    Karel

-- 
 Karel Zak [EMAIL PROTECTED]
 http://home.zf.jcu.cz/~zakkr/
[HACKERS] UPPER()/LOWER() and UTF-8
Hello,

I'm running PostgreSQL 7.3.4 with the ru_RU.UTF-8 locale (with UNICODE
database encoding), and all is almost well, except that UPPER() and
LOWER() seem to ignore the locale.

I searched the sources a couple of times, but I do not understand where
the implementation of UPPER()/LOWER() is. Could you please point me in
the right direction? I'll try to understand and fix that. (But maybe
patches for that exist? Or maybe the FreeBSD 4.8-RELEASE UTF-8 locales
are broken in that respect?)

Thanks a lot,

--alexm
Re: [HACKERS] UPPER()/LOWER() and UTF-8
Alexey Mahotkin [EMAIL PROTECTED] writes:
> I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
> database encoding), and all is almost well, except that UPPER() and
> LOWER() seem to ignore locale.

upper/lower aren't going to work desirably in any multi-byte character
set encoding. I think Peter E. is looking into what it would take to
fix this for 7.5, but at present you are going to need to use a
single-byte encoding within the server. (Nothing to stop you from using
UTF-8 on the client side though.)

			regards, tom lane
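The multi-byte limitation described in this reply can be reproduced
outside the server. The sketch below is a standalone illustration, not
the PostgreSQL source: upper_bytewise passes every byte to toupper() on
its own, while upper_utf8_cyrillic is a hand-written stand-in for a
UTF-8-aware routine that decodes two-byte sequences first. Both function
names are invented here, the second one covers only ASCII and basic
Cyrillic (U+0430..U+045F), and the default "C" locale is assumed.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Byte-wise upper-casing: each byte goes through toupper() alone. */
void upper_bytewise(char *s)
{
    assert(s != NULL);
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++)
        s[i] = (char) toupper((unsigned char) s[i]);
}

/*
 * UTF-8-aware upper-casing for ASCII and basic Cyrillic only.
 * Works in place because every mapped character keeps its
 * two-byte length.  All other sequences are left untouched.
 */
void upper_utf8_cyrillic(char *s)
{
    assert(s != NULL);
    unsigned char *p = (unsigned char *) s;

    while (*p != '\0') {
        if (*p < 0x80) {
            /* Plain ASCII byte. */
            if (*p >= 'a' && *p <= 'z')
                *p = (unsigned char) (*p - ('a' - 'A'));
            p++;
        } else if ((*p & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            /* Two-byte sequence: decode, map, re-encode. */
            unsigned cp = ((unsigned) (*p & 0x1F) << 6) | (p[1] & 0x3Fu);
            if (cp >= 0x0430 && cp <= 0x044F)
                cp -= 0x20;     /* U+0430..U+044F -> U+0410..U+042F */
            else if (cp >= 0x0450 && cp <= 0x045F)
                cp -= 0x50;     /* U+0450..U+045F -> U+0400..U+040F */
            p[0] = (unsigned char) (0xC0 | (cp >> 6));
            p[1] = (unsigned char) (0x80 | (cp & 0x3F));
            p += 2;
        } else {
            /* Longer or malformed sequence: skip this byte. */
            p++;
        }
    }
}
```

In the "C" locale toupper() changes only the bytes for a to z, so the
byte-wise version leaves a UTF-8 Cyrillic word exactly as it was, which
matches the symptom reported at the start of the thread. Under a
single-byte locale it could be worse, because individual bytes of a
multi-byte character might be altered independently.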