Re: [HACKERS] UPPER()/LOWER() and UTF-8

2003-11-09 Thread Alexey Mahotkin
 TL == Tom Lane [EMAIL PROTECTED] writes:

TL upper/lower aren't going to work desirably in any multi-byte
TL character set encoding.

 Can you please point me at their implementation?  I do not
 understand why that's impossible.

TL Because they use ctype.h's toupper() and tolower()
TL functions, which only work on single-byte characters.

Aha, that's in src/backend/utils/adt/formatting.c, right?

Yes, I see, it goes byte by byte and uses toupper().  I believe we
could look at the locale, and if it is UTF-8, then use (or copy)
e.g. g_utf8_strup/strdown, right?

 
http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strup
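
To make the byte-by-byte problem concrete, here is a minimal sketch in Python (not the actual formatting.c code). It assumes the loop behaves like toupper() restricted to ASCII, which is an assumption about how bytes above 0x7F are treated in a UTF-8 locale. The helper name bytewise_upper is hypothetical. It compares that loop with a mapping done on decoded characters, which is roughly what a g_utf8_strup-style routine would do.

```python
def bytewise_upper(data: bytes) -> bytes:
    """Uppercase one byte at a time, the way a toupper() loop sees the string.

    Assumption: only ASCII a-z is mapped; every other byte, including each
    byte of a multi-byte UTF-8 sequence, passes through unchanged.
    """
    out = bytearray()
    for b in data:
        if 0x61 <= b <= 0x7A:      # 'a'..'z'
            out.append(b - 0x20)   # shift to 'A'..'Z'
        else:
            out.append(b)
    return bytes(out)


sample = "привет, world"
utf8_bytes = sample.encode("utf-8")

# Byte-wise: each Cyrillic letter is two bytes, and neither byte is mapped.
bytewise_result = bytewise_upper(utf8_bytes).decode("utf-8")   # "привет, WORLD"

# Character-wise: decode first, then map whole characters.
charwise_result = utf8_bytes.decode("utf-8").upper()           # "ПРИВЕТ, WORLD"
```

Only the ASCII part changes in the byte-wise result, which matches the symptom of UPPER() seeming to ignore the locale.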

I believe that patch could be written in a matter of hours.


TL There has been some discussion of using wctype.h where
TL available, but this has a number of issues, notably figuring
TL out the correct mapping from the server string encoding (eg
TL UTF-8) to unpacked wide characters.  At minimum we'd need to
TL know which charset the locale setting is expecting, and there
TL doesn't seem to be a portable way to find that out.
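
A rough sketch of why the charset question matters, in Python rather than C. Here decode() stands in for the unpacking step to wide characters and str.upper() stands in for towupper(); both are stand-ins, not the libc calls, and the function name upper_via_wide_chars is hypothetical. The same bytes are unpacked once with the right charset and once with a wrong guess.

```python
def upper_via_wide_chars(data: bytes, assumed_charset: str) -> bytes:
    """Unpack bytes to characters, map case, and pack back again."""
    wide = data.decode(assumed_charset)           # bytes -> wide characters
    return wide.upper().encode(assumed_charset)   # case-map, then re-encode


ya_utf8 = "я".encode("utf-8")                     # two bytes in UTF-8

# Correct guess: the two bytes are one character and map to capital "Я".
right = upper_via_wide_chars(ya_utf8, "utf-8")

# Wrong guess: each byte is treated as a separate KOI8-R character, so the
# result is not the UTF-8 encoding of "Я".
wrong = upper_via_wide_chars(ya_utf8, "koi8-r")
```

If the server cannot tell which charset the locale expects, it has no way to pick between these two interpretations.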

TL IIRC, Peter thinks we must abandon use of libc's locale
TL functionality altogether and write our own locale layer before
TL we can really have all the locale-specific functionality we
TL want.

I believe that native Unicode strings (together with human language
handling) should be introduced as an (almost) separate data type (which
has nothing to do with locale), but maybe that's blue-sky.

--alexm

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] UPPER()/LOWER() and UTF-8

2003-11-09 Thread Alexey Mahotkin
 TL == Tom Lane [EMAIL PROTECTED] writes:

TL Alexey Mahotkin [EMAIL PROTECTED] writes:
 I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with
 UNICODE database encoding), and all is almost well, except that
 UPPER() and LOWER() seem to ignore locale.

TL upper/lower aren't going to work desirably in any multi-byte
TL character set encoding.  

Can you please point me at their implementation?  I do not understand
why that's impossible.

TL I think Peter E. is looking into what
TL it would take to fix this for 7.5, but at present you are
TL going to need to use a single-byte encoding within the server.
TL (Nothing to stop you from using UTF-8 on the client side
TL though.)


Thanks,

--alexm



Re: [HACKERS] UPPER()/LOWER() and UTF-8

2003-11-09 Thread Hannu Krosing
Alexey Mahotkin wrote on Wed, 05.11.2003 at 17:11:
 Aha, that's in src/backend/utils/adt/formatting.c, right?
 
 Yes, I see, it goes byte by byte and uses toupper().  I believe we
 could look at the locale, and if it is UTF-8, then use (or copy)
 e.g. g_utf8_strup/strdown, right?
 
  
 http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strup
 
 I believe that patch could be written in a matter of hours.
 
 
 TL There has been some discussion of using wctype.h where
 TL available, but this has a number of issues, notably figuring
 TL out the correct mapping from the server string encoding (eg
 TL UTF-8) to unpacked wide characters.  At minimum we'd need to
 TL know which charset the locale setting is expecting, and there
 TL doesn't seem to be a portable way to find that out.
 
 TL IIRC, Peter thinks we must abandon use of libc's locale
 TL functionality altogether and write our own locale layer before
 TL we can really have all the locale-specific functionality we
 TL want.
 
 I believe that native Unicode strings (together with human language
 handling) should be introduced as an (almost) separate data type (which
 has nothing to do with locale), but maybe that's blue-sky.

They should have nothing to do with the _system_ locale, but you can do
neither UPPER()/LOWER() nor ORDER BY unless you know the locale. It is
just that the locale should be either a property of the column or given
in the SQL statement.

I guess one could write UCHAR, UVARCHAR, UTEXT types based on ICU.
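
A small illustration of why the locale has to be known, sketched in Python. The function upper_for_locale and its language parameter are hypothetical and are not an ICU API. The one rule shown is the Turkish dotted/dotless i from Unicode's special casing rules; a real implementation would cover far more.

```python
def upper_for_locale(text: str, language: str) -> str:
    """Uppercase text, with the language passed in explicitly."""
    if language == "tr":
        # In Turkish, lowercase dotted "i" maps to capital dotted "İ" (U+0130),
        # not to plain "I". Dotless "ı" (U+0131) already maps to "I" by default.
        text = text.replace("i", "\u0130")
    return text.upper()


english = upper_for_locale("istanbul", "en")   # "ISTANBUL"
turkish = upper_for_locale("istanbul", "tr")   # "İSTANBUL"
```

The same input gives two different correct answers, so the language has to come from somewhere: a column property or the statement itself.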

-
Hannu




Re: [HACKERS] UPPER()/LOWER() and UTF-8

2003-11-05 Thread Karel Zak
On Tue, Nov 04, 2003 at 04:52:33PM -0500, Tom Lane wrote:
 Alexey Mahotkin [EMAIL PROTECTED] writes:
  I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
  database encoding), and all is almost well, except that UPPER() and
  LOWER() seem to ignore locale.
 
 upper/lower aren't going to work desirably in any multi-byte character
 set encoding.  I think Peter E. is looking into what it would take to

 It's a PostgreSQL problem and not a UTF problem: the standard PostgreSQL
 text functions know nothing about the encoding of their arguments, so
 they cannot use another method (for example, UTF-aware lower/upper) to
 work with strings.

 Maybe extend the internal text datatype a little and, like VARSIZE(),
 use VARENCODING(). Maybe Peter already has some better idea.
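
 One way to picture the VARENCODING() idea, as a Python sketch only. The header layout (a 4-byte size followed by a 1-byte encoding id), the encoding ids, and the function names are all assumptions made for illustration; they do not describe PostgreSQL's real varlena format.

```python
import struct

# Hypothetical encoding ids; these are not PostgreSQL's encoding numbers.
ENCODING_NAMES = {1: "utf-8", 2: "koi8-r"}
HEADER = struct.Struct("<IB")   # total size in bytes, then encoding id


def make_text(value: str, encoding_id: int) -> bytes:
    """Build a datum: header (size + encoding id) followed by the payload."""
    body = value.encode(ENCODING_NAMES[encoding_id])
    return HEADER.pack(HEADER.size + len(body), encoding_id) + body


def varsize(datum: bytes) -> int:
    return HEADER.unpack_from(datum)[0]


def varencoding(datum: bytes) -> int:
    return HEADER.unpack_from(datum)[1]


def text_upper(datum: bytes) -> bytes:
    """Uppercase a datum using the encoding recorded in its own header."""
    encoding_id = varencoding(datum)
    body = datum[HEADER.size:varsize(datum)]
    value = body.decode(ENCODING_NAMES[encoding_id])
    return make_text(value.upper(), encoding_id)


utf8_datum = make_text("привет", 1)
koi8_datum = make_text("привет", 2)
```

 Because each value carries its encoding, text_upper() can pick the right method per argument instead of guessing from the server setting.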

 fix this for 7.5, but at present you are going to need to use a
 single-byte encoding within the server.  (Nothing to stop you from using
 UTF-8 on the client side though.)

 You can use multibyte on the server side too, but then you must use,
 for example, the convert() function on the upper/lower arguments.
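
 The workaround can be sketched like this in Python, as one reading of the suggestion: convert to a single-byte encoding, apply a per-byte case mapping there, and convert back. The helper names are hypothetical, and KOI8-R is only an example choice of single-byte encoding.

```python
def single_byte_upper(data: bytes, charset: str) -> bytes:
    """Per-byte upper, like toupper() in a locale with a single-byte charset."""
    out = bytearray()
    for b in data:
        upper_char = bytes([b]).decode(charset).upper()
        try:
            mapped = upper_char.encode(charset)
        except UnicodeEncodeError:
            mapped = b""
        # Keep the original byte unless the mapping is exactly one byte.
        out += mapped if len(mapped) == 1 else bytes([b])
    return bytes(out)


def upper_via_conversion(utf8_data: bytes, charset: str = "koi8-r") -> bytes:
    """Convert UTF-8 to a single-byte charset, uppercase, and convert back."""
    converted = utf8_data.decode("utf-8").encode(charset)
    uppered = single_byte_upper(converted, charset)
    return uppered.decode(charset).encode("utf-8")


result = upper_via_conversion("привет".encode("utf-8"))
```

 The limitation is that the conversion step fails for any character the single-byte encoding cannot represent.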

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
 http://home.zf.jcu.cz/~zakkr/



[HACKERS] UPPER()/LOWER() and UTF-8

2003-11-04 Thread Alexey Mahotkin

Hello,

I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
database encoding), and all is almost well, except that UPPER() and
LOWER() seem to ignore locale.

I searched the sources a couple of times but could not find where
UPPER()/LOWER() are implemented.  Could you please point me in the
right direction?

I'll try to understand and fix that.  (But maybe patches for that
exist?  Or maybe FreeBSD 4.8-RELEASE utf-8 locales are broken in that
respect?)


Thanks a lot,

--alexm



Re: [HACKERS] UPPER()/LOWER() and UTF-8

2003-11-04 Thread Tom Lane
Alexey Mahotkin [EMAIL PROTECTED] writes:
 I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
 database encoding), and all is almost well, except that UPPER() and
 LOWER() seem to ignore locale.

upper/lower aren't going to work desirably in any multi-byte character
set encoding.  I think Peter E. is looking into what it would take to
fix this for 7.5, but at present you are going to need to use a
single-byte encoding within the server.  (Nothing to stop you from using
UTF-8 on the client side though.)

regards, tom lane
