Re: [HACKERS] UTF8 regexp and char classes still does not work
Tom Lane writes:
> Hmm, you're right. I only tested that on Latin1 characters, for which
> it does work because those have Unicode points below 256. I'm not
> sure of a reasonable solution for the general case --- we certainly
> don't want this function iterating up to 2^21 or thereabouts.

Yes, I understand this problem. How does Perl do this? Maybe this Unicode
table could be precomputed, or linked into the postgres binary from an
external source?

> Your test case seems to be using KOI8 encoding, though, which doesn't
> have anything to do with UTF8 behavior.

It is only an example of the expected result. See the first test: it is
UTF8, two bytes per character:

> > --- CYRILLIC SMALL LETTER ZHE ~* CYRILLIC CAPITAL LETTER ZHE
> > select E'\320\266' ~* E'\320\226', E'\320\266' ~ '[[:alpha:]]+', 'g' ~
> > '[[:alpha:]]+';
> >  ?column? | ?column? | ?column?
> > ----------+----------+----------
> >  t        | f        | t

--
Sergey Burladyan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
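To make the precomputed-table idea above concrete, here is a minimal sketch in C. It assumes a sorted table of code point ranges that is searched by bisection, so a class test never has to scan every code point. The names cp_range, alpha_ranges and range_table_isalpha are hypothetical, and the three ranges are an illustrative excerpt only. This is not how PostgreSQL or Perl actually store character class data.

```c
#include <stddef.h>
#include <stdint.h>

/* One inclusive range of code points (hypothetical type). */
typedef struct
{
    uint32_t first;
    uint32_t last;
} cp_range;

/*
 * Illustrative excerpt only: sorted by "first", non-overlapping.
 * A real table would be generated from Unicode or locale data.
 */
static const cp_range alpha_ranges[] = {
    {0x0041, 0x005A},           /* LATIN CAPITAL LETTER A..Z */
    {0x0061, 0x007A},           /* LATIN SMALL LETTER A..Z */
    {0x0410, 0x044F},           /* CYRILLIC CAPITAL A..CYRILLIC SMALL YA */
};

/* Binary search over the range table; returns 1 if c is in some range. */
int
range_table_isalpha(uint32_t c)
{
    size_t lo = 0;
    size_t hi = sizeof(alpha_ranges) / sizeof(alpha_ranges[0]);

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (c < alpha_ranges[mid].first)
            hi = mid;
        else if (c > alpha_ranges[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}
```

With this excerpt, range_table_isalpha(0x0436) (CYRILLIC SMALL LETTER ZHE) returns 1 after at most two comparisons, and the cost grows with the logarithm of the number of ranges rather than with the size of the code space.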
Re: [HACKERS] UTF8 regexp and char classes still does not work
Sergey Burladyan writes:
> As far as I can see, Tom's patch 0d323425 changed only functions like
> pg_wc_isalpha, but pg_wc_isalpha is called from the function
>
> static struct cvec *
> cclass(struct vars * v,         /* context */
>        const chr *startp,       /* where the name starts */
>        const chr *endp,         /* just past the end of the name */
>        int cases)               /* case-independent? */
>
> and this function has the comment "For the moment, assume that only
> char codes < 256 can be in these classes". It calls pg_wc_isalpha like this:
>
>     for (i = 0; i <= UCHAR_MAX; i++)
>     {
>         if (pg_wc_isalpha((chr) i))
>             addchr(cv, (chr) i);
>     }
>
> UCHAR_MAX is 255.

Hmm, you're right. I only tested that on Latin1 characters, for which
it does work because those have Unicode points below 256. I'm not
sure of a reasonable solution for the general case --- we certainly
don't want this function iterating up to 2^21 or thereabouts.

Your test case seems to be using KOI8 encoding, though, which doesn't
have anything to do with UTF8 behavior.

			regards, tom lane
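The trade-off described above can be sketched with a small stand-alone C model of the quoted loop. Everything here is an assumption made for illustration: toy_isalpha is a stand-in predicate (ASCII letters plus U+0410..U+044F), not pg_wc_isalpha, and scan_class only counts members instead of calling addchr.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in predicate, NOT pg_wc_isalpha: ASCII letters plus U+0410..U+044F. */
int
toy_isalpha(uint32_t c)
{
    return (c >= 'A' && c <= 'Z') ||
        (c >= 'a' && c <= 'z') ||
        (c >= 0x0410 && c <= 0x044F);
}

/*
 * Mimic the shape of the quoted loop: test every code from 0 to max_code
 * and count the members found.  *tested receives the number of predicate
 * calls made.  max_code must be below UINT32_MAX or the loop cannot end.
 */
size_t
scan_class(uint32_t max_code, size_t *tested)
{
    size_t   members = 0;
    uint32_t i;

    *tested = 0;
    for (i = 0; i <= max_code; i++)
    {
        (*tested)++;
        if (toy_isalpha(i))
            members++;
    }
    return members;
}
```

With a bound of 255 the scan makes 256 predicate calls and finds only the 52 ASCII letters, so the Cyrillic block is never reached. With a bound of 0x10FFFF (the highest Unicode code point) it finds all 116 members of this toy class, at the cost of 1114112 predicate calls for every class lookup.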
[HACKERS] UTF8 regexp and char classes still does not work
I see this in the 9.0 release notes:

  - Support locale-specific regular expression processing with UTF-8
    server encoding (Tom Lane)

    Locale-specific regular expression functionality includes
    case-insensitive matching and locale-specific character classes.

But character classes still do not work. Example (git REL9_0_STABLE c767c3bd):

select version();
 version
---------
 PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Debian 4.4.4-8) 4.4.5 20100728 (prerelease), 64-bit

--- CYRILLIC SMALL LETTER ZHE ~* CYRILLIC CAPITAL LETTER ZHE
select E'\320\266' ~* E'\320\226', E'\320\266' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column?
----------+----------+----------
 t        | f        | t

All three must be true, as in the KOI8-R database below:

create database koi8 template template0 encoding 'koi8r' lc_collate 'ru_RU.KOI8-R' lc_ctype 'ru_RU.KOI8-R';
\c koi8
set client_encoding TO utf8;
select E'\326' ~* E'\366', E'\326' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column?
----------+----------+----------
 t        | t        | t

As far as I can see, Tom's patch 0d323425 changed only functions like
pg_wc_isalpha, but pg_wc_isalpha is called from the function

static struct cvec *
cclass(struct vars * v,         /* context */
       const chr *startp,       /* where the name starts */
       const chr *endp,         /* just past the end of the name */
       int cases)               /* case-independent? */

and this function has the comment "For the moment, assume that only
char codes < 256 can be in these classes". It calls pg_wc_isalpha like this:

    for (i = 0; i <= UCHAR_MAX; i++)
    {
        if (pg_wc_isalpha((chr) i))
            addchr(cv, (chr) i);
    }

UCHAR_MAX is 255.

I do not fully understand the regular expression algorithm, but I think
the cclass function also needs a fix.

--
Sergey Burladyan
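To show why the first test above falls outside that loop, the sketch below decodes the two UTF-8 bytes by hand and compares the result with UCHAR_MAX. decode_utf8_2byte and reachable_by_uchar_loop are hypothetical helpers written for this illustration; they are not PostgreSQL functions.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point.
 * Returns -1 if the bytes are not a well-formed two-byte sequence.
 */
int32_t
decode_utf8_2byte(unsigned char b1, unsigned char b2)
{
    int32_t cp;

    if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80)
        return -1;
    cp = ((int32_t) (b1 & 0x1F) << 6) | (int32_t) (b2 & 0x3F);
    if (cp < 0x80)
        return -1;              /* overlong encoding */
    return cp;
}

/* True if a loop running from 0 to UCHAR_MAX would ever test this code. */
int
reachable_by_uchar_loop(int32_t cp)
{
    return cp >= 0 && cp <= UCHAR_MAX;
}
```

Using the same octal values as the SQL literals, decode_utf8_2byte(0320, 0266) gives 0x436 (CYRILLIC SMALL LETTER ZHE) and decode_utf8_2byte(0320, 0226) gives 0x416 (CYRILLIC CAPITAL LETTER ZHE). Both are above 255, so a loop bounded by UCHAR_MAX never passes them to the class predicate.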