Re: [HACKERS] UTF8 regexp and char classes still does not work

2010-09-28 Thread Sergey Burladyan
Tom Lane writes:

> Hmm, you're right.  I only tested that on Latin1 characters, for which
> it does work because those have Unicode points below 256.  I'm not
> sure of a reasonable solution for the general case --- we certainly
> don't want this function iterating up to 2^21 or thereabouts.

Yes, I understand the problem. How does Perl do this? Maybe the Unicode table
could be precomputed, or linked into the postgres binary from an external source?
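
To make the precomputed-table idea concrete, here is a minimal C sketch. It is
not what PostgreSQL or Perl actually does: it only shows a sorted array of code
point ranges searched by binary search, so that a class test costs O(log n)
comparisons instead of a scan over the whole code space. The names cp_range,
alpha_ranges, in_ranges and sketch_isalpha are hypothetical, and the three
ranges are a tiny hand-written excerpt; a real table would have to be generated
from the Unicode character database.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One inclusive range of code points belonging to a character class. */
typedef struct
{
    uint32_t first;
    uint32_t last;
} cp_range;

/*
 * Hypothetical excerpt of an "alpha" table: sorted by 'first' and
 * non-overlapping.  A real table would be generated, not typed by hand.
 */
static const cp_range alpha_ranges[] = {
    {0x0041, 0x005A},   /* LATIN CAPITAL LETTER A .. Z */
    {0x0061, 0x007A},   /* LATIN SMALL LETTER A .. Z */
    {0x0410, 0x044F},   /* CYRILLIC CAPITAL LETTER A .. SMALL LETTER YA */
};

/* Binary search for cp in a sorted, non-overlapping range table. */
static bool
in_ranges(uint32_t cp, const cp_range *tab, size_t n)
{
    size_t lo = 0;
    size_t hi = n;

    assert(tab != NULL || n == 0);
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (cp < tab[mid].first)
            hi = mid;               /* cp lies before this range */
        else if (cp > tab[mid].last)
            lo = mid + 1;           /* cp lies after this range */
        else
            return true;            /* first <= cp <= last */
    }
    return false;
}

/* Class test against the excerpt table above (assumption, not a PG API). */
static bool
sketch_isalpha(uint32_t cp)
{
    return in_ranges(cp, alpha_ranges,
                     sizeof(alpha_ranges) / sizeof(alpha_ranges[0]));
}
```

With this excerpt, sketch_isalpha(0x0436) (CYRILLIC SMALL LETTER ZHE) and
sketch_isalpha('g') are true, while sketch_isalpha('1') is false.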

> Your test case seems to be using KOI8 encoding, though, which doesn't
> have anything to do with UTF8 behavior.

That was just an example of the expected result. See the first test: it is UTF8,
two bytes per character:
> > --- CYRILLIC SMALL LETTER ZHE ~* CYRILLIC CAPITAL LETTER ZHE
> > select E'\320\266' ~* E'\320\226', E'\320\266' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
> >  ?column? | ?column? | ?column? 
> > ----------+----------+----------
> >  t        | f        | t


-- 
Sergey Burladyan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF8 regexp and char classes still does not work

2010-09-28 Thread Tom Lane
Sergey Burladyan writes:
> As far as I can see, Tom's patch 0d323425 changed only functions like
> pg_wc_isalpha, but pg_wc_isalpha is called from
>
> static struct cvec *
> cclass(struct vars * v,     /* context */
>        const chr *startp,   /* where the name starts */
>        const chr *endp,     /* just past the end of the name */
>        int cases)           /* case-independent? */
>
> This function has the comment "For the moment, assume that only char
> codes < 256 can be in these classes", and it calls pg_wc_isalpha like this:
>
>     for (i = 0; i <= UCHAR_MAX; i++)
>     {
>         if (pg_wc_isalpha((chr) i))
>             addchr(cv, (chr) i);
>     }
>
> UCHAR_MAX is 255.

Hmm, you're right.  I only tested that on Latin1 characters, for which
it does work because those have Unicode points below 256.  I'm not
sure of a reasonable solution for the general case --- we certainly
don't want this function iterating up to 2^21 or thereabouts.

Your test case seems to be using KOI8 encoding, though, which doesn't
have anything to do with UTF8 behavior.

regards, tom lane



[HACKERS] UTF8 regexp and char classes still does not work

2010-09-28 Thread Sergey Burladyan
I see this in the 9.0 release notes:
- Support locale-specific regular expression processing with UTF-8
  server encoding (Tom Lane)
Locale-specific regular expression functionality includes
case-insensitive matching and locale-specific character classes.

But character classes still do not work. Example (git REL9_0_STABLE c767c3bd):
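select version();
                                                        version
------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Debian 4.4.4-8) 4.4.5 20100728 (prerelease), 64-bit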

--- CYRILLIC SMALL LETTER ZHE ~* CYRILLIC CAPITAL LETTER ZHE
select E'\320\266' ~* E'\320\226', E'\320\266' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column? 
----------+----------+----------
 t        | f        | t

All three must be true, as in the KOI8-R database below:

create database koi8 template template0 encoding 'koi8r' lc_collate 'ru_RU.KOI8-R' lc_ctype 'ru_RU.KOI8-R';
\c koi8
set client_encoding TO utf8;
select E'\326' ~* E'\366', E'\326' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column? 
----------+----------+----------
 t        | t        | t

As far as I can see, Tom's patch 0d323425 changed only functions like
pg_wc_isalpha, but pg_wc_isalpha is called from

static struct cvec *
cclass(struct vars * v,     /* context */
       const chr *startp,   /* where the name starts */
       const chr *endp,     /* just past the end of the name */
       int cases)           /* case-independent? */

This function has the comment "For the moment, assume that only char
codes < 256 can be in these classes", and it calls pg_wc_isalpha like this:

    for (i = 0; i <= UCHAR_MAX; i++)
    {
        if (pg_wc_isalpha((chr) i))
            addchr(cv, (chr) i);
    }

UCHAR_MAX is 255.

I do not fully understand the regular expression algorithm, but I think the
cclass function also needs a fix.
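
The arithmetic behind the failing test can be checked with a small stand-alone
C sketch. The helper names utf8_decode2 and visited_by_cclass_loop are made up
for illustration and are not PostgreSQL functions. It decodes the two UTF-8
bytes \320\266 to the code point U+0436, which is above UCHAR_MAX, so the loop
above never passes it to pg_wc_isalpha. The KOI8-R byte \326 is below 256,
which, assuming a single-byte encoding uses the byte value as the chr, would
explain why the koi8 database gives the expected answer.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) to a code point. */
static uint32_t
utf8_decode2(const unsigned char *s)
{
    assert((s[0] & 0xE0) == 0xC0);  /* lead byte of a 2-byte sequence */
    assert((s[1] & 0xC0) == 0x80);  /* continuation byte */
    return ((uint32_t) (s[0] & 0x1F) << 6) | (uint32_t) (s[1] & 0x3F);
}

/*
 * True if a loop of the form "for (i = 0; i <= UCHAR_MAX; i++)" would
 * ever test this code point.  Hypothetical helper, for illustration only.
 */
static bool
visited_by_cclass_loop(uint32_t cp)
{
    return cp <= UCHAR_MAX;
}
```

Applied to the bytes 0320 0266, utf8_decode2 gives 0x0436, and
visited_by_cclass_loop(0x0436) is false, while visited_by_cclass_loop(0326)
is true.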

-- 
Sergey Burladyan
