Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-23 Thread Peter Eisentraut
On fre, 2012-02-17 at 10:19 -0500, Tom Lane wrote:
  What if you did this ONCE and wrote the results to a file someplace?
 
 That's still a cache, you've just defaulted on your obligation to think
 about what conditions require the cache to be flushed.  (In the case at
 hand, the trigger for a cache rebuild would probably need to be a glibc
 package update, which we have no way of knowing about.) 

We basically hardwire locale behavior at initdb time, so computing this
then and storing it somewhere for eternity would be consistent.




Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-23 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On fre, 2012-02-17 at 10:19 -0500, Tom Lane wrote:
 That's still a cache, you've just defaulted on your obligation to think
 about what conditions require the cache to be flushed.  (In the case at
 hand, the trigger for a cache rebuild would probably need to be a glibc
 package update, which we have no way of knowing about.) 

 We basically hardwire locale behavior at initdb time, so computing this
 then and storing it somewhere for eternity would be consistent.

Well, only if we could cache every locale-related libc inquiry we ever
make.  Locking down only part of the behavior does not sound like a
plan.

regards, tom lane



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread NISHIYAMA Tomoaki

I don't believe it is valid to ignore CJK characters above U+20000.
If they are used for names, they will be stored in the database.
If the behaviour is different from that of characters below U+FFFF, you will
get a bug report sooner or later.

see
CJK Extension B, C, and D
from
http://www.unicode.org/charts/

Also, there are some code points that could be regarded as letters and numbers:
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols

On the other hand, it is OK if processing of characters above U+10000 is very
slow, as long as they are processed correctly, because such characters are
considered rare.


On 2012/02/17, at 23:56, Andrew Dunstan wrote:

 
 
 On 02/17/2012 09:39 AM, Tom Lane wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Here's a wild idea: keep the class of each codepoint in a hash table.
 Initialize it with all codepoints up to 0xFFFF. After that, whenever a
 string contains a character that's not in the hash table yet, query the
 class of that character, and add it to the hash table. Then recompile
 the whole regex and restart the matching engine.
 Recompiling is expensive, but if you cache the results for the session,
 it would probably be acceptable.
 Dunno ... recompiling is so expensive that I can't see this being a win;
 not to mention that it would require fundamental surgery on the regex
 code.
 
 In the Tcl implementation, no codepoints above U+FFFF have any locale
 properties (alpha/digit/punct/etc), period.  Personally I'd not have a
 problem imposing the same limitation, so that dealing with stuff above
 that range isn't really a consideration anyway.
 
 
 up to U+FFFF is the BMP which is described as containing characters for 
 almost all modern languages, and a large number of special characters. It 
 seems very likely to be acceptable not to bother about the locale of code 
 points in the supplementary planes.
 
 See http://en.wikipedia.org/wiki/Plane_%28Unicode%29 for descriptions of 
 which sets of characters are involved.
 
 
 cheers
 
 andrew
 
 
 
 




Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
NISHIYAMA Tomoaki tomoa...@staff.kanazawa-u.ac.jp writes:
 I don't believe it is valid to ignore CJK characters above U+20000.
 If they are used for names, they will be stored in the database.
 If the behaviour is different from that of characters below U+FFFF, you will
 get a bug report sooner or later.

I am skeptical that there is enough usage of such things to justify
slowing regexp operations down for everybody.  Note that it's not only
the initial probe of libc behavior that's at stake here --- the more
character codes are treated as letters, the larger the DFA transition
maps get and the more time it takes to build them.  So I'm unexcited
about just cranking up the loop limit in pg_ctype_get_cache.

 On the other hand, it is OK if processing of characters above U+10000
 is very slow, as long as they are processed correctly, because such
 characters are considered rare.

Yeah, it's conceivable that we could implement something whereby
characters with codes above some cutoff point are handled via runtime
calls to iswalpha() and friends, rather than being included in the
statically-constructed DFA maps.  The cutoff point could likely be a lot
less than U+FFFF, too, thereby saving storage and map build time all
round.

However, that "we" above is the editorial "we".  *I* am not going to
do this.  Somebody who actually has a need for it should step up.

regards, tom lane
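
A minimal sketch of the cutoff idea (hypothetical names and cutoff value,
not the actual PostgreSQL code): classification below the cutoff is probed
from libc once per session and cached, while anything above the cutoff
falls back to a runtime libc call.  Only the dispatch is shown, not the
integration with the DFA maps.

    #include <stdbool.h>
    #include <wctype.h>

    #define CLASS_CUTOFF 0x7FF              /* hypothetical cutoff */

    static unsigned char alpha_map[CLASS_CUTOFF / 8 + 1];
    static bool alpha_map_ready = false;

    static bool
    char_is_alpha(wint_t c)
    {
        if (c <= CLASS_CUTOFF)
        {
            /* below the cutoff: probe libc once per session, then use a bitmap */
            if (!alpha_map_ready)
            {
                for (wint_t i = 0; i <= CLASS_CUTOFF; i++)
                    if (iswalpha(i))
                        alpha_map[i / 8] |= 1 << (i % 8);
                alpha_map_ready = true;
            }
            return (alpha_map[c / 8] >> (c % 8)) & 1;
        }
        /* above the cutoff: ask libc at match time, as proposed above */
        return iswalpha(c) != 0;
    }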



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Yeah, it's conceivable that we could implement something whereby
 characters with codes above some cutoff point are handled via runtime
 calls to iswalpha() and friends, rather than being included in the
 statically-constructed DFA maps.  The cutoff point could likely be a lot
 less than U+FFFF, too, thereby saving storage and map build time all
 round.

It's been proposed to build a “regexp” type in PostgreSQL which would
store the DFA directly and provide some way to run that DFA out of its
“storage” without recompiling.

Would such a mechanism be useful here?  Would it be useful only when
storing the regexp in a column somewhere then applying it in the query
from there (so most probably adding a join or subquery somewhere)?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
Dimitri Fontaine dimi...@2ndquadrant.fr writes:
 Tom Lane t...@sss.pgh.pa.us writes:
 Yeah, it's conceivable that we could implement something whereby
 characters with codes above some cutoff point are handled via runtime
 calls to iswalpha() and friends, rather than being included in the
 statically-constructed DFA maps.  The cutoff point could likely be a lot
 less than U+FFFF, too, thereby saving storage and map build time all
 round.

 It's been proposed to build a “regexp” type in PostgreSQL which would
 store the DFA directly and provide some way to run that DFA out of its
 “storage” without recompiling.

 Would such a mechanism be useful here?

No, this is about what goes into the DFA representation in the first
place, not about how we store it and reuse it.

regards, tom lane



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
I wrote:
 And here's a poorly-tested draft patch for that.

I've done some more testing now, and am satisfied that this works as
intended.  However, some crude performance testing suggests that people
might be annoyed with it.  As an example, in 9.1 with pl_PL.utf8 locale,
I see this:
select 'aaaaaaaaaaa' ~ '\w\w\w\w\w\w\w\w\w\w\w';
taking perhaps 0.75 ms on first execution and 0.4 ms on subsequent
executions, the difference being the time needed to compile and cache
the DFA representation of the regexp.  With the patch, the numbers are
more like 5 ms and 0.4 ms, meaning the compilation time has gone up by
something near a factor of 10, though AFAICT execution time hasn't
moved.  It's hard to tell how significant that would be to real-world
queries, but in the worst case where our caching of regexps doesn't help
much, it could be disastrous.

All of the extra time is in manipulation of the much larger number of
DFA arcs required to represent all the additional character codes that
are being considered to be letters.

Perhaps I'm being overly ASCII-centric, but I'm afraid to commit this
as-is; I think the number of people who are hurt by the performance
degradation will be greatly larger than the number who are glad because
characters in $random_alphabet are now seen to be letters.  I think an
actually workable solution will require something like what I speculated
about earlier:

 Yeah, it's conceivable that we could implement something whereby
 characters with codes above some cutoff point are handled via runtime
 calls to iswalpha() and friends, rather than being included in the
 statically-constructed DFA maps.  The cutoff point could likely be a lot
 less than U+FFFF, too, thereby saving storage and map build time all
 round.

In the meantime, I still think the caching logic is worth having, and
we could at least make some people happy if we selected a cutoff point
somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
what a good compromise cutoff would be.  One possibility is U+7FF, which
corresponds to the limit of what fits in 2-byte UTF8; but I don't know
if that corresponds to any significant dropoff in frequency of usage.

regards, tom lane
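
For reference, the U+7FF figure comes straight from the UTF-8 encoding: a
two-byte sequence (110xxxxx 10xxxxxx) carries 5 + 6 = 11 payload bits, so
it tops out at 2^11 - 1 = 0x7FF.  A small illustrative helper (not
PostgreSQL code) showing the standard boundaries:

    /* Number of bytes UTF-8 needs for a given code point (illustrative). */
    static int
    utf8_byte_length(unsigned int cp)
    {
        if (cp <= 0x7F)
            return 1;               /* ASCII, 7 payload bits */
        if (cp <= 0x7FF)
            return 2;               /* 5 + 6 = 11 payload bits */
        if (cp <= 0xFFFF)
            return 3;               /* 4 + 6 + 6 = 16 bits, the BMP */
        return 4;                   /* up to U+10FFFF, supplementary planes */
    }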



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Robert Haas
On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah, it's conceivable that we could implement something whereby
 characters with codes above some cutoff point are handled via runtime
 calls to iswalpha() and friends, rather than being included in the
 statically-constructed DFA maps.  The cutoff point could likely be a lot
 less than U+FFFF, too, thereby saving storage and map build time all
 round.

 In the meantime, I still think the caching logic is worth having, and
 we could at least make some people happy if we selected a cutoff point
 somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
 what a good compromise cutoff would be.  One possibility is U+7FF, which
 corresponds to the limit of what fits in 2-byte UTF8; but I don't know
 if that corresponds to any significant dropoff in frequency of usage.

The problem, of course, is that this probably depends quite a bit on
what language you happen to be using.  For some languages, it won't
matter whether you cut it off at U+FF or U+7FF; while for others even
U+FFFF might not be enough.  So I think this is one of those cases
where it's somewhat meaningless to talk about frequency of usage.

In theory you can imagine a regular expression engine where these
decisions can be postponed until we see the string we're matching
against.  IOW, your DFA ends up with state transitions for characters
specifically named, plus a state transition for anything else that's
a letter, plus a state transition for anything else not otherwise
specified.  Then you only need to test the letters that actually
appear in the target string, rather than all of the ones that might
appear there.

But implementing that could be quite a lot of work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
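
A rough sketch of the transition lookup described above, with the
class-membership test deferred to match time; the types and names are
made up for illustration and are not the actual regex engine's:

    #include <wctype.h>

    typedef int state_t;

    typedef struct
    {
        wint_t      ch;             /* a character the regex names explicitly */
        state_t     next;
    } named_arc;

    typedef struct
    {
        named_arc  *named;
        int         nnamed;
        state_t     other_letter;   /* arc for "anything else that's a letter" */
        state_t     other;          /* arc for anything else not otherwise named */
    } dfa_state;

    static state_t
    next_state(const dfa_state *s, wint_t c)
    {
        for (int i = 0; i < s->nnamed; i++)
            if (s->named[i].ch == c)
                return s->named[i].next;
        if (iswalpha(c))            /* decided only for characters actually seen */
            return s->other_letter;
        return s->other;
    }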



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Vik Reykja
On Sun, Feb 19, 2012 at 04:33, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Yeah, it's conceivable that we could implement something whereby
  characters with codes above some cutoff point are handled via runtime
  calls to iswalpha() and friends, rather than being included in the
  statically-constructed DFA maps.  The cutoff point could likely be a lot
  less than U+FFFF, too, thereby saving storage and map build time all
  round.
 
  In the meantime, I still think the caching logic is worth having, and
  we could at least make some people happy if we selected a cutoff point
  somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
  what a good compromise cutoff would be.  One possibility is U+7FF, which
  corresponds to the limit of what fits in 2-byte UTF8; but I don't know
  if that corresponds to any significant dropoff in frequency of usage.

 The problem, of course, is that this probably depends quite a bit on
 what language you happen to be using.  For some languages, it won't
 matter whether you cut it off at U+FF or U+7FF; while for others even
  U+FFFF might not be enough.  So I think this is one of those cases
 where it's somewhat meaningless to talk about frequency of usage.


Does it make sense for regexps to have collations?


Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Robert Haas
On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja vikrey...@gmail.com wrote:
 Does it make sense for regexps to have collations?

As I understand it, collations determine the sort-ordering of strings.
 Regular expressions don't care about that.  Why do you ask?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Vik Reykja
On Sun, Feb 19, 2012 at 05:03, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja vikrey...@gmail.com wrote:
  Does it make sense for regexps to have collations?

 As I understand it, collations determine the sort-ordering of strings.
  Regular expressions don't care about that.  Why do you ask?


Perhaps I used the wrong term, but I was thinking the locale could tell us
what alphabet we're dealing with. So a regexp using en_US would give
different word-boundary results from one using zh_CN.


Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 In theory you can imagine a regular expression engine where these
 decisions can be postponed until we see the string we're matching
 against.  IOW, your DFA ends up with state transitions for characters
 specifically named, plus a state transition for anything else that's
 a letter, plus a state transition for anything else not otherwise
 specified.  Then you only need to test the letters that actually
 appear in the target string, rather than all of the ones that might
 appear there.

 But implementing that could be quite a lot of work.

Yeah, not to mention slow.  The difficulty is overlapping sets of
characters.  As a simple example, if your regex refers to 3, 7,
[[:digit:]], X, and [[:alnum:]], then you end up needing five distinct
colors: 3, 7, X, all digits that aren't 3 or 7, all alphanumerics
that aren't any of the preceding.  And state transitions for the digit
and alnum cases had better mention all and only the correct colors.
I've been tracing through the logic this evening, and it works pretty
simply given that all named character classes are immediately expanded
out to their component characters.  If we are going to try to keep
the classes in some kind of symbolic form, it's a lot messier.  In
particular, I think your sketch above would lead to having to test
every character against iswdigit and iswalnum at runtime, which would
be disastrous performancewise.  I'd like to at least avoid that for the
shorter (and presumably more common) UTF8 codes.

regards, tom lane
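
Spelling that example out, the five disjoint colors and how the bracket
classes decompose onto them would look roughly like this (a sketch of the
idea, not the engine's actual data structure):

    color 1: '3'
    color 2: '7'
    color 3: 'X'
    color 4: every other character satisfying [[:digit:]]
    color 5: every other character satisfying [[:alnum:]]

    [[:digit:]]  ->  colors {1, 2, 4}
    [[:alnum:]]  ->  colors {1, 2, 3, 4, 5}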



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
Vik Reykja vikrey...@gmail.com writes:
 On Sun, Feb 19, 2012 at 05:03, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja vikrey...@gmail.com wrote:
 Does it make sense for regexps to have collations?

 As I understand it, collations determine the sort-ordering of strings.
 Regular expressions don't care about that.  Why do you ask?

 Perhaps I used the wrong term, but I was thinking the locale could tell us
 what alphabet we're dealing with. So a regexp using en_US would give
 different word-boundary results from one using zh_CN.

Our interpretation of a collation is that it sets both LC_COLLATE and
LC_CTYPE.  Regexps may not care about the first but they definitely care
about the second.  This is why the stuff in regc_pg_locale.c pays
attention to collation.

regards, tom lane



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Robert Haas
On Sat, Feb 18, 2012 at 11:16 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 In theory you can imagine a regular expression engine where these
 decisions can be postponed until we see the string we're matching
 against.  IOW, your DFA ends up with state transitions for characters
 specifically named, plus a state transition for anything else that's
 a letter, plus a state transition for anything else not otherwise
 specified.  Then you only need to test the letters that actually
 appear in the target string, rather than all of the ones that might
 appear there.

 But implementing that could be quite a lot of work.

 Yeah, not to mention slow.  The difficulty is overlapping sets of
 characters.  As a simple example, if your regex refers to 3, 7,
 [[:digit:]], X, and [[:alnum:]], then you end up needing five distinct
 colors: 3, 7, X, all digits that aren't 3 or 7, all alphanumerics
 that aren't any of the preceding.  And state transitions for the digit
 and alnum cases had better mention all and only the correct colors.

Yeah, that's unfortunate.  On the other hand, if you don't use colors
for this case, aren't you going to need, for each DFA state, a
gigantic lookup table that includes every character in the server
encoding?  Even if you've got plenty of memory, initializing such a
beast seems awfully expensive, and it might not do very good things
for cache locality, either.

 I've been tracing through the logic this evening, and it works pretty
 simply given that all named character classes are immediately expanded
 out to their component characters.  If we are going to try to keep
 the classes in some kind of symbolic form, it's a lot messier.  In
 particular, I think your sketch above would lead to having to test
 every character against iswdigit and iswalnum at runtime, which would
 be disastrous performancewise.  I'd like to at least avoid that for the
 shorter (and presumably more common) UTF8 codes.

Hmm, but you could cache that information.  Instead of building a
cache that covers every possible character that might appear in the
target string, you can just cache the results for the code points that
you actually see.

Yet another option would be to dictate that the cache can't have holes - it
will always include information for every code point from 0 up to some
value X.  If we see a code point in the target string which is greater
than X, then we extend the cache out as far as that code point.  That
way, people who are using only code points out to U+FF (or even U+7F)
don't pay the cost of building a large cache, but people who need it
can get correct behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
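
A toy sketch of the "no holes" cache (illustrative names, error handling
kept trivial): the classification is always known for every code point in
[0, cache_limit], and seeing a larger code point extends the cache up to it.

    #include <stdbool.h>
    #include <stdlib.h>
    #include <wctype.h>

    static unsigned char *isalpha_cache = NULL;
    static wint_t cache_limit = 0;      /* highest code point cached so far */

    static bool
    cached_isalpha(wint_t c)
    {
        if (isalpha_cache == NULL || c > cache_limit)
        {
            wint_t      start = (isalpha_cache == NULL) ? 0 : cache_limit + 1;
            wint_t      new_limit = (c < 0x7F) ? 0x7F : c;
            unsigned char *tmp = realloc(isalpha_cache, new_limit + 1);

            if (tmp == NULL)
                abort();                /* real code would report the error */
            isalpha_cache = tmp;
            /* fill in everything not probed yet, so the cache has no holes */
            for (wint_t i = start; i <= new_limit; i++)
                isalpha_cache[i] = iswalpha(i) ? 1 : 0;
            cache_limit = new_limit;
        }
        return isalpha_cache[c] != 0;
    }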



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Heikki Linnakangas

On 16.02.2012 01:06, Tom Lane wrote:

In bug #6457 it's pointed out that we *still* don't have full
functionality for locale-dependent regexp behavior with UTF8 encoding.
The reason is that there's old crufty code in regc_locale.c that only
considers character codes up to 255 when searching for characters that
should be considered letters, digits, etc.  We could fix that, for
some value of fix, by iterating up to perhaps 0xFFFF when dealing with
UTF8 encoding, but the time that would take is unappealing.  Especially
so considering that this code is executed afresh anytime we compile a
regex that requires locale knowledge.

I looked into the upstream Tcl code and observed that they deal with
this by having hard-wired tables of which Unicode code points are to be
considered letters etc.  The tables are directly traceable to the
Unicode standard (they provide a script to regenerate them from files
available from unicode.org).  Nonetheless, I do not find that approach
appealing, mainly because we'd be risking deviating from the libc locale
code's behavior within regexes when we follow it everywhere else.
It seems entirely likely to me that a particular locale setting might
consider only some of what Unicode says are letters to be letters.

However, we could possibly compromise by using Unicode-derived tables
as a guide to which code points are worth probing libc for.  That is,
assume that a utf8-based locale will never claim that some code is a
letter that unicode.org doesn't think is a letter.  That would cut the
number of required probes by a pretty large factor.

The other thing that seems worth doing is to install some caching.
We could presumably assume that the behavior of iswupper() et al are
fixed for the duration of a database session, so that we only need to
run the probe loop once when first asked to create a cvec for a
particular category.

Thoughts, better ideas?


Here's a wild idea: keep the class of each codepoint in a hash table. 
Initialize it with all codepoints up to 0xFFFF. After that, whenever a 
string contains a character that's not in the hash table yet, query the 
class of that character, and add it to the hash table. Then recompile 
the whole regex and restart the matching engine.


Recompiling is expensive, but if you cache the results for the session, 
it would probably be acceptable.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
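
A compact sketch of that lazy lookup table (illustrative only; the class
bits, table size and names are invented, and the "recompile the regex when
a new code point shows up" part is left out entirely):

    #include <stdint.h>
    #include <wctype.h>

    #define CLASS_ALPHA 0x01
    #define CLASS_DIGIT 0x02
    #define TABSIZE     (1 << 17)       /* power of two, > 0x10000 entries */

    typedef struct
    {
        uint32_t    cp;
        uint8_t     cls;
        uint8_t     used;
    } class_entry;

    static class_entry class_tab[TABSIZE];

    static uint8_t
    probe_libc(uint32_t cp)
    {
        return (iswalpha(cp) ? CLASS_ALPHA : 0) |
               (iswdigit(cp) ? CLASS_DIGIT : 0);
    }

    static uint8_t
    lookup_class(uint32_t cp)
    {
        uint32_t    i = (cp * 2654435761u) & (TABSIZE - 1);

        while (class_tab[i].used)
        {
            if (class_tab[i].cp == cp)
                return class_tab[i].cls;    /* already known this session */
            i = (i + 1) & (TABSIZE - 1);    /* linear probing */
        }
        /* first time we see this code point: ask libc once and remember */
        class_tab[i].cp = cp;
        class_tab[i].cls = probe_libc(cp);
        class_tab[i].used = 1;
        return class_tab[i].cls;
    }

    static void
    prefill_bmp(void)
    {
        uint32_t    cp;

        for (cp = 0; cp <= 0xFFFF; cp++)
            (void) lookup_class(cp);        /* initialize all BMP entries */
    }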



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Here's a wild idea: keep the class of each codepoint in a hash table. 
 Initialize it with all codepoints up to 0xFFFF. After that, whenever a 
 string contains a character that's not in the hash table yet, query the 
 class of that character, and add it to the hash table. Then recompile 
 the whole regex and restart the matching engine.

 Recompiling is expensive, but if you cache the results for the session, 
 it would probably be acceptable.

Dunno ... recompiling is so expensive that I can't see this being a win;
not to mention that it would require fundamental surgery on the regex
code.

In the Tcl implementation, no codepoints above U+FFFF have any locale
properties (alpha/digit/punct/etc), period.  Personally I'd not have a
problem imposing the same limitation, so that dealing with stuff above
that range isn't really a consideration anyway.

regards, tom lane



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Andrew Dunstan



On 02/17/2012 09:39 AM, Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:

Here's a wild idea: keep the class of each codepoint in a hash table.
Initialize it with all codepoints up to 0xFFFF. After that, whenever a
string contains a character that's not in the hash table yet, query the
class of that character, and add it to the hash table. Then recompile
the whole regex and restart the matching engine.
Recompiling is expensive, but if you cache the results for the session,
it would probably be acceptable.

Dunno ... recompiling is so expensive that I can't see this being a win;
not to mention that it would require fundamental surgery on the regex
code.

In the Tcl implementation, no codepoints above U+FFFF have any locale
properties (alpha/digit/punct/etc), period.  Personally I'd not have a
problem imposing the same limitation, so that dealing with stuff above
that range isn't really a consideration anyway.



up to U+FFFF is the BMP which is described as containing characters for 
almost all modern languages, and a large number of special characters. 
It seems very likely to be acceptable not to bother about the locale of 
code points in the supplementary planes.


See http://en.wikipedia.org/wiki/Plane_%28Unicode%29 for descriptions 
of which sets of characters are involved.



cheers

andrew





Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Robert Haas
On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Here's a wild idea: keep the class of each codepoint in a hash table.
 Initialize it with all codepoints up to 0xFFFF. After that, whenever a
 string contains a character that's not in the hash table yet, query the
 class of that character, and add it to the hash table. Then recompile the
 whole regex and restart the matching engine.

 Recompiling is expensive, but if you cache the results for the session, it
 would probably be acceptable.

What if you did this ONCE and wrote the results to a file someplace?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Recompiling is expensive, but if you cache the results for the session, it
 would probably be acceptable.

 What if you did this ONCE and wrote the results to a file someplace?

That's still a cache, you've just defaulted on your obligation to think
about what conditions require the cache to be flushed.  (In the case at
hand, the trigger for a cache rebuild would probably need to be a glibc
package update, which we have no way of knowing about.)

Before going much further with this, we should probably do some timings
of 64K calls of iswupper and friends, just to see how bad a dumb
implementation will be.

regards, tom lane
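
A throwaway timing sketch for exactly that measurement (standalone, not
part of any patch): time 64K iswupper() probes under the current locale.

    #include <locale.h>
    #include <stdio.h>
    #include <time.h>
    #include <wctype.h>

    int
    main(void)
    {
        struct timespec start, end;
        volatile int    n = 0;
        wint_t          c;

        setlocale(LC_ALL, "");          /* use the environment's locale */
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (c = 0; c <= 0xFFFF; c++)
            if (iswupper(c))
                n++;
        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("%d uppercase code points, %.3f ms\n", n,
               (end.tv_sec - start.tv_sec) * 1000.0 +
               (end.tv_nsec - start.tv_nsec) / 1e6);
        return 0;
    }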



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Robert Haas
On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What if you did this ONCE and wrote the results to a file someplace?

 That's still a cache, you've just defaulted on your obligation to think
 about what conditions require the cache to be flushed.

Yep.  Unfortunately, I don't have a good idea how to handle that; I
was hoping someone else did.

 Before going much further with this, we should probably do some timings
 of 64K calls of iswupper and friends, just to see how bad a dumb
 implementation will be.

Can't hurt.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Before going much further with this, we should probably do some timings
 of 64K calls of iswupper and friends, just to see how bad a dumb
 implementation will be.

 Can't hurt.

The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503)
running Fedora 16 in en_US.utf8 locale, is that 64K iterations of
pg_wc_isalpha or sibling functions requires a shade under 2ms.
So this definitely justifies caching the values to avoid computing
them more than once per session, but I'm not convinced there are
grounds for trying harder than that.

BTW, I am also a bit surprised to find out that this locale considers
48342 of those characters to satisfy isalpha().  Seems like a heck of
a lot.  But anyway we can forget my idea of trying to save work by
incorporating a-priori assumptions about which Unicode codepoints are
which --- it'll be faster to just iterate through them all, at least
for that case.  Maybe we should hard-wire some cases like digits, not
sure.

regards, tom lane
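
A toy sketch of the per-session caching being justified here (simplified;
names and structure are illustrative, not the pg_ctype_get_cache() used in
the draft patch below): the first request for a class runs the 64K probe
loop once and keeps the matching code points for reuse.

    #include <stdlib.h>
    #include <wctype.h>

    typedef int (*probe_func) (wint_t);

    typedef struct ctype_cache
    {
        probe_func  probe;
        wint_t     *chrs;
        int         nchrs;
        struct ctype_cache *next;
    } ctype_cache;

    static ctype_cache *cache_list = NULL;

    static const ctype_cache *
    get_ctype_cache(probe_func probe)
    {
        ctype_cache *cc;
        wint_t      c;

        for (cc = cache_list; cc; cc = cc->next)
            if (cc->probe == probe)
                return cc;              /* already probed this session */

        cc = malloc(sizeof(ctype_cache));
        if (cc == NULL)
            return NULL;                /* error handling kept trivial */
        cc->probe = probe;
        cc->chrs = malloc(sizeof(wint_t) * 0x10000);
        if (cc->chrs == NULL)
        {
            free(cc);
            return NULL;
        }
        cc->nchrs = 0;
        for (c = 0; c <= 0xFFFF; c++)
            if (probe(c))
                cc->chrs[cc->nchrs++] = c;
        cc->next = cache_list;
        cache_list = cc;
        return cc;
    }

Usage would be e.g. get_ctype_cache(iswupper), since iswupper() has the
matching int(wint_t) signature.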



Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
I wrote:
 The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503)
 running Fedora 16 in en_US.utf8 locale, is that 64K iterations of
 pg_wc_isalpha or sibling functions requires a shade under 2ms.
 So this definitely justifies caching the values to avoid computing
 them more than once per session, but I'm not convinced there are
 grounds for trying harder than that.

And here's a poorly-tested draft patch for that.

regards, tom lane


diff --git a/src/backend/regex/regc_cvec.c b/src/backend/regex/regc_cvec.c
index fb6f06b5243f50bfad2cefa5c016d4e842791a3d..98f3c597678b492dd59afcd956e5cdfecdba4f86 100644
*** a/src/backend/regex/regc_cvec.c
--- b/src/backend/regex/regc_cvec.c
*** static void
*** 77,82 
--- 77,83 
  addchr(struct cvec * cv,		/* character vector */
  	   chr c)	/* character to add */
  {
+ 	assert(cv->nchrs < cv->chrspace);
 	cv->chrs[cv->nchrs++] = (chr) c;
  }
  
diff --git a/src/backend/regex/regc_locale.c b/src/backend/regex/regc_locale.c
index 6cf27958b1545a61fba01e76dc4d37aca32789dc..44ce582bdad1a7d830d4122cada45a39c188981c 100644
*** a/src/backend/regex/regc_locale.c
--- b/src/backend/regex/regc_locale.c
*** static const struct cname
*** 351,356 
--- 351,366 
  
  
  /*
+  * We do not use the hard-wired Unicode classification tables that Tcl does.
+  * This is because (a) we need to deal with other encodings besides Unicode,
+  * and (b) we want to track the behavior of the libc locale routines as
+  * closely as possible.  For example, it wouldn't be unreasonable for a
+  * locale to not consider every Unicode letter as a letter.  So we build
+  * character classification cvecs by asking libc, even for Unicode.
+  */
+ 
+ 
+ /*
   * element - map collating-element name to celt
   */
  static celt
*** cclass(struct vars * v,			/* context */
*** 498,503 
--- 508,514 
  	   int cases)/* case-independent? */
  {
  	size_t		len;
+ 	const struct cvec *ccv = NULL;
  	struct cvec *cv = NULL;
  	const char * const *namePtr;
  	int			i,
*** cclass(struct vars * v,			/* context */
*** 549,626 
  
  	/*
  	 * Now compute the character class contents.
- 	 *
- 	 * For the moment, assume that only char codes < 256 can be in these
- 	 * classes.
  	 */
  
  	switch ((enum classes) index)
  	{
  		case CC_PRINT:
! 			cv = getcvec(v, UCHAR_MAX, 0);
! 			if (cv)
! 			{
! for (i = 0; i <= UCHAR_MAX; i++)
! {
! 	if (pg_wc_isprint((chr) i))
! 		addchr(cv, (chr) i);
! }
! 			}
  			break;
  		case CC_ALNUM:
! 			cv = getcvec(v, UCHAR_MAX, 0);
! 			if (cv)
! 			{
! for (i = 0; i <= UCHAR_MAX; i++)
! {
! 	if (pg_wc_isalnum((chr) i))
! 		addchr(cv, (chr) i);
! }
! 			}
  			break;
  		case CC_ALPHA:
! 			cv = getcvec(v, UCHAR_MAX, 0);
! 			if (cv)
! 			{
! for (i = 0; i <= UCHAR_MAX; i++)
! {
! 	if (pg_wc_isalpha((chr) i))
! 		addchr(cv, (chr) i);
! }
! 			}
  			break;
  		case CC_ASCII:
  			cv = getcvec(v, 0, 1);
  			if (cv)
  addrange(cv, 0, 0x7f);
  			break;
  		case CC_BLANK:
  			cv = getcvec(v, 2, 0);
  			addchr(cv, '\t');
  			addchr(cv, ' ');
  			break;
  		case CC_CNTRL:
  			cv = getcvec(v, 0, 2);
  			addrange(cv, 0x0, 0x1f);
  			addrange(cv, 0x7f, 0x9f);
  			break;
  		case CC_DIGIT:
! 			cv = getcvec(v, 0, 1);
! 			if (cv)
! addrange(cv, (chr) '0', (chr) '9');
  			break;
  		case CC_PUNCT:
! 			cv = getcvec(v, UCHAR_MAX, 0);
! 			if (cv)
! 			{
! for (i = 0; i <= UCHAR_MAX; i++)
! {
! 	if (pg_wc_ispunct((chr) i))
! 		addchr(cv, (chr) i);
! }
! 			}
  			break;
  		case CC_XDIGIT:
  			cv = getcvec(v, 0, 3);
  			if (cv)
  			{
--- 560,608 
  
  	/*
  	 * Now compute the character class contents.
  	 */
  
  	switch ((enum classes) index)
  	{
  		case CC_PRINT:
! 			ccv = pg_ctype_get_cache(pg_wc_isprint);
  			break;
  		case CC_ALNUM:
! 			ccv = pg_ctype_get_cache(pg_wc_isalnum);
  			break;
  		case CC_ALPHA:
! 			ccv = pg_ctype_get_cache(pg_wc_isalpha);
  			break;
  		case CC_ASCII:
+ 			/* hard-wired meaning */
  			cv = getcvec(v, 0, 1);
  			if (cv)
  addrange(cv, 0, 0x7f);
  			break;
  		case CC_BLANK:
+ 			/* hard-wired meaning */
  			cv = getcvec(v, 2, 0);
  			addchr(cv, '\t');
  			addchr(cv, ' ');
  			break;
  		case CC_CNTRL:
+ 			/* hard-wired meaning */
  			cv = getcvec(v, 0, 2);
  			addrange(cv, 0x0, 0x1f);
  			addrange(cv, 0x7f, 0x9f);
  			break;
  		case CC_DIGIT:
! 			ccv = pg_ctype_get_cache(pg_wc_isdigit);
  			break;
  		case CC_PUNCT:
! 			ccv = pg_ctype_get_cache(pg_wc_ispunct);
  			break;
  		case CC_XDIGIT:
+ 			/*
+ 			 * It's not clear how to define this in non-western locales, and
+ 			 * even less clear that there's any particular use in trying.
+ 			 * So just hard-wire the meaning.
+ 			 */
  			cv = getcvec(v, 0, 3);
  			if (cv)
  			{
*** cclass(struct vars * v,			/* context */
*** 630,679 
  			}
  			break;
  		case