Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-23 Thread Peter Eisentraut
On fre, 2012-02-17 at 10:19 -0500, Tom Lane wrote: What if you did this ONCE and wrote the results to a file someplace? That's still a cache, you've just defaulted on your obligation to think about what conditions require the cache to be flushed. (In the case at hand, the trigger for a

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-23 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes: On fre, 2012-02-17 at 10:19 -0500, Tom Lane wrote: That's still a cache, you've just defaulted on your obligation to think about what conditions require the cache to be flushed. (In the case at hand, the trigger for a cache rebuild would probably need

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread NISHIYAMA Tomoaki
I don't believe it is valid to ignore CJK characters above U+2. If it is used for names, it will be stored in the database. If the behaviour is different from characters below U+, you will get a bug report in meanwhile. see CJK Extension B, C, and D from http://www.unicode.org/charts/

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
NISHIYAMA Tomoaki tomoa...@staff.kanazawa-u.ac.jp writes: I don't believe it is valid to ignore CJK characters above U+2. If it is used for names, it will be stored in the database. If the behaviour is different from characters below U+, you will get a bug report in meanwhile. I am

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes: Yeah, it's conceivable that we could implement something whereby characters with codes above some cutoff point are handled via runtime calls to iswalpha() and friends, rather than being included in the statically-constructed DFA maps. The cutoff point could
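
The cutoff idea quoted above could be sketched roughly as follows. This is only an illustration, not the code from the thread: the names `CLASS_CUTOFF`, `alpha_map` and `cp_is_alpha` are hypothetical, the cutoff value is an arbitrary placeholder (the snippet is cut off before it names one), and the sketch assumes a platform such as glibc where `wint_t` can hold any Unicode code point. Results depend on the active LC_CTYPE locale, so a real caller would call setlocale() first.

```c
#include <assert.h>
#include <stdbool.h>
#include <wctype.h>

/* Hypothetical cutoff: codes below it are precomputed, codes at or above it
 * are classified by a runtime call.  The value is an arbitrary placeholder. */
#define CLASS_CUTOFF 0x800u

static bool alpha_map[CLASS_CUTOFF];
static bool alpha_map_ready = false;

/* Build the static map once by asking the C library about every low code. */
static void
build_alpha_map(void)
{
    for (unsigned c = 0; c < CLASS_CUTOFF; c++)
        alpha_map[c] = (iswalpha((wint_t) c) != 0);
    alpha_map_ready = true;
}

/* Classify one code point: table lookup below the cutoff, iswalpha() above. */
static bool
cp_is_alpha(unsigned cp)
{
    if (!alpha_map_ready)
        build_alpha_map();
    assert(alpha_map_ready);

    if (cp < CLASS_CUTOFF)
        return alpha_map[cp];
    return iswalpha((wint_t) cp) != 0;
}
```

The trade-off is the usual one: a higher cutoff means a larger precomputed map and more setup work, while a lower one pushes more characters onto the slower per-character library call.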

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
Dimitri Fontaine dimi...@2ndquadrant.fr writes: Tom Lane t...@sss.pgh.pa.us writes: Yeah, it's conceivable that we could implement something whereby characters with codes above some cutoff point are handled via runtime calls to iswalpha() and friends, rather than being included in the

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
I wrote: And here's a poorly-tested draft patch for that. I've done some more testing now, and am satisfied that this works as intended. However, some crude performance testing suggests that people might be annoyed with it. As an example, in 9.1 with pl_PL.utf8 locale, I see this:

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Robert Haas
On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Yeah, it's conceivable that we could implement something whereby characters with codes above some cutoff point are handled via runtime calls to iswalpha() and friends, rather than being included in the statically-constructed

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Vik Reykja
On Sun, Feb 19, 2012 at 04:33, Robert Haas robertmh...@gmail.com wrote: On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Yeah, it's conceivable that we could implement something whereby characters with codes above some cutoff point are handled via runtime calls to

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Robert Haas
On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja vikrey...@gmail.com wrote: Does it make sense for regexps to have collations? As I understand it, collations determine the sort-ordering of strings. Regular expressions don't care about that. Why do you ask? -- Robert Haas EnterpriseDB:

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Vik Reykja
On Sun, Feb 19, 2012 at 05:03, Robert Haas robertmh...@gmail.com wrote: On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja vikrey...@gmail.com wrote: Does it make sense for regexps to have collations? As I understand it, collations determine the sort-ordering of strings. Regular expressions

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes: In theory you can imagine a regular expression engine where these decisions can be postponed until we see the string we're matching against. IOW, your DFA ends up with state transitions for characters specifically named, plus a state transition for

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
Vik Reykja vikrey...@gmail.com writes: On Sun, Feb 19, 2012 at 05:03, Robert Haas robertmh...@gmail.com wrote: On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja vikrey...@gmail.com wrote: Does it make sense for regexps to have collations? As I understand it, collations determine the sort-ordering

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Robert Haas
On Sat, Feb 18, 2012 at 11:16 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: In theory you can imagine a regular expression engine where these decisions can be postponed until we see the string we're matching against. IOW, your DFA ends up with state

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Heikki Linnakangas
On 16.02.2012 01:06, Tom Lane wrote: In bug #6457 it's pointed out that we *still* don't have full functionality for locale-dependent regexp behavior with UTF8 encoding. The reason is that there's old crufty code in regc_locale.c that only considers character codes up to 255 when searching for

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Here's a wild idea: keep the class of each codepoint in a hash table. Initialize it with all codepoints up to 0x. After that, whenever a string contains a character that's not in the hash table yet, query the class of that
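
A minimal sketch of the hash-table idea quoted above, under stated assumptions: every name here (`ClassEntry`, `codepoint_class`, `query_locale`, the `CLS_*` bits, `TABLE_SIZE`) is hypothetical, the table is a fixed-size open-addressing table chosen for brevity, and the pre-seeding step mentioned in the snippet is left out so that entries are filled purely on demand. It also ignores the question raised elsewhere in the thread of when such a cache must be flushed (for example, if the locale changes).

```c
#include <assert.h>
#include <stdint.h>
#include <wctype.h>

/* Hypothetical class bits; CLS_KNOWN marks an occupied slot. */
#define CLS_ALPHA 0x01
#define CLS_UPPER 0x02
#define CLS_LOWER 0x04
#define CLS_DIGIT 0x08
#define CLS_SPACE 0x10
#define CLS_KNOWN 0x80

#define TABLE_SIZE 4096u        /* power of two, fixed for brevity */

typedef struct
{
    uint32_t cp;                /* code point stored in this slot */
    uint8_t  cls;               /* CLS_* bits, 0 if the slot is empty */
} ClassEntry;

static ClassEntry table[TABLE_SIZE];
static unsigned table_used = 0;
static unsigned long locale_queries = 0;    /* counts calls into the C library */

/* Ask the C library for the class of one code point (the slow path). */
static uint8_t
query_locale(uint32_t cp)
{
    uint8_t cls = CLS_KNOWN;

    locale_queries++;
    if (iswalpha((wint_t) cp)) cls |= CLS_ALPHA;
    if (iswupper((wint_t) cp)) cls |= CLS_UPPER;
    if (iswlower((wint_t) cp)) cls |= CLS_LOWER;
    if (iswdigit((wint_t) cp)) cls |= CLS_DIGIT;
    if (iswspace((wint_t) cp)) cls |= CLS_SPACE;
    return cls;
}

/* Return the cached class of cp, querying the locale only on a miss. */
static uint8_t
codepoint_class(uint32_t cp)
{
    unsigned i = (cp * 2654435761u) & (TABLE_SIZE - 1);

    /* The table never exceeds half full, so probing always finds a hole. */
    assert(table_used <= TABLE_SIZE / 2);
    for (;;)
    {
        if (!(table[i].cls & CLS_KNOWN))
        {
            if (table_used >= TABLE_SIZE / 2)
                return query_locale(cp);    /* full: answer without caching */
            table[i].cp = cp;
            table[i].cls = query_locale(cp);
            table_used++;
            return table[i].cls;
        }
        if (table[i].cp == cp)
            return table[i].cls;
        i = (i + 1) & (TABLE_SIZE - 1);     /* linear probing */
    }
}
```

A second lookup of the same code point is answered from the table, which is the whole point: the locale functions are consulted at most once per distinct character seen.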

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Andrew Dunstan
On 02/17/2012 09:39 AM, Tom Lane wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Here's a wild idea: keep the class of each codepoint in a hash table. Initialize it with all codepoints up to 0x. After that, whenever a string contains a character that's not in the

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Robert Haas
On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Here's a wild idea: keep the class of each codepoint in a hash table. Initialize it with all codepoints up to 0x. After that, whenever a string contains a character that's not in the hash table

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes: On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Recompiling is expensive, but if you cache the results for the session, it would probably be acceptable. What if you did this ONCE and wrote the results to

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Robert Haas
On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane t...@sss.pgh.pa.us wrote: What if you did this ONCE and wrote the results to a file someplace? That's still a cache, you've just defaulted on your obligation to think about what conditions require the cache to be flushed. Yep. Unfortunately, I don't

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes: On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane t...@sss.pgh.pa.us wrote: Before going much further with this, we should probably do some timings of 64K calls of iswupper and friends, just to see how bad a dumb implementation will be. Can't hurt. The

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
I wrote: The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503) running Fedora 16 in en_US.utf8 locale, is that 64K iterations of pg_wc_isalpha or sibling functions requires a shade under 2ms. So this definitely justifies caching the values to avoid computing them more than once
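
For anyone wanting to repeat a measurement of this kind, a rough harness might look like the sketch below. It is an assumption-laden stand-in, not the test that was actually run: it times the standard `iswalpha()` rather than PostgreSQL's internal `pg_wc_isalpha`, it uses `clock()` (CPU time, coarse resolution), and the names `time_alpha_sweep` and `last_alpha_count` are hypothetical. The numbers will vary with machine, C library and locale, so call setlocale() first to measure a particular locale.

```c
#include <assert.h>
#include <time.h>
#include <wctype.h>

#define SWEEP_SIZE 65536u       /* the "64K" character codes discussed above */

/* Number of codes the last sweep classified as alphabetic. */
static unsigned last_alpha_count = 0;

/* Classify every code in [0, SWEEP_SIZE) once; return elapsed CPU seconds. */
static double
time_alpha_sweep(void)
{
    unsigned found = 0;
    clock_t  start = clock();
    clock_t  stop;

    assert(start != (clock_t) -1);      /* clock() must be available */
    for (unsigned c = 0; c < SWEEP_SIZE; c++)
    {
        if (iswalpha((wint_t) c))
            found++;
    }
    stop = clock();

    /* Storing the count keeps the loop from being optimized away. */
    last_alpha_count = found;
    return (double) (stop - start) / CLOCKS_PER_SEC;
}
```

Because a single sweep is short, running it many times and averaging gives a steadier figure than one call.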

[HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-15 Thread Tom Lane
In bug #6457 it's pointed out that we *still* don't have full functionality for locale-dependent regexp behavior with UTF8 encoding. The reason is that there's old crufty code in regc_locale.c that only considers character codes up to 255 when searching for characters that should be considered