Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-23 Thread Tom Lane
Peter Eisentraut writes: > On fre, 2012-02-17 at 10:19 -0500, Tom Lane wrote: >> That's still a cache, you've just defaulted on your obligation to think >> about what conditions require the cache to be flushed. (In the case at >> hand, the trigger for a cache rebuild would probably need to be a g

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-23 Thread Peter Eisentraut
On fre, 2012-02-17 at 10:19 -0500, Tom Lane wrote: > > What if you did this ONCE and wrote the results to a file someplace? > > That's still a cache, you've just defaulted on your obligation to think > about what conditions require the cache to be flushed. (In the case at > hand, the trigger for

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Robert Haas
On Sat, Feb 18, 2012 at 11:16 PM, Tom Lane wrote: > Robert Haas writes: >> In theory you can imagine a regular expression engine where these >> decisions can be postponed until we see the string we're matching >> against.  IOW, your DFA ends up with state transitions for characters >> specificall

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
Vik Reykja writes: > On Sun, Feb 19, 2012 at 05:03, Robert Haas wrote: >> On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja wrote: >>> Does it make sense for regexps to have collations? >> As I understand it, collations determine the sort-ordering of strings. >> Regular expressions don't care about

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
Robert Haas writes: > In theory you can imagine a regular expression engine where these > decisions can be postponed until we see the string we're matching > against. IOW, your DFA ends up with state transitions for characters > specifically named, plus a state transition for "anything else that'

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Vik Reykja
On Sun, Feb 19, 2012 at 05:03, Robert Haas wrote: > On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja wrote: > > Does it make sense for regexps to have collations? > > As I understand it, collations determine the sort-ordering of strings. > Regular expressions don't care about that. Why do you ask?

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Robert Haas
On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja wrote: > Does it make sense for regexps to have collations? As I understand it, collations determine the sort-ordering of strings. Regular expressions don't care about that. Why do you ask?

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Vik Reykja
On Sun, Feb 19, 2012 at 04:33, Robert Haas wrote: > On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane wrote: > >> Yeah, it's conceivable that we could implement something whereby > >> characters with codes above some cutoff point are handled via runtime > >> calls to iswalpha() and friends, rather than

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Robert Haas
On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane wrote: >> Yeah, it's conceivable that we could implement something whereby >> characters with codes above some cutoff point are handled via runtime >> calls to iswalpha() and friends, rather than being included in the >> statically-constructed DFA maps.  T

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
I wrote: > And here's a poorly-tested draft patch for that. I've done some more testing now, and am satisfied that this works as intended. However, some crude performance testing suggests that people might be annoyed with it. As an example, in 9.1 with pl_PL.utf8 locale, I see this: sele

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
Dimitri Fontaine writes: > Tom Lane writes: >> Yeah, it's conceivable that we could implement something whereby >> characters with codes above some cutoff point are handled via runtime >> calls to iswalpha() and friends, rather than being included in the >> statically-constructed DFA maps. The c

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Dimitri Fontaine
Tom Lane writes: > Yeah, it's conceivable that we could implement something whereby > characters with codes above some cutoff point are handled via runtime > calls to iswalpha() and friends, rather than being included in the > statically-constructed DFA maps. The cutoff point could likely be a lo
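The cutoff idea quoted above can be illustrated with a minimal sketch. This is not the patch discussed in the thread: it is written in Python rather than C, `str.isalpha()` stands in for `iswalpha()`, and the names `CUTOFF`, `STATIC_ALPHA`, and `is_alpha` as well as the cutoff value are assumptions made up for the example.

```python
# Sketch only: a precomputed map below a cutoff, a runtime call above it.
# CUTOFF is an assumed value for illustration; the thread's value is not shown here.
CUTOFF = 0x800

# Built once up front, loosely analogous to a statically constructed map.
STATIC_ALPHA = frozenset(cp for cp in range(CUTOFF) if chr(cp).isalpha())


def is_alpha(ch):
    """Return True if the single character ch is classified as a letter."""
    cp = ord(ch)
    if cp < CUTOFF:
        # Low codepoints: answered from the precomputed set.
        return cp in STATIC_ALPHA
    # High codepoints: classified at match time (stand-in for iswalpha()).
    return ch.isalpha()
```

With this split, the up-front cost is bounded by the cutoff, while characters above it are still classified correctly at the price of one call per lookup.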

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread Tom Lane
NISHIYAMA Tomoaki writes: > I don't believe it is valid to ignore CJK characters above U+2. > If it is used for names, it will be stored in the database. > If the behaviour is different from characters below U+, you will > get a bug report in meanwhile. I am skeptical that there is enough

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-18 Thread NISHIYAMA Tomoaki
I don't believe it is valid to ignore CJK characters above U+2. If it is used for names, it will be stored in the database. If the behaviour is different from characters below U+, you will get a bug report in meanwhile. see CJK Extension B, C, and D from http://www.unicode.org/charts/ Al

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
I wrote: > The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503) > running Fedora 16 in en_US.utf8 locale, is that 64K iterations of > pg_wc_isalpha or sibling functions requires a shade under 2ms. > So this definitely justifies caching the values to avoid computing > them more than o

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
Robert Haas writes: > On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane wrote: >> Before going much further with this, we should probably do some timings >> of 64K calls of iswupper and friends, just to see how bad a dumb >> implementation will be. > Can't hurt. The answer, on a reasonably new desktop

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Robert Haas
On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane wrote: >> What if you did this ONCE and wrote the results to a file someplace? > > That's still a cache, you've just defaulted on your obligation to think > about what conditions require the cache to be flushed. Yep. Unfortunately, I don't have a good i

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
Robert Haas writes: > On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas > wrote: >> Recompiling is expensive, but if you cache the results for the session, it >> would probably be acceptable. > What if you did this ONCE and wrote the results to a file someplace? That's still a cache, you've j

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Robert Haas
On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas wrote: > Here's a wild idea: keep the class of each codepoint in a hash table. > Initialize it with all codepoints up to 0x. After that, whenever a > string contains a character that's not in the hash table yet, query the > class of that char

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Andrew Dunstan
On 02/17/2012 09:39 AM, Tom Lane wrote: Heikki Linnakangas writes: Here's a wild idea: keep the class of each codepoint in a hash table. Initialize it with all codepoints up to 0x. After that, whenever a string contains a character that's not in the hash table yet, query the class of that

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Tom Lane
Heikki Linnakangas writes: > Here's a wild idea: keep the class of each codepoint in a hash table. > Initialize it with all codepoints up to 0x. After that, whenever a > string contains a character that's not in the hash table yet, query the > class of that character, and add it to the hash
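As a rough illustration of the hash-table idea quoted above, the following sketch preloads a table of character classes and adds entries on first use. It is an assumption-laden toy, not code from the thread: it uses Python's `str` methods in place of the C library's classification functions, and `PRELOAD_LIMIT`, `classify`, and `ClassCache` are hypothetical names and values.

```python
# Sketch only: per-codepoint class table, preloaded and then filled lazily.
# Python's str methods stand in for iswalpha() and friends.
CLASSIFIERS = {
    "alpha": str.isalpha,
    "digit": str.isdigit,
    "upper": str.isupper,
    "lower": str.islower,
    "space": str.isspace,
}

# Assumed preload limit for illustration; the limit named in the thread is not shown here.
PRELOAD_LIMIT = 0x100


def classify(codepoint):
    """Compute the set of class names that apply to one codepoint."""
    ch = chr(codepoint)
    return frozenset(name for name, test in CLASSIFIERS.items() if test(ch))


class ClassCache:
    def __init__(self, preload_limit=PRELOAD_LIMIT):
        # Preload every codepoint below the limit.
        self.table = {cp: classify(cp) for cp in range(preload_limit)}
        self.misses = 0

    def classes_of(self, ch):
        """Look up the classes of ch, adding it to the table on first sight."""
        cp = ord(ch)
        try:
            return self.table[cp]
        except KeyError:
            self.misses += 1
            result = self.table[cp] = classify(cp)
            return result
```

A lookup such as `ClassCache().classes_of("ł")` misses once and is served from the table afterwards. The sketch does not address when such a table would have to be invalidated.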

Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-17 Thread Heikki Linnakangas
On 16.02.2012 01:06, Tom Lane wrote: In bug #6457 it's pointed out that we *still* don't have full functionality for locale-dependent regexp behavior with UTF8 encoding. The reason is that there's old crufty code in regc_locale.c that only considers character codes up to 255 when searching for ch

[HACKERS] Notes about fixing regexes and UTF-8 (yet again)

2012-02-15 Thread Tom Lane
In bug #6457 it's pointed out that we *still* don't have full functionality for locale-dependent regexp behavior with UTF8 encoding. The reason is that there's old crufty code in regc_locale.c that only considers character codes up to 255 when searching for characters that should be considered "let