Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
On Fri, 2012-02-17 at 10:19 -0500, Tom Lane wrote:
>> What if you did this ONCE and wrote the results to a file someplace?
>
> That's still a cache, you've just defaulted on your obligation to think
> about what conditions require the cache to be flushed.  (In the case at
> hand, the trigger for a cache rebuild would probably need to be a glibc
> package update, which we have no way of knowing about.)

We basically hardwire locale behavior at initdb time, so computing this
then and storing it somewhere for eternity would be consistent.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
Peter Eisentraut <pete...@gmx.net> writes:
> On Fri, 2012-02-17 at 10:19 -0500, Tom Lane wrote:
>> That's still a cache, you've just defaulted on your obligation to think
>> about what conditions require the cache to be flushed.  (In the case at
>> hand, the trigger for a cache rebuild would probably need to be a glibc
>> package update, which we have no way of knowing about.)

> We basically hardwire locale behavior at initdb time, so computing this
> then and storing it somewhere for eternity would be consistent.

Well, only if we could cache every locale-related libc inquiry we ever
make.  Locking down only part of the behavior does not sound like a plan.

			regards, tom lane
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
I don't believe it is valid to ignore CJK characters above U+20000.
If they are used for names, they will be stored in the database.  If the
behaviour is different from characters below U+FFFF, you will get a bug
report sooner or later.  See CJK Extensions B, C, and D at
http://www.unicode.org/charts/

Also, there are some code points that could be regarded as letters and
numbers: http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols

On the other hand, it is ok if processing of characters above U+10000
is very slow, as long as they are properly processed, because they are
considered rare.

On 2012/02/17, at 23:56, Andrew Dunstan wrote:
> On 02/17/2012 09:39 AM, Tom Lane wrote:
>> Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> writes:
>>> Here's a wild idea: keep the class of each codepoint in a hash table.
>>> Initialize it with all codepoints up to 0xFFFF.  After that, whenever a
>>> string contains a character that's not in the hash table yet, query the
>>> class of that character, and add it to the hash table.  Then recompile
>>> the whole regex and restart the matching engine.  Recompiling is
>>> expensive, but if you cache the results for the session, it would
>>> probably be acceptable.
>>
>> Dunno ... recompiling is so expensive that I can't see this being a win;
>> not to mention that it would require fundamental surgery on the regex
>> code.  In the Tcl implementation, no codepoints above U+FFFF have any
>> locale properties (alpha/digit/punct/etc), period.  Personally I'd not
>> have a problem imposing the same limitation, so that dealing with stuff
>> above that range isn't really a consideration anyway.
>
> Up to U+FFFF is the BMP, which is described as containing characters for
> almost all modern languages and a large number of special characters.
> It seems very likely to be acceptable not to bother about the locale of
> code points in the supplementary planes.  See
> http://en.wikipedia.org/wiki/Plane_%28Unicode%29 for descriptions of
> which sets of characters are involved.
>
> cheers
>
> andrew
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
NISHIYAMA Tomoaki <tomoa...@staff.kanazawa-u.ac.jp> writes:
> I don't believe it is valid to ignore CJK characters above U+20000.
> If they are used for names, they will be stored in the database.  If the
> behaviour is different from characters below U+FFFF, you will get a bug
> report sooner or later.

I am skeptical that there is enough usage of such things to justify
slowing regexp operations down for everybody.  Note that it's not only
the initial probe of libc behavior that's at stake here --- the more
character codes are treated as letters, the larger the DFA transition
maps get and the more time it takes to build them.  So I'm unexcited
about just cranking up the loop limit in pg_ctype_get_cache.

> On the other hand, it is ok if processing of characters above U+10000
> is very slow, as long as they are properly processed, because they are
> considered rare.

Yeah, it's conceivable that we could implement something whereby
characters with codes above some cutoff point are handled via runtime
calls to iswalpha() and friends, rather than being included in the
statically-constructed DFA maps.  The cutoff point could likely be a lot
less than U+FFFF, too, thereby saving storage and map build time all
round.

However, that "we" above is the editorial "we".  *I* am not going to do
this.  Somebody who actually has a need for it should step up.

			regards, tom lane
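The hybrid scheme sketched above (a precomputed classification map below a cutoff, runtime libc calls above it) can be illustrated in miniature. This is a toy sketch, not PostgreSQL code: the cutoff value and the `is_alpha`/`build_alpha_map` names are invented for illustration, and results above the cutoff depend on the current locale.

```c
#include <assert.h>
#include <stdbool.h>
#include <wctype.h>

/* Hypothetical cutoff: code points below it are answered from a
 * precomputed bitmap; higher ones fall back to a runtime libc call. */
#define CLASS_CUTOFF 0x800

static unsigned char alpha_map[CLASS_CUTOFF / 8];
static bool map_built = false;

/* Probe libc once for every code point below the cutoff. */
static void build_alpha_map(void)
{
    for (unsigned c = 0; c < CLASS_CUTOFF; c++)
        if (iswalpha((wint_t) c))
            alpha_map[c / 8] |= (unsigned char) (1u << (c % 8));
    map_built = true;
}

/* Bitmap lookup below the cutoff, iswalpha() call above it. */
static bool is_alpha(unsigned c)
{
    if (!map_built)
        build_alpha_map();
    if (c < CLASS_CUTOFF)
        return (alpha_map[c / 8] >> (c % 8)) & 1;
    return iswalpha((wint_t) c) != 0;
}
```

The point of the split is that the expensive per-code-point probing is bounded by the cutoff, while correctness for rare high code points is preserved at the cost of a libc call per character.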
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
Tom Lane <t...@sss.pgh.pa.us> writes:
> Yeah, it's conceivable that we could implement something whereby
> characters with codes above some cutoff point are handled via runtime
> calls to iswalpha() and friends, rather than being included in the
> statically-constructed DFA maps.  The cutoff point could likely be a lot
> less than U+FFFF, too, thereby saving storage and map build time all
> round.

It's been proposed to build a “regexp” type in PostgreSQL which would
store the DFA directly and provide some way to run that DFA out of its
“storage” without recompiling.

Would such a mechanism be useful here?  Would it be useful only when
storing the regexp in a column somewhere then applying it in the query
from there (so most probably adding a join or subquery somewhere)?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
Dimitri Fontaine <dimi...@2ndquadrant.fr> writes:
> Tom Lane <t...@sss.pgh.pa.us> writes:
>> Yeah, it's conceivable that we could implement something whereby
>> characters with codes above some cutoff point are handled via runtime
>> calls to iswalpha() and friends, rather than being included in the
>> statically-constructed DFA maps.  The cutoff point could likely be a lot
>> less than U+FFFF, too, thereby saving storage and map build time all
>> round.

> It's been proposed to build a “regexp” type in PostgreSQL which would
> store the DFA directly and provide some way to run that DFA out of its
> “storage” without recompiling.  Would such a mechanism be useful here?

No, this is about what goes into the DFA representation in the first
place, not about how we store it and reuse it.

			regards, tom lane
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
I wrote:
> And here's a poorly-tested draft patch for that.

I've done some more testing now, and am satisfied that this works as
intended.  However, some crude performance testing suggests that people
might be annoyed with it.  As an example, in 9.1 with pl_PL.utf8 locale,
I see this:

	select 'aa' ~ '\w\w\w\w\w\w\w\w\w\w\w';

taking perhaps 0.75 ms on first execution and 0.4 ms on subsequent
executions, the difference being the time needed to compile and cache
the DFA representation of the regexp.  With the patch, the numbers are
more like 5 ms and 0.4 ms, meaning the compilation time has gone up by
something near a factor of 10, though AFAICT execution time hasn't
moved.  It's hard to tell how significant that would be to real-world
queries, but in the worst case where our caching of regexps doesn't
help much, it could be disastrous.  All of the extra time is in
manipulation of the much larger number of DFA arcs required to
represent all the additional character codes that are being considered
to be letters.

Perhaps I'm being overly ASCII-centric, but I'm afraid to commit this
as-is; I think the number of people who are hurt by the performance
degradation will be greatly larger than the number who are glad because
characters in $random_alphabet are now seen to be letters.  I think an
actually workable solution will require something like what I
speculated about earlier:

> Yeah, it's conceivable that we could implement something whereby
> characters with codes above some cutoff point are handled via runtime
> calls to iswalpha() and friends, rather than being included in the
> statically-constructed DFA maps.  The cutoff point could likely be a lot
> less than U+FFFF, too, thereby saving storage and map build time all
> round.

In the meantime, I still think the caching logic is worth having, and
we could at least make some people happy if we selected a cutoff point
somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
what a good compromise cutoff would be.  One possibility is U+7FF,
which corresponds to the limit of what fits in 2-byte UTF8; but I don't
know if that corresponds to any significant dropoff in frequency of
usage.

			regards, tom lane
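The U+7FF boundary mentioned above falls straight out of the UTF-8 encoding rules. A small helper (the name `utf8_len` is invented here) makes the byte-length breakpoints explicit:

```c
#include <assert.h>

/* Number of bytes UTF-8 needs to encode a code point, or 0 for values
 * beyond the Unicode range.  U+07FF is the last code point that fits
 * in two bytes, and U+FFFF (the end of the BMP) the last in three. */
static int utf8_len(unsigned cp)
{
    if (cp <= 0x7F)
        return 1;               /* ASCII */
    if (cp <= 0x7FF)
        return 2;
    if (cp <= 0xFFFF)
        return 3;               /* rest of the Basic Multilingual Plane */
    if (cp <= 0x10FFFF)
        return 4;               /* supplementary planes */
    return 0;
}
```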
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Yeah, it's conceivable that we could implement something whereby
> characters with codes above some cutoff point are handled via runtime
> calls to iswalpha() and friends, rather than being included in the
> statically-constructed DFA maps.  The cutoff point could likely be a lot
> less than U+FFFF, too, thereby saving storage and map build time all
> round.
>
> In the meantime, I still think the caching logic is worth having, and we
> could at least make some people happy if we selected a cutoff point
> somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
> what a good compromise cutoff would be.  One possibility is U+7FF, which
> corresponds to the limit of what fits in 2-byte UTF8; but I don't know
> if that corresponds to any significant dropoff in frequency of usage.

The problem, of course, is that this probably depends quite a bit on
what language you happen to be using.  For some languages, it won't
matter whether you cut it off at U+FF or U+7FF; while for others even
U+FFFF might not be enough.  So I think this is one of those cases
where it's somewhat meaningless to talk about frequency of usage.

In theory you can imagine a regular expression engine where these
decisions can be postponed until we see the string we're matching
against.  IOW, your DFA ends up with state transitions for characters
specifically named, plus a state transition for anything else that's a
letter, plus a state transition for anything else not otherwise
specified.  Then you only need to test the letters that actually appear
in the target string, rather than all of the ones that might appear
there.  But implementing that could be quite a lot of work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
On Sun, Feb 19, 2012 at 04:33, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Yeah, it's conceivable that we could implement something whereby
>> characters with codes above some cutoff point are handled via runtime
>> calls to iswalpha() and friends, rather than being included in the
>> statically-constructed DFA maps.  The cutoff point could likely be a
>> lot less than U+FFFF, too, thereby saving storage and map build time
>> all round.
>>
>> In the meantime, I still think the caching logic is worth having, and
>> we could at least make some people happy if we selected a cutoff point
>> somewhere between U+FF and U+FFFF.  I don't have any strong ideas
>> about what a good compromise cutoff would be.  One possibility is
>> U+7FF, which corresponds to the limit of what fits in 2-byte UTF8; but
>> I don't know if that corresponds to any significant dropoff in
>> frequency of usage.
>
> The problem, of course, is that this probably depends quite a bit on
> what language you happen to be using.  For some languages, it won't
> matter whether you cut it off at U+FF or U+7FF; while for others even
> U+FFFF might not be enough.  So I think this is one of those cases
> where it's somewhat meaningless to talk about frequency of usage.

Does it make sense for regexps to have collations?
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikrey...@gmail.com> wrote:
> Does it make sense for regexps to have collations?

As I understand it, collations determine the sort-ordering of strings.
Regular expressions don't care about that.

Why do you ask?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
On Sun, Feb 19, 2012 at 05:03, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikrey...@gmail.com> wrote:
>> Does it make sense for regexps to have collations?
>
> As I understand it, collations determine the sort-ordering of strings.
> Regular expressions don't care about that.
>
> Why do you ask?

Perhaps I used the wrong term, but I was thinking the locale could tell
us what alphabet we're dealing with.  So a regexp using en_US would give
different word-boundary results from one using zh_CN.
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
Robert Haas <robertmh...@gmail.com> writes:
> In theory you can imagine a regular expression engine where these
> decisions can be postponed until we see the string we're matching
> against.  IOW, your DFA ends up with state transitions for characters
> specifically named, plus a state transition for anything else that's a
> letter, plus a state transition for anything else not otherwise
> specified.  Then you only need to test the letters that actually appear
> in the target string, rather than all of the ones that might appear
> there.  But implementing that could be quite a lot of work.

Yeah, not to mention slow.  The difficulty is overlapping sets of
characters.  As a simple example, if your regex refers to 3, 7,
[[:digit:]], X, and [[:alnum:]], then you end up needing five distinct
colors: 3, 7, X, all digits that aren't 3 or 7, all alphanumerics that
aren't any of the preceding.  And state transitions for the digit and
alnum cases had better mention all and only the correct colors.

I've been tracing through the logic this evening, and it works pretty
simply given that all named character classes are immediately expanded
out to their component characters.  If we are going to try to keep the
classes in some kind of symbolic form, it's a lot messier.  In
particular, I think your sketch above would lead to having to test every
character against iswdigit and iswalnum at runtime, which would be
disastrous performancewise.  I'd like to at least avoid that for the
shorter (and presumably more common) UTF8 codes.

			regards, tom lane
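The five-color example above can be made concrete with a toy partition function: characters that behave identically under every test the regex mentions end up sharing a color. This is only an illustration of the idea, not how the regex engine's color machinery actually works; the function name and the color numbering are invented, and only single-byte characters are considered:

```c
#include <assert.h>
#include <ctype.h>

/* Toy color assignment for a regex mentioning 3, 7, [[:digit:]], X,
 * and [[:alnum:]].  The more specific tests must win, so they are
 * checked first; color 0 is "everything else". */
static int color_of(unsigned char c)
{
    if (c == '3')
        return 1;
    if (c == '7')
        return 2;
    if (c == 'X')
        return 3;
    if (isdigit(c))
        return 4;               /* digits other than 3 and 7 */
    if (isalnum(c))
        return 5;               /* alphanumerics not caught above */
    return 0;
}
```

A transition on `[[:digit:]]` must then mention colors 1, 2, and 4 (and no others), which is exactly the bookkeeping the message describes.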
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
Vik Reykja <vikrey...@gmail.com> writes:
> On Sun, Feb 19, 2012 at 05:03, Robert Haas <robertmh...@gmail.com> wrote:
>> On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikrey...@gmail.com> wrote:
>>> Does it make sense for regexps to have collations?
>>
>> As I understand it, collations determine the sort-ordering of strings.
>> Regular expressions don't care about that.  Why do you ask?

> Perhaps I used the wrong term, but I was thinking the locale could tell
> us what alphabet we're dealing with.  So a regexp using en_US would give
> different word-boundary results from one using zh_CN.

Our interpretation of a collation is that it sets both LC_COLLATE and
LC_CTYPE.  Regexps may not care about the first but they definitely care
about the second.  This is why the stuff in regc_pg_locale.c pays
attention to collation.

			regards, tom lane
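The LC_CTYPE dependence described above is easy to demonstrate outside the regex engine: the same code point can classify differently under different ctype settings. A small sketch (the helper name `isalpha_in` is invented; which locales are installed varies by system, so the checks here stick to the "C" locale, which is always available):

```c
#include <locale.h>
#include <wctype.h>

/* Classify a code point as a letter under a given LC_CTYPE setting.
 * Returns 1/0 for the classification, or -1 if the requested locale
 * is not installed on this system. */
static int isalpha_in(const char *locale_name, wint_t c)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    int result = iswalpha(c) != 0;
    setlocale(LC_CTYPE, "C");   /* restore a known state */
    return result;
}
```

Running this with, say, `"en_US.utf8"` versus `"C"` on a code point like U+00E9 will typically show the divergence, which is why the regex code has to track the collation's LC_CTYPE rather than assume one answer.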
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
On Sat, Feb 18, 2012 at 11:16 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> In theory you can imagine a regular expression engine where these
>> decisions can be postponed until we see the string we're matching
>> against.  IOW, your DFA ends up with state transitions for characters
>> specifically named, plus a state transition for anything else that's a
>> letter, plus a state transition for anything else not otherwise
>> specified.  Then you only need to test the letters that actually
>> appear in the target string, rather than all of the ones that might
>> appear there.  But implementing that could be quite a lot of work.
>
> Yeah, not to mention slow.  The difficulty is overlapping sets of
> characters.  As a simple example, if your regex refers to 3, 7,
> [[:digit:]], X, and [[:alnum:]], then you end up needing five distinct
> colors: 3, 7, X, all digits that aren't 3 or 7, all alphanumerics that
> aren't any of the preceding.  And state transitions for the digit and
> alnum cases had better mention all and only the correct colors.

Yeah, that's unfortunate.  On the other hand, if you don't use colors
for this case, aren't you going to need, for each DFA state, a gigantic
lookup table that includes every character in the server encoding?
Even if you've got plenty of memory, initializing such a beast seems
awfully expensive, and it might not do very good things for cache
locality, either.

> I've been tracing through the logic this evening, and it works pretty
> simply given that all named character classes are immediately expanded
> out to their component characters.  If we are going to try to keep the
> classes in some kind of symbolic form, it's a lot messier.  In
> particular, I think your sketch above would lead to having to test
> every character against iswdigit and iswalnum at runtime, which would
> be disastrous performancewise.  I'd like to at least avoid that for
> the shorter (and presumably more common) UTF8 codes.

Hmm, but you could cache that information.  Instead of building a cache
that covers every possible character that might appear in the target
string, you can just cache the results for the code points that you
actually see.  Yet another option would be to dictate that the cache
can't have holes - it will always include information for every code
point from 0 up to some value X.  If we see a code point in the target
string which is greater than X, then we extend the cache out as far as
that code point.  That way, people who are using only code points out
to U+FF (or even U+7F) don't pay the cost of building a large cache,
but people who need it can get correct behavior.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
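The "no holes" cache proposed above can be sketched in a few lines. This is a toy under invented names (`cached_isalpha`, `cache_limit`); a real implementation would grow the buffer geometrically rather than one entry at a time and report allocation failure instead of aborting:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <wctype.h>

/* Classification results are always materialized for every code point
 * from 0 up to (but not including) cache_limit; the range is extended
 * on demand when a larger code point is first seen. */
static unsigned char *alpha_cache = NULL;
static unsigned cache_limit = 0;

static bool cached_isalpha(unsigned c)
{
    if (c >= cache_limit)
    {
        unsigned newlimit = c + 1;
        unsigned char *newbuf = realloc(alpha_cache, newlimit);

        if (newbuf == NULL)
            abort();            /* toy code; real code would error out */
        alpha_cache = newbuf;
        /* fill in only the newly uncovered range, keeping it hole-free */
        for (unsigned i = cache_limit; i < newlimit; i++)
            newbuf[i] = (iswalpha((wint_t) i) != 0);
        cache_limit = newlimit;
    }
    return alpha_cache[c] != 0;
}
```

The hole-free invariant is what keeps the lookup a plain array index: a query below `cache_limit` never has to ask libc at all.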
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
On 16.02.2012 01:06, Tom Lane wrote:
> In bug #6457 it's pointed out that we *still* don't have full
> functionality for locale-dependent regexp behavior with UTF8 encoding.
> The reason is that there's old crufty code in regc_locale.c that only
> considers character codes up to 255 when searching for characters that
> should be considered letters, digits, etc.  We could fix that, for some
> value of "fix", by iterating up to perhaps 0xFFFF when dealing with
> UTF8 encoding, but the time that would take is unappealing.  Especially
> so considering that this code is executed afresh anytime we compile a
> regex that requires locale knowledge.
>
> I looked into the upstream Tcl code and observed that they deal with
> this by having hard-wired tables of which Unicode code points are to be
> considered letters etc.  The tables are directly traceable to the
> Unicode standard (they provide a script to regenerate them from files
> available from unicode.org).  Nonetheless, I do not find that approach
> appealing, mainly because we'd be risking deviating from the libc
> locale code's behavior within regexes when we follow it everywhere
> else.  It seems entirely likely to me that a particular locale setting
> might consider only some of what Unicode says are letters to be
> letters.
>
> However, we could possibly compromise by using Unicode-derived tables
> as a guide to which code points are worth probing libc for.  That is,
> assume that a utf8-based locale will never claim that some code is a
> letter that unicode.org doesn't think is a letter.  That would cut the
> number of required probes by a pretty large factor.
>
> The other thing that seems worth doing is to install some caching.
> We could presumably assume that the behavior of iswupper() et al are
> fixed for the duration of a database session, so that we only need to
> run the probe loop once when first asked to create a cvec for a
> particular category.
>
> Thoughts, better ideas?

Here's a wild idea: keep the class of each codepoint in a hash table.
Initialize it with all codepoints up to 0xFFFF.  After that, whenever a
string contains a character that's not in the hash table yet, query the
class of that character, and add it to the hash table.  Then recompile
the whole regex and restart the matching engine.  Recompiling is
expensive, but if you cache the results for the session, it would
probably be acceptable.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
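The per-category caching Tom proposes (which the draft patch later in the thread exposes as `pg_ctype_get_cache`) can be sketched as a lookup keyed on the classifier function: probe libc once per category per session, then answer from a bitmap. Everything here (the struct, the probe limit, the wrapper) is invented for illustration; the wrapper exists because taking the address of `iswalpha` directly is not guaranteed portable if it is shadowed by a macro:

```c
#include <stdlib.h>
#include <wctype.h>

#define PROBE_LIMIT 0x10000     /* probe the BMP, as discussed above */
#define MAX_CACHES  8

typedef int (*class_probe)(wint_t);

/* One cached membership bitmap per classifier function, built on first
 * request by probing libc once for each code point below PROBE_LIMIT. */
struct class_cache
{
    class_probe probe;
    unsigned char bits[PROBE_LIMIT / 8];
};

static struct class_cache caches[MAX_CACHES];
static int ncaches = 0;

static const struct class_cache *get_class_cache(class_probe probe)
{
    for (int i = 0; i < ncaches; i++)
        if (caches[i].probe == probe)
            return &caches[i];          /* built earlier this session */
    if (ncaches >= MAX_CACHES)
        abort();                        /* toy code: fixed-size table */
    struct class_cache *cc = &caches[ncaches++];
    cc->probe = probe;
    for (unsigned c = 0; c < PROBE_LIMIT; c++)
        if (probe((wint_t) c))
            cc->bits[c / 8] |= (unsigned char) (1u << (c % 8));
    return cc;
}

/* Wrapper so we have a plain function pointer for iswalpha. */
static int probe_isalpha(wint_t c)
{
    return iswalpha(c);
}
```

With this shape, the expensive 64K-probe loop runs at most once per classifier per session, which is exactly the amortization the message argues for.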
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> writes:
> Here's a wild idea: keep the class of each codepoint in a hash table.
> Initialize it with all codepoints up to 0xFFFF.  After that, whenever a
> string contains a character that's not in the hash table yet, query the
> class of that character, and add it to the hash table.  Then recompile
> the whole regex and restart the matching engine.  Recompiling is
> expensive, but if you cache the results for the session, it would
> probably be acceptable.

Dunno ... recompiling is so expensive that I can't see this being a win;
not to mention that it would require fundamental surgery on the regex
code.  In the Tcl implementation, no codepoints above U+FFFF have any
locale properties (alpha/digit/punct/etc), period.  Personally I'd not
have a problem imposing the same limitation, so that dealing with stuff
above that range isn't really a consideration anyway.

			regards, tom lane
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
On 02/17/2012 09:39 AM, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> writes:
>> Here's a wild idea: keep the class of each codepoint in a hash table.
>> Initialize it with all codepoints up to 0xFFFF.  After that, whenever a
>> string contains a character that's not in the hash table yet, query the
>> class of that character, and add it to the hash table.  Then recompile
>> the whole regex and restart the matching engine.  Recompiling is
>> expensive, but if you cache the results for the session, it would
>> probably be acceptable.
>
> Dunno ... recompiling is so expensive that I can't see this being a win;
> not to mention that it would require fundamental surgery on the regex
> code.  In the Tcl implementation, no codepoints above U+FFFF have any
> locale properties (alpha/digit/punct/etc), period.  Personally I'd not
> have a problem imposing the same limitation, so that dealing with stuff
> above that range isn't really a consideration anyway.

Up to U+FFFF is the BMP, which is described as containing characters for
almost all modern languages, and a large number of special characters.
It seems very likely to be acceptable not to bother about the locale of
code points in the supplementary planes.  See
http://en.wikipedia.org/wiki/Plane_%28Unicode%29 for descriptions of
which sets of characters are involved.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas
<heikki.linnakan...@enterprisedb.com> wrote:
> Here's a wild idea: keep the class of each codepoint in a hash table.
> Initialize it with all codepoints up to 0xFFFF.  After that, whenever a
> string contains a character that's not in the hash table yet, query the
> class of that character, and add it to the hash table.  Then recompile
> the whole regex and restart the matching engine.  Recompiling is
> expensive, but if you cache the results for the session, it would
> probably be acceptable.

What if you did this ONCE and wrote the results to a file someplace?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
Robert Haas <robertmh...@gmail.com> writes:
> On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas
> <heikki.linnakan...@enterprisedb.com> wrote:
>> Recompiling is expensive, but if you cache the results for the
>> session, it would probably be acceptable.

> What if you did this ONCE and wrote the results to a file someplace?

That's still a cache, you've just defaulted on your obligation to think
about what conditions require the cache to be flushed.  (In the case at
hand, the trigger for a cache rebuild would probably need to be a glibc
package update, which we have no way of knowing about.)

Before going much further with this, we should probably do some timings
of 64K calls of iswupper and friends, just to see how bad a dumb
implementation will be.

			regards, tom lane
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> What if you did this ONCE and wrote the results to a file someplace?
>
> That's still a cache, you've just defaulted on your obligation to think
> about what conditions require the cache to be flushed.

Yep.  Unfortunately, I don't have a good idea how to handle that; I was
hoping someone else did.

> Before going much further with this, we should probably do some timings
> of 64K calls of iswupper and friends, just to see how bad a dumb
> implementation will be.

Can't hurt.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
Robert Haas <robertmh...@gmail.com> writes:
> On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Before going much further with this, we should probably do some timings
>> of 64K calls of iswupper and friends, just to see how bad a dumb
>> implementation will be.

> Can't hurt.

The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503)
running Fedora 16 in en_US.utf8 locale, is that 64K iterations of
pg_wc_isalpha or sibling functions requires a shade under 2ms.  So this
definitely justifies caching the values to avoid computing them more
than once per session, but I'm not convinced there are grounds for
trying harder than that.

BTW, I am also a bit surprised to find out that this locale considers
48342 of those characters to satisfy isalpha().  Seems like a heck of a
lot.  But anyway we can forget my idea of trying to save work by
incorporating a-priori assumptions about which Unicode codepoints are
which --- it'll be faster to just iterate through them all, at least
for that case.  Maybe we should hard-wire some cases like digits, not
sure.

			regards, tom lane
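The measurement described above is easy to reproduce in spirit with a standalone loop. This is a rough sketch, not the code Tom timed (he measured pg_wc_isalpha inside the backend); absolute numbers will vary with libc, locale, and hardware, so the only safe expectations are that the loop finishes and finds some letters:

```c
#include <time.h>
#include <wctype.h>

static int g_alpha_count = 0;   /* how many BMP code points classified as alpha */

/* Time 64K iswalpha() probes; returns elapsed CPU time in milliseconds. */
static double time_probe_loop(void)
{
    int n = 0;
    clock_t start = clock();

    for (unsigned c = 0; c < 0x10000; c++)
        if (iswalpha((wint_t) c))
            n++;
    g_alpha_count = n;
    return (double) (clock() - start) * 1000.0 / CLOCKS_PER_SEC;
}
```

Run under a UTF-8 locale (e.g. after `setlocale(LC_CTYPE, "en_US.utf8")`, where installed), the alpha count should be in the tens of thousands, in line with the 48342 figure quoted above; under the default "C" locale it will be far smaller.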
Re: [HACKERS] Notes about fixing regexes and UTF-8 (yet again)
I wrote:
> The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503)
> running Fedora 16 in en_US.utf8 locale, is that 64K iterations of
> pg_wc_isalpha or sibling functions requires a shade under 2ms.  So this
> definitely justifies caching the values to avoid computing them more
> than once per session, but I'm not convinced there are grounds for
> trying harder than that.

And here's a poorly-tested draft patch for that.

			regards, tom lane

diff --git a/src/backend/regex/regc_cvec.c b/src/backend/regex/regc_cvec.c
index fb6f06b5243f50bfad2cefa5c016d4e842791a3d..98f3c597678b492dd59afcd956e5cdfecdba4f86 100644
*** a/src/backend/regex/regc_cvec.c
--- b/src/backend/regex/regc_cvec.c
*************** static void
*** 77,82 ****
--- 77,83 ----
  addchr(struct cvec * cv,		/* character vector */
  	   chr c)					/* character to add */
  {
+ 	assert(cv->nchrs < cv->chrspace);
  	cv->chrs[cv->nchrs++] = (chr) c;
  }
diff --git a/src/backend/regex/regc_locale.c b/src/backend/regex/regc_locale.c
index 6cf27958b1545a61fba01e76dc4d37aca32789dc..44ce582bdad1a7d830d4122cada45a39c188981c 100644
*** a/src/backend/regex/regc_locale.c
--- b/src/backend/regex/regc_locale.c
*************** static const struct cname
*** 351,356 ****
--- 351,366 ----
  /*
+  * We do not use the hard-wired Unicode classification tables that Tcl does.
+  * This is because (a) we need to deal with other encodings besides Unicode,
+  * and (b) we want to track the behavior of the libc locale routines as
+  * closely as possible.  For example, it wouldn't be unreasonable for a
+  * locale to not consider every Unicode letter as a letter.  So we build
+  * character classification cvecs by asking libc, even for Unicode.
+  */
+
+
+ /*
   * element - map collating-element name to celt
   */
  static celt
*************** cclass(struct vars * v,		/* context */
*** 498,503 ****
--- 508,514 ----
  	   int cases)				/* case-independent? */
  {
  	size_t		len;
+ 	const struct cvec *ccv = NULL;
  	struct cvec *cv = NULL;
  	const char * const *namePtr;
  	int			i,
*************** cclass(struct vars * v,		/* context */
*** 549,626 ****
  	/*
  	 * Now compute the character class contents.
- 	 *
- 	 * For the moment, assume that only char codes < 256 can be in these
- 	 * classes.
  	 */
  	switch ((enum classes) index)
  	{
  		case CC_PRINT:
! 			cv = getcvec(v, UCHAR_MAX, 0);
! 			if (cv)
! 			{
! 				for (i = 0; i <= UCHAR_MAX; i++)
! 				{
! 					if (pg_wc_isprint((chr) i))
! 						addchr(cv, (chr) i);
! 				}
! 			}
  			break;
  		case CC_ALNUM:
! 			cv = getcvec(v, UCHAR_MAX, 0);
! 			if (cv)
! 			{
! 				for (i = 0; i <= UCHAR_MAX; i++)
! 				{
! 					if (pg_wc_isalnum((chr) i))
! 						addchr(cv, (chr) i);
! 				}
! 			}
  			break;
  		case CC_ALPHA:
! 			cv = getcvec(v, UCHAR_MAX, 0);
! 			if (cv)
! 			{
! 				for (i = 0; i <= UCHAR_MAX; i++)
! 				{
! 					if (pg_wc_isalpha((chr) i))
! 						addchr(cv, (chr) i);
! 				}
! 			}
  			break;
  		case CC_ASCII:
  			cv = getcvec(v, 0, 1);
  			if (cv)
  				addrange(cv, 0, 0x7f);
  			break;
  		case CC_BLANK:
  			cv = getcvec(v, 2, 0);
  			addchr(cv, '\t');
  			addchr(cv, ' ');
  			break;
  		case CC_CNTRL:
  			cv = getcvec(v, 0, 2);
  			addrange(cv, 0x0, 0x1f);
  			addrange(cv, 0x7f, 0x9f);
  			break;
  		case CC_DIGIT:
! 			cv = getcvec(v, 0, 1);
! 			if (cv)
! 				addrange(cv, (chr) '0', (chr) '9');
  			break;
  		case CC_PUNCT:
! 			cv = getcvec(v, UCHAR_MAX, 0);
! 			if (cv)
! 			{
! 				for (i = 0; i <= UCHAR_MAX; i++)
! 				{
! 					if (pg_wc_ispunct((chr) i))
! 						addchr(cv, (chr) i);
! 				}
! 			}
  			break;
  		case CC_XDIGIT:
  			cv = getcvec(v, 0, 3);
  			if (cv)
  			{
--- 560,608 ----
  	/*
  	 * Now compute the character class contents.
  	 */
  	switch ((enum classes) index)
  	{
  		case CC_PRINT:
! 			ccv = pg_ctype_get_cache(pg_wc_isprint);
  			break;
  		case CC_ALNUM:
! 			ccv = pg_ctype_get_cache(pg_wc_isalnum);
  			break;
  		case CC_ALPHA:
! 			ccv = pg_ctype_get_cache(pg_wc_isalpha);
  			break;
  		case CC_ASCII:
+ 			/* hard-wired meaning */
  			cv = getcvec(v, 0, 1);
  			if (cv)
  				addrange(cv, 0, 0x7f);
  			break;
  		case CC_BLANK:
+ 			/* hard-wired meaning */
  			cv = getcvec(v, 2, 0);
  			addchr(cv, '\t');
  			addchr(cv, ' ');
  			break;
  		case CC_CNTRL:
+ 			/* hard-wired meaning */
  			cv = getcvec(v, 0, 2);
  			addrange(cv, 0x0, 0x1f);
  			addrange(cv, 0x7f, 0x9f);
  			break;
  		case CC_DIGIT:
! 			ccv = pg_ctype_get_cache(pg_wc_isdigit);
  			break;
  		case CC_PUNCT:
! 			ccv = pg_ctype_get_cache(pg_wc_ispunct);
  			break;
  		case CC_XDIGIT:
+ 			/*
+ 			 * It's not clear how to define this in non-western locales, and
+ 			 * even less clear that there's any particular use in trying.
+ 			 * So just hard-wire the meaning.
+ 			 */
  			cv = getcvec(v, 0, 3);
  			if (cv)
  			{
*************** cclass(struct vars * v,		/* context */
*** 630,679 ****
  	}
  	break;
  	case