Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On 08/27/2014 08:13 AM, Jeff Davis wrote: On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote: your patch seems to be about 2x-3x as slow as unpatched master. So this needs some optimization. A couple of ideas: I didn't see anywhere near that kind of regression. On unpatched master, with your test case, I saw it stabilize to about 680ms. With similar-escape-1, I saw about 775ms (15% regression). Are those at all close to your numbers? Is there a chance you used an unoptimized build for one of them, or left asserts enabled? Oh. I can't now reproduce my earlier results either, I must've messed up something. I'm now seeing similar numbers as you. Attached a patch implementing the same idea though: only use the multibyte path if *both* the escape char and the current character from the pattern are multibyte. I also changed the comment to more clearly state the behavior upon which we're relying. I hope what I said is accurate. s/the the/the/. Other than that, looks good to me. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote: Actually, that gets optimized to a constant in the planner: Oops, thank you (and Tom). your patch seems to be about 2x-3x as slow as unpatched master. So this needs some optimization. A couple of ideas: I didn't see anywhere near that kind of regression. On unpatched master, with your test case, I saw it stabilize to about 680ms. With similar-escape-1, I saw about 775ms (15% regression). Are those at all close to your numbers? Is there a chance you used an unoptimized build for one of them, or left asserts enabled? 1. If the escape string is in fact a single-byte character, you can proceed with the loop just as it is today, without the pg_mblen calls. 2. Since pg_mblen() will always return an integer between 1-6, it would probably be faster to replace the memcpy() and memcmp() calls with simple for-loops iterating byte-by-byte. In very brief testing, with the 1. change above, the performance with this patch is back to what it's without the patch. See attached. The particular patch has a mistake: the first branch is always taken because pg_mblen() won't return 0. It's also fairly ugly to set mblen in the test for the branch that doesn't use it. Attached a patch implementing the same idea though: only use the multibyte path if *both* the escape char and the current character from the pattern are multibyte. I also changed the comment to more clearly state the behavior upon which we're relying. I hope what I said is accurate. Regards, Jeff Davis *** a/src/backend/utils/adt/regexp.c --- b/src/backend/utils/adt/regexp.c *** *** 688,698 similar_escape(PG_FUNCTION_ARGS) elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ ! else if (elen != 1) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg(invalid escape string), ! errhint(Escape string must be empty or one character.))); } /*-- --- 688,703 elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ ! else ! { ! int escape_mblen = pg_mbstrlen_with_len(e, elen); ! ! if (escape_mblen 1) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg(invalid escape string), ! errhint(Escape string must be empty or one character.))); ! } } /*-- *** *** 724,729 similar_escape(PG_FUNCTION_ARGS) --- 729,782 { char pchar = *p; + /* + * If both the escape character and the current character from the + * pattern are multi-byte, we need to take the slow path. + * + * But if one of them is single-byte, we can process the the pattern + * one byte at a time, ignoring multi-byte characters. (This works + * because all server-encodings have the property that a valid + * multi-byte character representation cannot contain the + * representation of a valid single-byte character.) + */ + + if (elen 1) + { + int mblen = pg_mblen(p); + if (mblen 1) + { + /* slow, multi-byte path */ + if (afterescape) + { + *r++ = '\\'; + memcpy(r, p, mblen); + r += mblen; + afterescape = false; + } + else if (e elen == mblen memcmp(e, p, mblen) == 0) + { + /* SQL99 escape character; do not send to output */ + afterescape = true; + } + else + { + /* + * We know it's a multi-byte character, so we don't need + * to do all the comparisons to single-byte characters + * that we do below. + */ + memcpy(r, p, mblen); + r += mblen; + } + + p += mblen; + plen -= mblen; + + continue; + } + } + + /* fast path */ if (afterescape) { if (pchar == '' !incharclass) /* for SUBSTRING patterns */ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On Tue, 2014-08-26 at 22:13 -0700, Jeff Davis wrote: Attached a patch implementing the same idea though: only use the multibyte path if *both* the escape char and the current character from the pattern are multibyte. Forgot to mention: with this patch, the test completes in about 720ms, so just between unpatched and the v1 patch. I feel like we're spending too much time optimizing this patch, but we got this far, and it doesn't really complicate the code, so we might as well finish it. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On 07/12/2014 05:16 AM, Jeff Davis wrote: I was able to see about a 2% increase in runtime when using the similar_escape function directly. I made a 10M tuple table and did: explain analyze select similar_escape('','#') from t; which was the worst reasonable case I could think of. (It appears that selecting from a table is faster than from generate_series. I'm curious what you use when testing the performance of an individual function at the SQL level.) A large table like that is what I usually do. A large generate_series() spends a lot of time building the tuplestore, especially if it doesn't fit in work_mem and spills to disk. Sometimes I use this to avoid it: explain analyze select similar_escape('','#') from generate_series(1, 1) a, generate_series(1,1000); although in my experience it still has somewhat more overhead than a straight seqscan because. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Heikki Linnakangas hlinnakan...@vmware.com writes: On 07/12/2014 05:16 AM, Jeff Davis wrote: I was able to see about a 2% increase in runtime when using the similar_escape function directly. I made a 10M tuple table and did: explain analyze select similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t; which was the worst reasonable case I could think of. (It appears that selecting from a table is faster than from generate_series. I'm curious what you use when testing the performance of an individual function at the SQL level.) A large table like that is what I usually do. A large generate_series() spends a lot of time building the tuplestore, especially if it doesn't fit in work_mem and spills to disk. Sometimes I use this to avoid it: explain analyze select similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from generate_series(1, 1) a, generate_series(1,1000); although in my experience it still has somewhat more overhead than a straight seqscan because. [ scratches head... ] Surely similar_escape is marked immutable, and will therefore be executed exactly once in either of these formulations, because the planner will fold the expression to a constant. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On 07/12/2014 05:16 AM, Jeff Davis wrote: On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote: Jeff Davis pg...@j-davis.com writes: Attached is a small patch to $SUBJECT. In master, only single-byte characters are allowed as an escape. Of course, with the patch it must still be a single character, but it may be multi-byte. I'm concerned about the performance cost of this patch. Have you done any measurements about what kind of overhead you are putting on the inner loop of similar_escape? I didn't consider this very performance critical, because this is looping through the pattern, which I wouldn't expect to be a long string. On my machine using en_US.UTF-8, the difference is imperceptible for a SIMILAR TO ... ESCAPE query. I was able to see about a 2% increase in runtime when using the similar_escape function directly. I made a 10M tuple table and did: explain analyze select similar_escape('','#') from t; which was the worst reasonable case I could think of. Actually, that gets optimized to a constant in the planner: postgres=# explain verbose select similar_escape('','#') from t; QUERY PLAN -- Seq Scan on public.t (cost=0.00..144247.85 rows=985 width=0) Output: '^(?: )$'::text Planning time: 0.033 ms (3 rows) With a working test case: create table t (pattern text); insert into t select '' from generate_series(1, 100); vacuum t; explain (analyze) select similar_escape(pattern,'#') from t; your patch seems to be about 2x-3x as slow as unpatched master. So this needs some optimization. A couple of ideas: 1. If the escape string is in fact a single-byte character, you can proceed with the loop just as it is today, without the pg_mblen calls. 2. Since pg_mblen() will always return an integer between 1-6, it would probably be faster to replace the memcpy() and memcmp() calls with simple for-loops iterating byte-by-byte. In very brief testing, with the 1. change above, the performance with this patch is back to what it's without the patch. See attached. - Heikki diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c index caf45ef..dd46325 100644 --- a/src/backend/utils/adt/regexp.c +++ b/src/backend/utils/adt/regexp.c @@ -688,11 +688,16 @@ similar_escape(PG_FUNCTION_ARGS) elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ - else if (elen != 1) - ereport(ERROR, - (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), - errmsg(invalid escape string), - errhint(Escape string must be empty or one character.))); + else + { + int escape_mblen = pg_mbstrlen_with_len(e, elen); + + if (escape_mblen 1) +ereport(ERROR, + (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), + errmsg(invalid escape string), + errhint(Escape string must be empty or one character.))); + } } /*-- @@ -723,59 +728,94 @@ similar_escape(PG_FUNCTION_ARGS) while (plen 0) { char pchar = *p; + int mblen; - if (afterescape) + /* + * If the escape string is single-byte character, we can process the + * the pattern one byte at a time, ignoring multi-byte characters. + * (This works because all server-encodings have the property that the + * a non-first byte of a multi-byte characters always has the high-bit + * set, and hence we cannot be fooled by a byte in the middle of a + * multi-byte character.) + */ + if (elen == 1 || (mblen = pg_mblen(p))) { - if (pchar == '' !incharclass) /* for SUBSTRING patterns */ -*r++ = ((nquotes++ % 2) == 0) ? '(' : ')'; - else + if (afterescape) + { +if (pchar == '' !incharclass) /* for SUBSTRING patterns */ + *r++ = ((nquotes++ % 2) == 0) ? '(' : ')'; +else +{ + *r++ = '\\'; + *r++ = pchar; +} +afterescape = false; + } + else if (e pchar == *e) + { +/* SQL99 escape character; do not send to output */ +afterescape = true; + } + else if (incharclass) + { +if (pchar == '\\') + *r++ = '\\'; +*r++ = pchar; +if (pchar == ']') + incharclass = false; + } + else if (pchar == '[') + { +*r++ = pchar; +incharclass = true; + } + else if (pchar == '%') + { +*r++ = '.'; +*r++ = '*'; + } + else if (pchar == '_') +*r++ = '.'; + else if (pchar == '(') + { +/* convert to non-capturing parenthesis */ +*r++ = '('; +*r++ = '?'; +*r++ = ':'; + } + else if (pchar == '\\' || pchar == '.' || + pchar == '^' || pchar == '$') { *r++ = '\\'; *r++ = pchar; } - afterescape = false; - } - else if
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On 08/25/2014 04:48 PM, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 07/12/2014 05:16 AM, Jeff Davis wrote: I was able to see about a 2% increase in runtime when using the similar_escape function directly. I made a 10M tuple table and did: explain analyze select similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t; which was the worst reasonable case I could think of. (It appears that selecting from a table is faster than from generate_series. I'm curious what you use when testing the performance of an individual function at the SQL level.) A large table like that is what I usually do. A large generate_series() spends a lot of time building the tuplestore, especially if it doesn't fit in work_mem and spills to disk. Sometimes I use this to avoid it: explain analyze select similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from generate_series(1, 1) a, generate_series(1,1000); although in my experience it still has somewhat more overhead than a straight seqscan because. [ scratches head... ] Surely similar_escape is marked immutable, and will therefore be executed exactly once in either of these formulations, because the planner will fold the expression to a constant. Yeah, just noticed that myself.. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Heikki Linnakangas hlinnakan...@vmware.com writes: On 08/25/2014 04:48 PM, Tom Lane wrote: [ scratches head... ] Surely similar_escape is marked immutable, and will therefore be executed exactly once in either of these formulations, because the planner will fold the expression to a constant. Yeah, just noticed that myself.. ... although, given that, it is also fair to wonder how much the speed of similar_escape really matters. Surely in most use-cases the pattern and escape char will be constants. And, when they are not, wouldn't the subsequent parsing work for the regexes dominate the cost of similar_escape anyway? IOW, I'm not sure we should be advising Jeff to duplicate code in order to have a fast path. Keeping it short might be the best goal. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On 08/25/2014 06:14 PM, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 08/25/2014 04:48 PM, Tom Lane wrote: [ scratches head... ] Surely similar_escape is marked immutable, and will therefore be executed exactly once in either of these formulations, because the planner will fold the expression to a constant. Yeah, just noticed that myself.. ... although, given that, it is also fair to wonder how much the speed of similar_escape really matters. Surely in most use-cases the pattern and escape char will be constants. And, when they are not, wouldn't the subsequent parsing work for the regexes dominate the cost of similar_escape anyway? IOW, I'm not sure we should be advising Jeff to duplicate code in order to have a fast path. Keeping it short might be the best goal. It's certainly not worth bending over backwards for a small performance gain here, but I think special-casing a single-byte escape sequence is still quite reasonable. It doesn't make the code any longer. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On Fri, 2014-07-11 at 14:41 +0900, Fujii Masao wrote: Could you add the patch into next CF? Sure. The patch is so small I was thinking about committing it in a few days (assuming no complaints), but I'm in no hurry. The patch doesn't contain the change of the document. But I think that it's better to document what character is allowed as escape in LIKE, SIMILAR TO and SUBSTRING. It should be assumed that multi-byte characters are allowed nearly everywhere, and we should document the places where only single-byte characters are allowed. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Jeff Davis pg...@j-davis.com writes: Attached is a small patch to $SUBJECT. In master, only single-byte characters are allowed as an escape. Of course, with the patch it must still be a single character, but it may be multi-byte. I'm concerned about the performance cost of this patch. Have you done any measurements about what kind of overhead you are putting on the inner loop of similar_escape? At the very least, please don't call GetDatabaseEncoding() over again every single time through the inner loop. More generally, why do we need to use pg_encoding_verifymb? The input data is presumably validly encoded. ISTM the significantly cheaper pg_mblen() would be more appropriate. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote: Jeff Davis pg...@j-davis.com writes: Attached is a small patch to $SUBJECT. In master, only single-byte characters are allowed as an escape. Of course, with the patch it must still be a single character, but it may be multi-byte. I'm concerned about the performance cost of this patch. Have you done any measurements about what kind of overhead you are putting on the inner loop of similar_escape? I didn't consider this very performance critical, because this is looping through the pattern, which I wouldn't expect to be a long string. On my machine using en_US.UTF-8, the difference is imperceptible for a SIMILAR TO ... ESCAPE query. I was able to see about a 2% increase in runtime when using the similar_escape function directly. I made a 10M tuple table and did: explain analyze select similar_escape('','#') from t; which was the worst reasonable case I could think of. (It appears that selecting from a table is faster than from generate_series. I'm curious what you use when testing the performance of an individual function at the SQL level.) At the very least, please don't call GetDatabaseEncoding() over again every single time through the inner loop. More generally, why do we need to use pg_encoding_verifymb? The input data is presumably validly encoded. ISTM the significantly cheaper pg_mblen() would be more appropriate. Thank you. Using the non-verifying variants reduces the penalty in the above test to 1%. If needed, we could optimize further through code specialization, as like_escape() does. Though I think like_escape() is specialized just because MatchText() is specialized; and MatchText is specialized because it operates on the actual data, not just the pattern. So I don't see a reason to specialize similar_escape(). Regards, Jeff Davis *** a/src/backend/utils/adt/regexp.c --- b/src/backend/utils/adt/regexp.c *** *** 688,698 similar_escape(PG_FUNCTION_ARGS) elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ ! else if (elen != 1) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg(invalid escape string), ! errhint(Escape string must be empty or one character.))); } /*-- --- 688,703 elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ ! else ! { ! int escape_mblen = pg_mbstrlen_with_len(e, elen); ! ! if (escape_mblen 1) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg(invalid escape string), ! errhint(Escape string must be empty or one character.))); ! } } /*-- *** *** 722,781 similar_escape(PG_FUNCTION_ARGS) while (plen 0) { ! char pchar = *p; ! if (afterescape) { ! if (pchar == '' !incharclass) /* for SUBSTRING patterns */ ! *r++ = ((nquotes++ % 2) == 0) ? '(' : ')'; ! else { *r++ = '\\'; *r++ = pchar; } ! afterescape = false; ! } ! else if (e pchar == *e) ! { ! /* SQL99 escape character; do not send to output */ ! afterescape = true; } ! else if (incharclass) { ! if (pchar == '\\') *r++ = '\\'; ! *r++ = pchar; ! if (pchar == ']') ! incharclass = false; ! } ! else if (pchar == '[') ! { ! *r++ = pchar; ! incharclass = true; ! } ! else if (pchar == '%') ! { ! *r++ = '.'; ! *r++ = '*'; ! } ! else if (pchar == '_') ! *r++ = '.'; ! else if (pchar == '(') ! { ! /* convert to non-capturing parenthesis */ ! *r++ = '('; ! *r++ = '?'; ! *r++ = ':'; ! } ! else if (pchar == '\\' || pchar == '.' || ! pchar == '^' || pchar == '$') ! { ! *r++ = '\\'; ! *r++ = pchar; } - else - *r++ = pchar; - p++, plen--; } *r++ = ')'; --- 727,813 while (plen 0) { ! char pchar = *p; ! int mblen = pg_mblen(p); ! if (mblen == 1) { ! if (afterescape) ! { ! if (pchar == '' !incharclass) /* for SUBSTRING patterns */ ! *r++ = ((nquotes++ % 2) == 0) ? '(' : ')'; ! else ! { ! *r++ = '\\'; ! *r++ = pchar; ! } ! afterescape = false; ! } ! else if (e pchar == *e) ! { ! /* SQL99 escape character; do not send to output */ ! afterescape = true; ! } ! else if (incharclass) ! { ! if (pchar == '\\') ! *r++ = '\\'; ! *r++ = pchar; ! if (pchar == ']') ! incharclass = false; ! } ! else if (pchar == '[') ! { ! *r++ = pchar; ! incharclass = true; ! } ! else if (pchar == '%') ! { ! *r++ = '.'; ! *r++ = '*'; ! } ! else if (pchar == '_') ! *r++ = '.'; ! else if (pchar == '(') ! { ! /* convert to non-capturing parenthesis */ ! *r++ = '('; ! *r++ = '?'; ! *r++ =
[HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Attached is a small patch to $SUBJECT. In master, only single-byte characters are allowed as an escape. Of course, with the patch it must still be a single character, but it may be multi-byte. Regards, Jeff Davis *** a/src/backend/utils/adt/regexp.c --- b/src/backend/utils/adt/regexp.c *** *** 688,698 similar_escape(PG_FUNCTION_ARGS) elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ ! else if (elen != 1) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg(invalid escape string), ! errhint(Escape string must be empty or one character.))); } /*-- --- 688,704 elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ ! else ! { ! int escape_mblen = pg_verify_mbstr_len(GetDatabaseEncoding(), e, ! elen, false); ! ! if (escape_mblen 1) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg(invalid escape string), ! errhint(Escape string must be empty or one character.))); ! } } /*-- *** *** 722,781 similar_escape(PG_FUNCTION_ARGS) while (plen 0) { ! char pchar = *p; ! if (afterescape) { ! if (pchar == '' !incharclass) /* for SUBSTRING patterns */ ! *r++ = ((nquotes++ % 2) == 0) ? '(' : ')'; ! else { *r++ = '\\'; *r++ = pchar; } ! afterescape = false; ! } ! else if (e pchar == *e) ! { ! /* SQL99 escape character; do not send to output */ ! afterescape = true; } ! else if (incharclass) { ! if (pchar == '\\') *r++ = '\\'; ! *r++ = pchar; ! if (pchar == ']') ! incharclass = false; ! } ! else if (pchar == '[') ! { ! *r++ = pchar; ! incharclass = true; ! } ! else if (pchar == '%') ! { ! *r++ = '.'; ! *r++ = '*'; ! } ! else if (pchar == '_') ! *r++ = '.'; ! else if (pchar == '(') ! { ! /* convert to non-capturing parenthesis */ ! *r++ = '('; ! *r++ = '?'; ! *r++ = ':'; ! } ! else if (pchar == '\\' || pchar == '.' || ! pchar == '^' || pchar == '$') ! { ! *r++ = '\\'; ! *r++ = pchar; } - else - *r++ = pchar; - p++, plen--; } *r++ = ')'; --- 728,816 while (plen 0) { ! char pchar = *p; ! int mblen = pg_encoding_verifymb(GetDatabaseEncoding(), p, plen); ! ! Assert(mblen 0); ! if (mblen == 1) { ! if (afterescape) ! { ! if (pchar == '' !incharclass) /* for SUBSTRING patterns */ ! *r++ = ((nquotes++ % 2) == 0) ? '(' : ')'; ! else ! { ! *r++ = '\\'; ! *r++ = pchar; ! } ! afterescape = false; ! } ! else if (e pchar == *e) ! { ! /* SQL99 escape character; do not send to output */ ! afterescape = true; ! } ! else if (incharclass) ! { ! if (pchar == '\\') ! *r++ = '\\'; ! *r++ = pchar; ! if (pchar == ']') ! incharclass = false; ! } ! else if (pchar == '[') ! { ! *r++ = pchar; ! incharclass = true; ! } ! else if (pchar == '%') ! { ! *r++ = '.'; ! *r++ = '*'; ! } ! else if (pchar == '_') ! *r++ = '.'; ! else if (pchar == '(') ! { ! /* convert to non-capturing parenthesis */ ! *r++ = '('; ! *r++ = '?'; ! *r++ = ':'; ! } ! else if (pchar == '\\' || pchar == '.' || ! pchar == '^' || pchar == '$') { *r++ = '\\'; *r++ = pchar; } ! else ! *r++ = pchar; ! p++, plen--; } ! else { ! if (afterescape) ! { *r++ = '\\'; ! memcpy(r, p, mblen); ! r += mblen; ! afterescape = false; ! } ! else if (e elen == mblen memcmp(e, p, mblen) == 0) ! { ! /* SQL99 escape character; do not send to output */ ! afterescape = true; ! } ! else ! { ! memcpy(r, p, mblen); ! r += mblen; ! } ! ! p += mblen; ! plen -= mblen; } } *r++ = ')'; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On Fri, Jul 11, 2014 at 12:49 PM, Jeff Davis pg...@j-davis.com wrote: Attached is a small patch to $SUBJECT. In master, only single-byte characters are allowed as an escape. Of course, with the patch it must still be a single character, but it may be multi-byte. +1 Probably you know that, multi-byte character is already allowed as escape in LIKE. There seems no reason why we don't support multi-byte escape character in SIMILAR TO and SUBSTRING. Could you add the patch into next CF? The patch doesn't contain the change of the document. But I think that it's better to document what character is allowed as escape in LIKE, SIMILAR TO and SUBSTRING. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers