Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 08:13 AM, Jeff Davis wrote:

On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote:

your patch seems to be about 2x-3x as slow as unpatched master. So this
needs some optimization. A couple of ideas:


I didn't see anywhere near that kind of regression. On unpatched master,
with your test case, I saw it stabilize to about 680ms. With
similar-escape-1, I saw about 775ms (15% regression). Are those at all
close to your numbers? Is there a chance you used an unoptimized build
for one of them, or left asserts enabled?


Oh. I can't now reproduce my earlier results either, I must've messed up 
something. I'm now seeing similar numbers as you.



Attached a patch implementing the same idea though: only use the
multibyte path if *both* the escape char and the current character from
the pattern are multibyte.

I also changed the comment to more clearly state the behavior upon which
we're relying. I hope what I said is accurate.


s/the the/the/. Other than that, looks good to me.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-26 Thread Jeff Davis
On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote:
 Actually, that gets optimized to a constant in the planner:

Oops, thank you (and Tom).

 your patch seems to be about 2x-3x as slow as unpatched master. So this 
 needs some optimization. A couple of ideas:

I didn't see anywhere near that kind of regression. On unpatched master,
with your test case, I saw it stabilize to about 680ms. With
similar-escape-1, I saw about 775ms (15% regression). Are those at all
close to your numbers? Is there a chance you used an unoptimized build
for one of them, or left asserts enabled?

 1. If the escape string is in fact a single-byte character, you can 
 proceed with the loop just as it is today, without the pg_mblen calls.
 
 2. Since pg_mblen() will always return an integer between 1-6, it would 
 probably be faster to replace the memcpy() and memcmp() calls with 
 simple for-loops iterating byte-by-byte.
 
 In very brief testing, with the 1. change above, the performance with 
 this patch is back to what it's without the patch. See attached.

The particular patch has a mistake: the first branch is always taken
because pg_mblen() won't return 0. It's also fairly ugly to set mblen in
the test for the branch that doesn't use it.

Attached a patch implementing the same idea though: only use the
multibyte path if *both* the escape char and the current character from
the pattern are multibyte.

I also changed the comment to more clearly state the behavior upon which
we're relying. I hope what I said is accurate.

Regards,
Jeff Davis

*** a/src/backend/utils/adt/regexp.c
--- b/src/backend/utils/adt/regexp.c
***
*** 688,698  similar_escape(PG_FUNCTION_ARGS)
  		elen = VARSIZE_ANY_EXHDR(esc_text);
  		if (elen == 0)
  			e = NULL;			/* no escape character */
! 		else if (elen != 1)
! 			ereport(ERROR,
! 	(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
! 	 errmsg(invalid escape string),
!   errhint(Escape string must be empty or one character.)));
  	}
  
  	/*--
--- 688,703 
  		elen = VARSIZE_ANY_EXHDR(esc_text);
  		if (elen == 0)
  			e = NULL;			/* no escape character */
! 		else
! 		{
! 			int			escape_mblen = pg_mbstrlen_with_len(e, elen);
! 
! 			if (escape_mblen  1)
! ereport(ERROR,
! 		(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
! 		 errmsg(invalid escape string),
! 		 errhint(Escape string must be empty or one character.)));
! 		}
  	}
  
  	/*--
***
*** 724,729  similar_escape(PG_FUNCTION_ARGS)
--- 729,782 
  	{
  		char		pchar = *p;
  
+ 		/*
+ 		 * If both the escape character and the current character from the
+ 		 * pattern are multi-byte, we need to take the slow path.
+ 		 *
+ 		 * But if one of them is single-byte, we can process the the pattern
+ 		 * one byte at a time, ignoring multi-byte characters.  (This works
+ 		 * because all server-encodings have the property that a valid
+ 		 * multi-byte character representation cannot contain the
+ 		 * representation of a valid single-byte character.)
+ 		 */
+ 
+ 		if (elen  1)
+ 		{
+ 			int mblen = pg_mblen(p);
+ 			if (mblen  1)
+ 			{
+ /* slow, multi-byte path */
+ if (afterescape)
+ {
+ 	*r++ = '\\';
+ 	memcpy(r, p, mblen);
+ 	r += mblen;
+ 	afterescape = false;
+ }
+ else if (e  elen == mblen  memcmp(e, p, mblen) == 0)
+ {
+ 	/* SQL99 escape character; do not send to output */
+ 	afterescape = true;
+ }
+ else
+ {
+ 	/*
+ 	 * We know it's a multi-byte character, so we don't need
+ 	 * to do all the comparisons to single-byte characters
+ 	 * that we do below.
+ 	 */
+ 	memcpy(r, p, mblen);
+ 	r += mblen;
+ }
+ 
+ p += mblen;
+ plen -= mblen;
+ 
+ continue;
+ 			}
+ 		}
+ 
+ 		/* fast path */
  		if (afterescape)
  		{
  			if (pchar == ''  !incharclass)	/* for SUBSTRING patterns */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-26 Thread Jeff Davis
On Tue, 2014-08-26 at 22:13 -0700, Jeff Davis wrote:
 Attached a patch implementing the same idea though: only use the
 multibyte path if *both* the escape char and the current character from
 the pattern are multibyte.

Forgot to mention: with this patch, the test completes in about 720ms,
so just between unpatched and the v1 patch.

I feel like we're spending too much time optimizing this patch, but we
got this far, and it doesn't really complicate the code, so we might as
well finish it.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-25 Thread Heikki Linnakangas

On 07/12/2014 05:16 AM, Jeff Davis wrote:

I was able to see about a 2% increase in runtime when using the
similar_escape function directly. I made a 10M tuple table and did:

 explain analyze
   select
similar_escape('','#')
 from t;

which was the worst reasonable case I could think of. (It appears that
selecting from a table is faster than from generate_series. I'm curious
what you use when testing the performance of an individual function at
the SQL level.)


A large table like that is what I usually do. A large generate_series() 
spends a lot of time building the tuplestore, especially if it doesn't 
fit in work_mem and spills to disk. Sometimes I use this to avoid it:


explain analyze
  select
similar_escape('','#') 
from generate_series(1, 1) a, generate_series(1,1000);


although in my experience it still has somewhat more overhead than a 
straight seqscan because.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-25 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 07/12/2014 05:16 AM, Jeff Davis wrote:
 I was able to see about a 2% increase in runtime when using the
 similar_escape function directly. I made a 10M tuple table and did:
 
 explain analyze
 select
 similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
  from t;
 
 which was the worst reasonable case I could think of. (It appears that
 selecting from a table is faster than from generate_series. I'm curious
 what you use when testing the performance of an individual function at
 the SQL level.)

 A large table like that is what I usually do. A large generate_series() 
 spends a lot of time building the tuplestore, especially if it doesn't 
 fit in work_mem and spills to disk. Sometimes I use this to avoid it:

 explain analyze
select
 similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
  
 from generate_series(1, 1) a, generate_series(1,1000);

 although in my experience it still has somewhat more overhead than a 
 straight seqscan because.

[ scratches head... ]  Surely similar_escape is marked immutable, and
will therefore be executed exactly once in either of these formulations,
because the planner will fold the expression to a constant.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-25 Thread Heikki Linnakangas

On 07/12/2014 05:16 AM, Jeff Davis wrote:

On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:

Jeff Davis pg...@j-davis.com writes:

Attached is a small patch to $SUBJECT.
In master, only single-byte characters are allowed as an escape. Of
course, with the patch it must still be a single character, but it may
be multi-byte.


I'm concerned about the performance cost of this patch.  Have you done
any measurements about what kind of overhead you are putting on the
inner loop of similar_escape?


I didn't consider this very performance critical, because this is
looping through the pattern, which I wouldn't expect to be a long
string. On my machine using en_US.UTF-8, the difference is imperceptible
for a SIMILAR TO ... ESCAPE query.

I was able to see about a 2% increase in runtime when using the
similar_escape function directly. I made a 10M tuple table and did:

 explain analyze
   select
similar_escape('','#')
 from t;

which was the worst reasonable case I could think of.


Actually, that gets optimized to a constant in the planner:

postgres=# explain  verbose select
similar_escape('','#') 
from t;
QUERY PLAN 




--
 Seq Scan on public.t  (cost=0.00..144247.85 rows=985 width=0)
   Output: 
'^(?:

)$'::text
 Planning time: 0.033 ms
(3 rows)

With a working test case:

create table t (pattern text);
insert into t select 
'' from 
generate_series(1, 100);

vacuum t;

explain (analyze) select similar_escape(pattern,'#') from t;

your patch seems to be about 2x-3x as slow as unpatched master. So this 
needs some optimization. A couple of ideas:


1. If the escape string is in fact a single-byte character, you can 
proceed with the loop just as it is today, without the pg_mblen calls.


2. Since pg_mblen() will always return an integer between 1-6, it would 
probably be faster to replace the memcpy() and memcmp() calls with 
simple for-loops iterating byte-by-byte.


In very brief testing, with the 1. change above, the performance with 
this patch is back to what it's without the patch. See attached.


- Heikki

diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index caf45ef..dd46325 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -688,11 +688,16 @@ similar_escape(PG_FUNCTION_ARGS)
 		elen = VARSIZE_ANY_EXHDR(esc_text);
 		if (elen == 0)
 			e = NULL;			/* no escape character */
-		else if (elen != 1)
-			ereport(ERROR,
-	(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
-	 errmsg(invalid escape string),
-  errhint(Escape string must be empty or one character.)));
+		else
+		{
+			int			escape_mblen = pg_mbstrlen_with_len(e, elen);
+
+			if (escape_mblen  1)
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
+		 errmsg(invalid escape string),
+		 errhint(Escape string must be empty or one character.)));
+		}
 	}
 
 	/*--
@@ -723,59 +728,94 @@ similar_escape(PG_FUNCTION_ARGS)
 	while (plen  0)
 	{
 		char		pchar = *p;
+		int			mblen;
 
-		if (afterescape)
+		/*
+		 * If the escape string is single-byte character, we can process the
+		 * the pattern one byte at a time, ignoring multi-byte characters.
+		 * (This works because all server-encodings have the property that the
+		 * a non-first byte of a multi-byte characters always has the high-bit
+		 * set, and hence we cannot be fooled by a byte in the middle of a
+		 * multi-byte character.)
+		 */
+		if (elen == 1 || (mblen = pg_mblen(p)))
 		{
-			if (pchar == ''  !incharclass)	/* for SUBSTRING patterns */
-*r++ = ((nquotes++ % 2) == 0) ? '(' : ')';
-			else
+			if (afterescape)
+			{
+if (pchar == ''  !incharclass)	/* for SUBSTRING patterns */
+	*r++ = ((nquotes++ % 2) == 0) ? '(' : ')';
+else
+{
+	*r++ = '\\';
+	*r++ = pchar;
+}
+afterescape = false;
+			}
+			else if (e  pchar == *e)
+			{
+/* SQL99 escape character; do not send to output */
+afterescape = true;
+			}
+			else if (incharclass)
+			{
+if (pchar == '\\')
+	*r++ = '\\';
+*r++ = pchar;
+if (pchar == ']')
+	incharclass = false;
+			}
+			else if (pchar == '[')
+			{
+*r++ = pchar;
+incharclass = true;
+			}
+			else if (pchar == '%')
+			{
+*r++ = '.';
+*r++ = '*';
+			}
+			else if (pchar == '_')
+*r++ = '.';
+			else if (pchar == '(')
+			{
+/* convert to non-capturing parenthesis */
+*r++ = '(';
+*r++ = '?';
+*r++ = ':';
+			}
+			else if (pchar == '\\' || pchar == '.' ||
+	 pchar == '^' || pchar == '$')
 			{
 *r++ = '\\';
 *r++ = pchar;
 			}
-			afterescape = false;
-		}
-		else if 

Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-25 Thread Heikki Linnakangas

On 08/25/2014 04:48 PM, Tom Lane wrote:

Heikki Linnakangas hlinnakan...@vmware.com writes:

On 07/12/2014 05:16 AM, Jeff Davis wrote:

I was able to see about a 2% increase in runtime when using the
similar_escape function directly. I made a 10M tuple table and did:

explain analyze
select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
 from t;

which was the worst reasonable case I could think of. (It appears that
selecting from a table is faster than from generate_series. I'm curious
what you use when testing the performance of an individual function at
the SQL level.)



A large table like that is what I usually do. A large generate_series()
spends a lot of time building the tuplestore, especially if it doesn't
fit in work_mem and spills to disk. Sometimes I use this to avoid it:



explain analyze
select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
from generate_series(1, 1) a, generate_series(1,1000);



although in my experience it still has somewhat more overhead than a
straight seqscan because.


[ scratches head... ]  Surely similar_escape is marked immutable, and
will therefore be executed exactly once in either of these formulations,
because the planner will fold the expression to a constant.


Yeah, just noticed that myself..

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-25 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 08/25/2014 04:48 PM, Tom Lane wrote:
 [ scratches head... ]  Surely similar_escape is marked immutable, and
 will therefore be executed exactly once in either of these formulations,
 because the planner will fold the expression to a constant.

 Yeah, just noticed that myself..

... although, given that, it is also fair to wonder how much the speed of
similar_escape really matters.  Surely in most use-cases the pattern and
escape char will be constants.  And, when they are not, wouldn't the
subsequent parsing work for the regexes dominate the cost of
similar_escape anyway?

IOW, I'm not sure we should be advising Jeff to duplicate code in
order to have a fast path.  Keeping it short might be the best goal.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-25 Thread Heikki Linnakangas

On 08/25/2014 06:14 PM, Tom Lane wrote:

Heikki Linnakangas hlinnakan...@vmware.com writes:

On 08/25/2014 04:48 PM, Tom Lane wrote:

[ scratches head... ]  Surely similar_escape is marked immutable, and
will therefore be executed exactly once in either of these formulations,
because the planner will fold the expression to a constant.



Yeah, just noticed that myself..


... although, given that, it is also fair to wonder how much the speed of
similar_escape really matters.  Surely in most use-cases the pattern and
escape char will be constants.  And, when they are not, wouldn't the
subsequent parsing work for the regexes dominate the cost of
similar_escape anyway?

IOW, I'm not sure we should be advising Jeff to duplicate code in
order to have a fast path.  Keeping it short might be the best goal.


It's certainly not worth bending over backwards for a small performance 
gain here, but I think special-casing a single-byte escape sequence is 
still quite reasonable. It doesn't make the code any longer.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-07-11 Thread Jeff Davis
On Fri, 2014-07-11 at 14:41 +0900, Fujii Masao wrote:
 Could you add the patch into next CF?

Sure. The patch is so small I was thinking about committing it in a few
days (assuming no complaints), but I'm in no hurry.

 The patch doesn't contain the change of the document. But I think that
 it's better to document what character is allowed as escape in LIKE,
 SIMILAR TO and SUBSTRING.

It should be assumed that multi-byte characters are allowed nearly
everywhere, and we should document the places where only single-byte
characters are allowed.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-07-11 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 Attached is a small patch to $SUBJECT.
 In master, only single-byte characters are allowed as an escape. Of
 course, with the patch it must still be a single character, but it may
 be multi-byte.

I'm concerned about the performance cost of this patch.  Have you done
any measurements about what kind of overhead you are putting on the
inner loop of similar_escape?

At the very least, please don't call GetDatabaseEncoding() over again
every single time through the inner loop.  More generally, why do we
need to use pg_encoding_verifymb?  The input data is presumably validly
encoded.  ISTM the significantly cheaper pg_mblen() would be more
appropriate.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-07-11 Thread Jeff Davis
On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
 Jeff Davis pg...@j-davis.com writes:
  Attached is a small patch to $SUBJECT.
  In master, only single-byte characters are allowed as an escape. Of
  course, with the patch it must still be a single character, but it may
  be multi-byte.
 
 I'm concerned about the performance cost of this patch.  Have you done
 any measurements about what kind of overhead you are putting on the
 inner loop of similar_escape?

I didn't consider this very performance critical, because this is
looping through the pattern, which I wouldn't expect to be a long
string. On my machine using en_US.UTF-8, the difference is imperceptible
for a SIMILAR TO ... ESCAPE query.

I was able to see about a 2% increase in runtime when using the
similar_escape function directly. I made a 10M tuple table and did:

explain analyze
  select 
similar_escape('','#')
 from t;

which was the worst reasonable case I could think of. (It appears that
selecting from a table is faster than from generate_series. I'm curious
what you use when testing the performance of an individual function at
the SQL level.)

 At the very least, please don't call GetDatabaseEncoding() over again
 every single time through the inner loop.  More generally, why do we
 need to use pg_encoding_verifymb?  The input data is presumably validly
 encoded.  ISTM the significantly cheaper pg_mblen() would be more
 appropriate.

Thank you. Using the non-verifying variants reduces the penalty in the
above test to 1%.

If needed, we could optimize further through code specialization, as
like_escape() does. Though I think like_escape() is specialized just
because MatchText() is specialized; and MatchText is specialized because
it operates on the actual data, not just the pattern. So I don't see a
reason to specialize similar_escape().

Regards,
Jeff Davis

*** a/src/backend/utils/adt/regexp.c
--- b/src/backend/utils/adt/regexp.c
***
*** 688,698  similar_escape(PG_FUNCTION_ARGS)
  		elen = VARSIZE_ANY_EXHDR(esc_text);
  		if (elen == 0)
  			e = NULL;			/* no escape character */
! 		else if (elen != 1)
! 			ereport(ERROR,
! 	(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
! 	 errmsg(invalid escape string),
!   errhint(Escape string must be empty or one character.)));
  	}
  
  	/*--
--- 688,703 
  		elen = VARSIZE_ANY_EXHDR(esc_text);
  		if (elen == 0)
  			e = NULL;			/* no escape character */
! 		else
! 		{
! 			int escape_mblen = pg_mbstrlen_with_len(e, elen);
! 
! 			if (escape_mblen  1)
! ereport(ERROR,
! 		(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
! 		 errmsg(invalid escape string),
! 		 errhint(Escape string must be empty or one character.)));
! 		}
  	}
  
  	/*--
***
*** 722,781  similar_escape(PG_FUNCTION_ARGS)
  
  	while (plen  0)
  	{
! 		char		pchar = *p;
  
! 		if (afterescape)
  		{
! 			if (pchar == ''  !incharclass)	/* for SUBSTRING patterns */
! *r++ = ((nquotes++ % 2) == 0) ? '(' : ')';
! 			else
  			{
  *r++ = '\\';
  *r++ = pchar;
  			}
! 			afterescape = false;
! 		}
! 		else if (e  pchar == *e)
! 		{
! 			/* SQL99 escape character; do not send to output */
! 			afterescape = true;
  		}
! 		else if (incharclass)
  		{
! 			if (pchar == '\\')
  *r++ = '\\';
! 			*r++ = pchar;
! 			if (pchar == ']')
! incharclass = false;
! 		}
! 		else if (pchar == '[')
! 		{
! 			*r++ = pchar;
! 			incharclass = true;
! 		}
! 		else if (pchar == '%')
! 		{
! 			*r++ = '.';
! 			*r++ = '*';
! 		}
! 		else if (pchar == '_')
! 			*r++ = '.';
! 		else if (pchar == '(')
! 		{
! 			/* convert to non-capturing parenthesis */
! 			*r++ = '(';
! 			*r++ = '?';
! 			*r++ = ':';
! 		}
! 		else if (pchar == '\\' || pchar == '.' ||
!  pchar == '^' || pchar == '$')
! 		{
! 			*r++ = '\\';
! 			*r++ = pchar;
  		}
- 		else
- 			*r++ = pchar;
- 		p++, plen--;
  	}
  
  	*r++ = ')';
--- 727,813 
  
  	while (plen  0)
  	{
! 		char	pchar = *p;
! 		int		mblen = pg_mblen(p);
  
! 		if (mblen == 1)
  		{
! 			if (afterescape)
! 			{
! if (pchar == ''  !incharclass)	/* for SUBSTRING patterns */
! 	*r++ = ((nquotes++ % 2) == 0) ? '(' : ')';
! else
! {
! 	*r++ = '\\';
! 	*r++ = pchar;
! }
! afterescape = false;
! 			}
! 			else if (e  pchar == *e)
! 			{
! /* SQL99 escape character; do not send to output */
! afterescape = true;
! 			}
! 			else if (incharclass)
! 			{
! if (pchar == '\\')
! 	*r++ = '\\';
! *r++ = pchar;
! if (pchar == ']')
! 	incharclass = false;
! 			}
! 			else if (pchar == '[')
! 			{
! *r++ = pchar;
! incharclass = true;
! 			}
! 			else if (pchar == '%')
! 			{
! *r++ = '.';
! *r++ = '*';
! 			}
! 			else if (pchar == '_')
! *r++ = '.';
! 			else if (pchar == '(')
! 			{
! /* convert to non-capturing parenthesis */
! *r++ = '(';
! *r++ = '?';
! *r++ = 

Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-07-10 Thread Fujii Masao
On Fri, Jul 11, 2014 at 12:49 PM, Jeff Davis pg...@j-davis.com wrote:
 Attached is a small patch to $SUBJECT.

 In master, only single-byte characters are allowed as an escape. Of
 course, with the patch it must still be a single character, but it may
 be multi-byte.

+1

Probably you know that, multi-byte character is already allowed
as escape in LIKE. There seems no reason why we don't support
multi-byte escape character in SIMILAR TO and SUBSTRING.

Could you add the patch into next CF?

The patch doesn't contain the change of the document. But I think that
it's better to document what character is allowed as escape in LIKE,
SIMILAR TO and SUBSTRING.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers