Re: [PATCHES] UTF8MatchText

2007-05-21 Thread db
 Doh, you're right ... but on third thought, what happens with a pattern
 containing %_?  If % tries to advance bytewise then we'll be trying to
 apply NextChar in the middle of a data character, and bad things ensue.

Right, when you have '_' after a '%' you need to make sure the '%'
advances full characters. In my suggestion, the test whether '_' (or '\')
comes after the '%' is done once, and it selects which of the two loops to
use: the one that does byte stepping or the one with NextChar.

It's difficult to know for sure that we have thought about all the corner
cases. I hope the gain is worth the effort.. :-)
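The two-loop idea can be sketched in C. This is a minimal single-byte illustration only, not the actual like_match.c code: the function name `match` and the `memchr` scan are mine, and in a real multibyte build the second loop would advance with NextChar rather than a byte at a time.

```c
#include <string.h>

/*
 * Minimal single-byte LIKE matcher sketching the suggestion: when a '%'
 * is seen, test once whether '_' or '\\' appears in the rest of the
 * pattern, and pick one of two scanning loops.  Here both loops advance
 * a byte at a time because every character is one byte; in a multibyte
 * build the second loop would use NextChar instead of NextByte.
 */
static int
match(const char *t, int tlen, const char *p, int plen)
{
    while (tlen > 0 && plen > 0)
    {
        if (*p == '%')
        {
            p++, plen--;
            if (memchr(p, '_', plen) == NULL &&
                memchr(p, '\\', plen) == NULL)
            {
                /* byte-stepping loop: no '_' ahead, so we can't get out of sync */
                for (;;)
                {
                    if (match(t, tlen, p, plen))
                        return 1;
                    if (tlen == 0)
                        return 0;
                    t++, tlen--;    /* NextByte */
                }
            }
            else
            {
                /* character-stepping loop */
                for (;;)
                {
                    if (match(t, tlen, p, plen))
                        return 1;
                    if (tlen == 0)
                        return 0;
                    t++, tlen--;    /* NextChar in the multibyte case */
                }
            }
        }
        if (*p == '\\')
        {
            p++, plen--;            /* escaped literal follows */
            if (plen == 0 || *p != *t)
                return 0;
        }
        else if (*p != '_' && *p != *t)
            return 0;
        t++, tlen--;
        p++, plen--;
    }
    while (plen > 0 && *p == '%')
        p++, plen--;
    return tlen == 0 && plen == 0;
}
```

For example, match("hello", 5, "h%o", 3) succeeds via the byte-stepping loop, while match("hello", 5, "h%_o", 4) would take the character-stepping loop because of the '_' after the '%'.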

/Dennis

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PATCHES] UTF8MatchText

2007-05-21 Thread Andrew Dunstan



[EMAIL PROTECTED] wrote:

Doh, you're right ... but on third thought, what happens with a pattern
containing %_?  If % tries to advance bytewise then we'll be trying to
apply NextChar in the middle of a data character, and bad things ensue.



Right, when you have '_' after a '%' you need to make sure the '%'
advances full characters. In my suggestion, the test whether '_' (or '\')
comes after the '%' is done once, and it selects which of the two loops to
use: the one that does byte stepping or the one with NextChar.

It's difficult to know for sure that we have thought about all the corner
cases. I hope the gain is worth the effort.. :-)


  


Yes, I came to the same conclusion about how to restructure the code.

The current code contains this:

    while (tlen > 0)
    {
        /*
         * Optimization to prevent most recursion: don't recurse
         * unless first pattern char might match this text char.
         */
        if (CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
        {
            int         matched = MatchText(t, tlen, p, plen);

            if (matched != LIKE_FALSE)
                return matched; /* TRUE or ABORT */
        }

        NextChar(t, tlen);
    }


The code appears to date from v 1.23 of like.c way back in 2001. I'm not 
sure I agree with the comment, though. In the first place, the invariant 
tests should not be in the loop, I think, and I'll hoist them out as 
Dennis suggests. But why are we doing that CHAREQ? If it succeeds we'll 
just do it again when we recurse, I think.


cheers

andrew



Re: [PATCHES] UTF8MatchText

2007-05-21 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 But why are we doing that CHAREQ?

To avoid the cost of the recursive call, just like it says.

 If it succeeds we'll 
 just do it again when we recurse, I think.

If you move the other two cases then you could advance t and p before
entering the recursion.
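Tom's point can be sketched as follows; the function names are illustrative, not from like_match.c. Once the '\\' and '_' tests are hoisted out of the scan loop, the first pattern byte is known to be a plain literal, so when the inline comparison succeeds the recursion can start one position later in both strings:

```c
#include <string.h>

/*
 * Illustrative sketch only: scanning for a place to restart matching
 * after a '%'.  Because the first pattern byte is known to be a plain
 * literal (the '\\' and '_' cases were hoisted out of this loop), a
 * successful inline comparison lets us enter the recursion with both
 * pointers already advanced, instead of re-comparing inside the callee.
 */
static int
match_rest(const char *t, int tlen, const char *p, int plen)
{
    /* stand-in for recursing over the remainder of text and pattern */
    return tlen == plen && memcmp(t, p, tlen) == 0;
}

static int
scan_for_restart(const char *t, int tlen, const char *p, int plen)
{
    while (tlen > 0)
    {
        if (*t == *p)           /* cheap inline CHAREQ */
        {
            /* first chars already known equal: recurse past them */
            if (match_rest(t + 1, tlen - 1, p + 1, plen - 1))
                return 1;
        }
        t++, tlen--;            /* NextChar in the real code */
    }
    return 0;
}
```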

regards, tom lane



Re: [PATCHES] UTF8MatchText

2007-05-21 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan [EMAIL PROTECTED] writes:
  

But why are we doing that CHAREQ?



To avoid the cost of the recursive call, just like it says.

  
If it succeeds we'll 
just do it again when we recurse, I think.



If you move the other two cases then you could advance t and p before
entering the recursion.


  


Yeah. Since I have removed the _ case I believe it's now safe there to 
use BYTEEQ/NextByte, and since they are sufficiently cheap it's not 
worth worrying about.


Attached is a patch version that I think draws together all the threads 
of discussion so far. It's in fact quite a lot simpler than the existing 
code, with no special UTF8 case - this should improve LIKE/ILIKE 
processing for all charsets.


More eyeballs please for nasty corner cases.

cheers

andrew
? src/backend/utils/adt/.deps
Index: src/backend/utils/adt/like.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.68
diff -c -r1.68 like.c
*** src/backend/utils/adt/like.c	27 Feb 2007 23:48:08 -	1.68
--- src/backend/utils/adt/like.c	21 May 2007 16:50:48 -
***************
*** 28,48 ****
  #define LIKE_ABORT		(-1)
  
  
! static int	MatchText(char *t, int tlen, char *p, int plen);
! static int	MatchTextIC(char *t, int tlen, char *p, int plen);
! static int	MatchBytea(char *t, int tlen, char *p, int plen);
! static text *do_like_escape(text *, text *);
  
! static int	MBMatchText(char *t, int tlen, char *p, int plen);
! static int	MBMatchTextIC(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);
  
  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
--- 28,48 ----
  #define LIKE_ABORT		(-1)
  
  
! static int	SB_MatchText(char *t, int tlen, char *p, int plen);
! static text *SB_do_like_escape(text *, text *);
  
! static int	MB_MatchText(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);
  
+ static int	GenericMatchText(char *s, int slen, char* p, int plen);
+ static int	Generic_Text_IC_like(text *str, text *pat);
+ 
  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static inline int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
***************
*** 72,86 ****
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.	This should be revisited when we install better locale support.
-  *
-  * Note that MBMatchText and MBMatchTextIC do exactly the same thing now.
-  * Is it worth refactoring to avoid duplicated code?  They might become
-  * different again in the future.
   */
  
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
- #define ICHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
--- 72,84 ----
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.	This should be revisited when we install better locale support.
   */
  
+ #define NextByte(p, plen)	((p)++, (plen)--)
+ #define BYTEEQ(p1, p2)		(*(p1) == *(p2))
+ 
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
***************
*** 89,122 ****
  		 while (__l-- > 0) \
  			 *(dst)++ = *(src)++; \
  	   } while (0)
  
! #define MatchText	MBMatchText
! #define MatchTextIC MBMatchTextIC
  #define do_like_escape	MB_do_like_escape
  
  #include "like_match.c"
  
- #undef CHAREQ
- #undef ICHAREQ
- #undef NextChar
- #undef CopyAdvChar
- #undef MatchText
- #undef MatchTextIC
- #undef do_like_escape
- 
  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) (*(p1) == *(p2))
! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2)))
! #define NextChar(p, plen) ((p)++, (plen)--)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)
  
  #include "like_match.c"
  
! /* And some support for BYTEA */
! #define BYTEA_CHAREQ(p1, p2) (*(p1) == *(p2))
! #define BYTEA_NextChar(p, plen) ((p)++, (plen)--)
! #define BYTEA_CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)
  
  
  /*
   *	interface routines called by the function manager
--- 

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Dennis Bjorklund

Tom Lane skrev:

You could imagine trying to do
% a byte at a time (and indeed that's what I'd been thinking it did)
but that gets you out of sync which breaks the _ case.


It is only when you have a pattern like '%_' that this is a problem, and
we could detect this and do byte-by-byte stepping when it's not. Now we
check (*p == '\\') || (*p == '_') in each iteration when we scan over
characters for '%', and we could do it once and have different loops for
the two cases.


Other than this part that I think can be optimized I don't see anything 
wrong with the idea behind the patch. To make the '%' case fast might be 
an important optimization for a lot of use cases. It's not uncommon that 
'%' matches a bigger part of the string than the rest of the pattern.


It's easy to make a mistake when one is used to thinking about simple
fixed-size characters like ASCII. Strange that this simple topic can be
so difficult to think about... :-)


/Dennis



Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan



Dennis Bjorklund wrote:

Tom Lane skrev:

You could imagine trying to do
% a byte at a time (and indeed that's what I'd been thinking it did)
but that gets you out of sync which breaks the _ case.


It is only when you have a pattern like '%_' that this is a problem
and we could detect this and do byte-by-byte stepping when it's not. Now
we check (*p == '\\') || (*p == '_') in each iteration when we scan over
characters for '%', and we could do it once and have different loops
for the two cases.


Other than this part that I think can be optimized I don't see 
anything wrong with the idea behind the patch. To make the '%' case 
fast might be an important optimization for a lot of use cases. It's 
not uncommon that '%' matches a bigger part of the string than the 
rest of the pattern.





Are you sure? The big remaining char-matching bottleneck will surely be 
in the code that scans for a place to start matching a %. But that's 
exactly where we can't use byte matching for cases where the charset 
might include AB and BA as characters - the pattern might contain %BA 
and the string AB. However, this isn't a danger for UTF8, which leads me 
to think that we do indeed need a special case for UTF8, but for a 
different improvement from that proposed in the original patch. I'll 
post an updated patch shortly.
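The out-of-sync hazard is easy to demonstrate with a hypothetical two-byte charset (not a real encoding) in which both "AB" and "BA" are valid characters; the helper name below is mine:

```c
#include <string.h>

/*
 * Hypothetical two-byte charset in which "AB" and "BA" are both valid
 * characters (this is NOT a real encoding, just an illustration).  The
 * data "ABAB" is two "AB" characters; a byte-wise search for the pattern
 * character "BA" finds a hit at offset 1, straddling the boundary
 * between the two data characters -- a false, out-of-sync match.
 */
static int
byte_search_offset(const char *data, const char *needle)
{
    const char *hit = strstr(data, needle);

    return hit ? (int) (hit - data) : -1;
}
```

byte_search_offset("ABAB", "BA") reports a hit at offset 1, the middle of the first character, whereas a character-wise scan would correctly find no "BA" character at all.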


cheers

andrew



Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan



I wrote:





It is only when you have a pattern like '%_' that this is a problem
and we could detect this and do byte-by-byte stepping when it's not. Now
we check (*p == '\\') || (*p == '_') in each iteration when we scan over
characters for '%', and we could do it once and have different loops
for the two cases.


Other than this part that I think can be optimized I don't see 
anything wrong with the idea behind the patch. To make the '%' case 
fast might be an important optimization for a lot of use cases. It's 
not uncommon that '%' matches a bigger part of the string than the 
rest of the pattern.





Are you sure? The big remaining char-matching bottleneck will surely 
be in the code that scans for a place to start matching a %. But 
that's exactly where we can't use byte matching for cases where the 
charset might include AB and BA as characters - the pattern might 
contain %BA and the string AB. However, this isn't a danger for UTF8, 
which leads me to think that we do indeed need a special case for 
UTF8, but for a different improvement from that proposed in the 
original patch. I'll post an updated patch shortly.




Here is a patch that implements this. Please analyse for possible breakage.

cheers

andrew





Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan


oops. patch attached this time


Andrew Dunstan wrote:



I wrote:





It is only when you have a pattern like '%_' that this is a problem
and we could detect this and do byte-by-byte stepping when it's not.
Now we check (*p == '\\') || (*p == '_') in each iteration when we scan
over characters for '%', and we could do it once and have different
loops for the two cases.


Other than this part that I think can be optimized I don't see 
anything wrong with the idea behind the patch. To make the '%' case 
fast might be an important optimization for a lot of use cases. It's 
not uncommon that '%' matches a bigger part of the string than the 
rest of the pattern.





Are you sure? The big remaining char-matching bottleneck will surely 
be in the code that scans for a place to start matching a %. But 
that's exactly where we can't use byte matching for cases where the 
charset might include AB and BA as characters - the pattern might 
contain %BA and the string AB. However, this isn't a danger for UTF8, 
which leads me to think that we do indeed need a special case for 
UTF8, but for a different improvement from that proposed in the 
original patch. I'll post an updated patch shortly.




Here is a patch that implements this. Please analyse for possible 
breakage.


cheers

andrew




Index: src/backend/utils/adt/like.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.68
diff -c -r1.68 like.c
*** src/backend/utils/adt/like.c	27 Feb 2007 23:48:08 -	1.68
--- src/backend/utils/adt/like.c	20 May 2007 14:16:22 -
***************
*** 28,48 ****
  #define LIKE_ABORT		(-1)
  
  
! static int	MatchText(char *t, int tlen, char *p, int plen);
! static int	MatchTextIC(char *t, int tlen, char *p, int plen);
! static int	MatchBytea(char *t, int tlen, char *p, int plen);
! static text *do_like_escape(text *, text *);
  
! static int	MBMatchText(char *t, int tlen, char *p, int plen);
! static int	MBMatchTextIC(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);
  
  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
--- 28,51 ----
  #define LIKE_ABORT		(-1)
  
  
! static int	SB_MatchText(char *t, int tlen, char *p, int plen);
! static int	SB_MatchTextIC(char *t, int tlen, char *p, int plen);
! static text *SB_do_like_escape(text *, text *);
  
! static int	MB_MatchText(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);
  
+ static int	UTF8_MatchText(char *t, int tlen, char *p, int plen);
+ 
+ static int	GenericMatchText(char *s, int slen, char* p, int plen);
+ static int	mbtexticlike(text *str, text *pat);
+ 
  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static inline int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
***************
*** 72,86 ****
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.	This should be revisited when we install better locale support.
-  *
-  * Note that MBMatchText and MBMatchTextIC do exactly the same thing now.
-  * Is it worth refactoring to avoid duplicated code?  They might become
-  * different again in the future.
   */
  
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
- #define ICHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
--- 75,87 ----
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.	This should be revisited when we install better locale support.
   */
  
+ #define NextByte(p, plen)	((p)++, (plen)--)
+ #define BYTEEQ(p1, p2)		(*(p1) == *(p2))
+ 
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
***************
*** 89,122 ****
  		 while (__l-- > 0) \
  			 *(dst)++ = *(src)++; \
  	   } while (0)
  
! #define MatchText	MBMatchText
! #define MatchTextIC MBMatchTextIC
  #define do_like_escape	MB_do_like_escape
  
  

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Are you sure? The big remaining char-matching bottleneck will surely 
 be in the code that scans for a place to start matching a %. But 
 that's exactly where we can't use byte matching for cases where the 
 charset might include AB and BA as characters - the pattern might 
 contain %BA and the string AB. However, this isn't a danger for UTF8, 
 which leads me to think that we do indeed need a special case for 
 UTF8, but for a different improvement from that proposed in the 
 original patch. I'll post an updated patch shortly.

 Here is a patch that implements this. Please analyse for possible 
 breakage.

On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?

It strikes me that we may be overcomplicating matters in another way
too.  If you believe that the %-scan code is now the bottleneck, that
is, the key loop is where we have pattern '%foo' and we are trying to
match 'f' to each successive data position, then you should be bothered
that SB_MatchTextIC is applying tolower() to 'f' again for each data
character.  Worst-case we could have O(N^2) applications of tolower()
during a match.  I think there's a fair case to be made that we should
get rid of SB_MatchTextIC and implement *all* the case-insensitive
variants by means of an initial lower() call.  This would leave us with
just two match functions and allow considerable unification of the setup
logic.

regards, tom lane



Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan



Tom Lane wrote:

On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?
  


Possibly - IIRC I looked at that and there was some reason I didn't, but 
I'll look again.



It strikes me that we may be overcomplicating matters in another way
too.  If you believe that the %-scan code is now the bottleneck, that
is, the key loop is where we have pattern '%foo' and we are trying to
match 'f' to each successive data position, then you should be bothered
that SB_MatchTextIC is applying tolower() to 'f' again for each data
character.  Worst-case we could have O(N^2) applications of tolower()
during a match.  I think there's a fair case to be made that we should
get rid of SB_MatchTextIC and implement *all* the case-insensitive
variants by means of an initial lower() call.  This would leave us with
just two match functions and allow considerable unification of the setup
logic.


  



Yeah, quite possibly. I'm also wondering if we are wasting effort 
downcasing what will in most cases be the same pattern over and over 
again. Maybe we need to look at memoizing that somehow, or at least test 
to see if that would be a gain.


We're getting quite a long way from the original patch :-)

cheers

andrew



Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Yeah, quite possibly. I'm also wondering if we are wasting effort 
 downcasing what will in most cases be the same pattern over and over 
 again. Maybe we need to look at memoizing that somehow, or at least test 
 to see if that would be a gain.

Someone (Itagaki-san IIRC) suggested that we ought to convert
x ILIKE y into lower(x) LIKE lower(y) at some fairly early
stage, definitely before constant-folding in the planner.  That
would take care of that issue without any run-time mechanism,
and would open opportunities for making use of an index on lower(x).

I recall thinking at the time that there were some potential downsides,
but right at the moment I'm darned if I can see any --- especially
if we're going to make ILIKE do this uniformly at runtime anyway.

regards, tom lane



Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan [EMAIL PROTECTED] writes:
  
Yeah, quite possibly. I'm also wondering if we are wasting effort 
downcasing what will in most cases be the same pattern over and over 
again. Maybe we need to look at memoizing that somehow, or at least test 
to see if that would be a gain.



Someone (Itagaki-san IIRC) suggested that we ought to convert
x ILIKE y into lower(x) LIKE lower(y) at some fairly early
stage, definitely before constant-folding in the planner.  That
would take care of that issue without any run-time mechanism,
and would open opportunities for making use of an index on lower(x).

I recall thinking at the time that there were some potential downsides,
but right at the moment I'm darned if I can see any --- especially
if we're going to make ILIKE do this uniformly at runtime anyway.


  
Sounds like a TODO item. I'm already concerned a bit about scope creep 
for this item.


cheers

andrew



Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Andrew Dunstan [EMAIL PROTECTED] writes:
 Yeah, quite possibly. I'm also wondering if we are wasting effort 
 downcasing what will in most cases be the same pattern over and over 
 again. Maybe we need to look at memoizing that somehow, or at least test 
 to see if that would be a gain.
 
 Someone (Itagaki-san IIRC) suggested that we ought to convert
 x ILIKE y into lower(x) LIKE lower(y) at some fairly early
 stage, definitely before constant-folding in the planner.
 
 Sounds like a TODO item. I'm already concerned a bit about scope creep 
 for this item.

Agreed, I don't want to tackle this right now --- I'm just suggesting
it's probably a better answer than memoizing at runtime.

regards, tom lane



Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan



Tom Lane wrote:


On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?
  


We still call NextChar() after _, and I think we probably need to, 
don't we? If so we can't just marry the cases.



cheers

andrew



Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 On the strength of this analysis, shouldn't we drop the separate
 UTF8 match function and just use SB_MatchText for UTF8?

 We still call NextChar() after _, and I think we probably need to, 
 don't we? If so we can't just marry the cases.

Doh, you're right ... but on third thought, what happens with a pattern
containing %_?  If % tries to advance bytewise then we'll be trying to
apply NextChar in the middle of a data character, and bad things ensue.

I think we need to go back to the scheme with SB_ and MB_ variants and
no special case for UTF8.

regards, tom lane



Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan [EMAIL PROTECTED] writes:
  

Tom Lane wrote:


On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?
  


  
We still call NextChar() after _, and I think we probably need to, 
don't we? If so we can't just marry the cases.



Doh, you're right ... but on third thought, what happens with a pattern
containing %_?  If % tries to advance bytewise then we'll be trying to
apply NextChar in the middle of a data character, and bad things ensue.

I think we need to go back to the scheme with SB_ and MB_ variants and
no special case for UTF8.


  


My head is spinning with all these variants. I'll look at it tomorrow.

cheers

andrew



Re: [PATCHES] UTF8MatchText

2007-05-18 Thread Andrew Dunstan



Tom Lane wrote:

ITAGAKI Takahiro [EMAIL PROTECTED] writes:
  

Yes, I only used the 'disjoint representations for first-bytes and
not-first-bytes of MB characters' feature in UTF8. Other encodings
allows both [AB] and [BA] for MB character patterns. UTF8Match() does
not cope with those encodings; If we have '[AB][AB]' in a table and
search it with LIKE '%[BA]%', we judge that they are matched by mistake.



AFAICS, the patch does *not* make that mistake because % will not
advance over a fractional character.


  


Unless I hear differently, my present intention is to apply the 
suggested improvement universally. I'll wait a day or two before 
completing the patch.


cheers

andrew



Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan


Itagaki,

I find this still fairly unclean. It certainly took me some time to get
my head around what's going on.


ISTM we should generate all these match functions from one body of code 
plus some #define magic.


As I understand it, we have three possible encoding switches:  Single 
Byte, UTF8 and other Multi Byte Charsets, and two possible case 
settings: case Sensitive and Case Insensitive. That would make for a 
total of six functions, but in the case of both UTF8 and other MBCS we 
don't need a special Case Insensitive function - instead we downcase 
both the string and the pattern and then use the Case Sensitive 
function. That leaves a total of four functions.


What is not clear to me is why the UTF8 optimisation works, and why it
doesn't apply to other MBCS. At the very least we need a comment on that.


I also find the existing function naming convention somewhat annoying - 
having foo() and MB_foo() is less than clear. I'd rather have SB_foo() 
and MB_foo(). That's not your fault, of course.


If you supply me with some explanation on the UTF8 optimisation issue, 
I'll prepare a revised patch along these lines.


cheers

andrew



ITAGAKI Takahiro wrote:

Bruce Momjian [EMAIL PROTECTED] wrote:

  

I do not understand this patch.  You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros identically:
  

Why are there two functions?  Also, can't you use one function and just
pass a boolean to indicate whether case should be ignored?



The same is true of MBMatchText() and MBMatchTextIC().
Now, I'll split the patch into two changes.

1. DropMBMatchTextIC.patch
Drop MBMatchTextIC() and use MBMatchText() instead.

2. UTF8MatchText.patch
Add UTF8MatchText() as a specialized version of MBMatchText().


As a future work, it might be good to research the performance of rewriting
col ILIKE 'pattern' to lower(col) LIKE lower('pattern') in the planner, so that
we can avoid calling lower() for a constant pattern on the right-hand side and
can use functional indexes (lower(col)). I think we will never need MBMatchTextIC()
in the future unless we move to a wide-character server encoding :)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

  




  




Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan


I wrote:



ISTM we should generate all these match functions from one body of 
code plus some #define magic.


As I understand it, we have three possible encoding switches:  Single 
Byte, UTF8 and other Multi Byte Charsets, and two possible case 
settings: case Sensitive and Case Insensitive. That would make for a 
total of six functions, but in the case of both UTF8 and other MBCS we 
don't need a special Case Insensitive function - instead we downcase 
both the string and the pattern and then use the Case Sensitive 
function. That leaves a total of four functions.


What is not clear to me is why the UTF8 optimisation works, and why it
doesn't apply to other MBCS. At the very least we need a comment on that.


I also find the existing function naming convention somewhat annoying 
- having foo() and MB_foo() is less than clear. I'd rather have 
SB_foo() and MB_foo(). That's not your fault, of course.


If you supply me with some explanation on the UTF8 optimisation issue, 
I'll prepare a revised patch along these lines.





Ok, I have studied some more and I think I understand what's going on. 
AIUI, we are switching from some expensive char-wise comparisons to 
cheap byte-wise comparisons in the UTF8 case because we know that in 
UTF8 the magic characters ('_', '%' and '\') aren't a part of any other 
character sequence. Is that putting it too mildly? Do we need stronger 
conditions than that? If it's correct, are there other MBCS for which 
this is true?


cheers

andrew





Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Ok, I have studied some more and I think I understand what's going on. 
 AIUI, we are switching from some expensive char-wise comparisons to 
 cheap byte-wise comparisons in the UTF8 case because we know that in 
 UTF8 the magic characters ('_', '%' and '\') aren't a part of any other 
 character sequence. Is that putting it too mildly? Do we need stronger 
 conditions than that? If it's correct, are there other MBCS for which 
 this is true?

I don't think this is a correct analysis.  If it were correct then we
could use the optimization for all backend charsets because none of them
allow MB characters to contain non-high-bit-set bytes.  But it was
stated somewhere upthread that that doesn't actually work.  Clearly
it's a necessary property that we not falsely detect the magic pattern
characters, but that's not sufficient.

I think the real issue is that UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that property, we have to use the slow way to
ensure we don't make out-of-sync matches.
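The property Tom describes is directly visible in the UTF-8 byte layout: continuation bytes always look like 10xxxxxx, and no first byte does. A small sketch (helper names are mine, not PostgreSQL's):

```c
/*
 * In UTF-8, continuation bytes are always 10xxxxxx (0x80-0xBF), while
 * first bytes are either ASCII (0xxxxxxx) or lead bytes (11xxxxxx).
 * The two sets are disjoint, so a pattern character's byte sequence can
 * never match the tail of one data character plus the head of the next:
 * any byte-wise match that starts on a first byte stays aligned to
 * character boundaries.
 */
static int
utf8_is_continuation_byte(unsigned char b)
{
    return (b & 0xC0) == 0x80;  /* 10xxxxxx */
}

static int
utf8_is_first_byte(unsigned char b)
{
    return !utf8_is_continuation_byte(b);
}
```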

regards, tom lane



Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Wait a second ... I just thought of a counterexample that destroys the
entire concept.  Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*.  With the proposed patch in
place, it would match strings of four *bytes*.  Which is not the correct
behavior.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan



Tom Lane wrote:

UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that property, we have to use the slow way to
ensure we don't make out-of-sync matches.


  


Thanks. I will include this info in the comments.

cheers

andrew

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan



Tom Lane wrote:

Wait a second ... I just thought of a counterexample that destroys the
entire concept.  Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*.  With the proposed patch in
place, it would match strings of four *bytes*.  Which is not the correct
behavior.

  


From what I can see the code is quite careful about when it calls 
NextByte vs NextChar, and after _ it calls NextChar.


So I'll test for this, but I think it's safe.

cheers

andrew



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Wait a second ... I just thought of a counterexample that destroys the
 entire concept.  Consider the pattern 'A__B', which clearly is supposed
 to match strings of four *characters*.  With the proposed patch in
 place, it would match strings of four *bytes*.  Which is not the correct
 behavior.

  From what I can see the code is quite careful about when it calls 
 NextByte vs NextChar, and after _ it calls NextChar.

Except that the entire point of this patch is to dumb down NextChar to
be the same as NextByte for UTF8 strings.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan [EMAIL PROTECTED] writes:
  

Tom Lane wrote:


Wait a second ... I just thought of a counterexample that destroys the
entire concept.  Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*.  With the proposed patch in
place, it would match strings of four *bytes*.  Which is not the correct
behavior.
  


  
 From what I can see the code is quite careful about when it calls 
NextByte vs NextChar, and after _ it calls NextChar.



Except that the entire point of this patch is to dumb down NextChar to
be the same as NextByte for UTF8 strings.


  


That's not what I see in (what I think is) the latest submission, which 
includes this snippet:


+ /* Set up for utf8 characters */
+ #define CHAREQ(p1, p2)	wchareq(p1, p2)
+ #define NextChar(p, plen) \
+   do { int __l = pg_utf_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
+
+ /*
+  * UTF8MatchText -- specialized version of MBMatchText for UTF8
+  */
+ static int
+ UTF8MatchText(char *t, int tlen, char *p, int plen)

Am I looking at the wrong thing? This is from around April 9th I think.


cheers

andrew


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Except that the entire point of this patch is to dumb down NextChar to
 be the same as NextByte for UTF8 strings.

 That's not what I see in (what I think is) the latest submission, which 
 includes this snippet:

[ scratches head... ]  OK, then I think I totally missed what this patch
is trying to accomplish; because this code looks just the same as the
existing multibyte-character operations.  Where does the performance
improvement come from?

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan [EMAIL PROTECTED] writes:
  

Tom Lane wrote:


Except that the entire point of this patch is to dumb down NextChar to
be the same as NextByte for UTF8 strings.
  


  
That's not what I see in (what I think is) the latest submission, which 
includes this snippet:



[ scratches head... ]  OK, then I think I totally missed what this patch
is trying to accomplish; because this code looks just the same as the
existing multibyte-character operations.  Where does the performance
improvement come from?


  


That's what bothered me. The trouble is that we have so much code that 
looks *almost* identical.


From my WIP patch, here's where the difference appears to be - note 
that UTF8 branch has two NextByte calls at the bottom, unlike the other 
branch:



#ifdef UTF8_OPT
	/*
	 * UTF8 is optimised to do byte at a time matching in most cases,
	 * thus saving expensive calls to NextChar.
	 *
	 * UTF8 has disjoint representations for first-bytes and
	 * not-first-bytes of MB characters, and thus it is
	 * impossible to make a false match in which an MB pattern
	 * character is matched to the end of one data character
	 * plus the start of another.
	 * In character sets without that property, we have to use the
	 * slow way to ensure we don't make out-of-sync matches.
	 */
	else if (*p == '_')
	{
		NextChar(t, tlen);
		NextByte(p, plen);
		continue;
	}
	else if (!BYTEEQ(t, p))
	{
		/*
		 * Not the single-character wildcard and no explicit match? Then
		 * time to quit...
		 */
		return LIKE_FALSE;
	}

	NextByte(t, tlen);
	NextByte(p, plen);
#else
	/*
	 * Branch for non-utf8 multi-byte charsets and also for single-byte
	 * charsets which don't gain any benefit from the above optimisation.
	 */
	else if ((*p != '_') && !CHAREQ(t, p))
	{
		/*
		 * Not the single-character wildcard and no explicit match? Then
		 * time to quit...
		 */
		return LIKE_FALSE;
	}

	NextChar(t, tlen);
	NextChar(p, plen);
#endif   /* UTF8_OPT */


cheers

andrew



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 From my WIP patch, here's where the difference appears to be - note 
 that UTF8 branch has two NextByte calls at the bottom, unlike the other 
 branch:

Oh, I see: NextChar is still real but the patch is willing to have t
and p pointing into the middle of an MB character.  That's a bit
risky.  I think it works but it's making at least the following
undocumented assumptions:

* At a pattern backslash, it applies CHAREQ() but then advances
byte-by-byte over the matched characters (implicitly assuming that none
of these bytes will look like the magic characters).  While that works
for backend-safe encodings, it seems a bit strange; you've already paid
the price of determining the character length once, not to mention
matching the bytes of the characters once, and then throw that knowledge
away.  I think BYTEEQ would make more sense in the backslash path.

* At pattern % or _, it's critical that we do NextChar not NextByte
on the data side.  Else t is pointing into the middle of an MB sequence
when p isn't, and we have various out-of-sync conditions to worry about,
notably possibly calling NextChar when t is not pointing at the first
byte of the character, which will result in a wrong answer about the
character length.

* We *must* do NextChar not NextByte for _ since we have to match it to
exactly one logical character, not byte.  You could imagine trying to do
% a byte at a time (and indeed that's what I'd been thinking it did)
but that gets you out of sync which breaks the _ case.

So the actual optimization here is that we do bytewise comparison and
advancing, but only when we are either at the start of a character
(on both sides, and the pattern char is not wildcard) or we are in the
middle of a character (on both sides) and we've already proven that both
sides matched for the previous byte(s) of the character.

On the strength of this closer reading, I would say that the patch isn't
relying on UTF8's first-byte-vs-not-first-byte property after all.
All that it's relying on is that no MB character is a prefix of another
one, which seems like a necessary property for any sane encoding; plus
that characters are considered equal only if they're bytewise equal.
So are we sure it doesn't work for non-UTF8 encodings?  Maybe that
earlier conclusion was based on a misunderstanding of what the patch
really does.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan



Tom Lane wrote:


* At a pattern backslash, it applies CHAREQ() but then advances
byte-by-byte over the matched characters (implicitly assuming that none
of these bytes will look like the magic characters).  While that works
for backend-safe encodings, it seems a bit strange; you've already paid
the price of determining the character length once, not to mention
matching the bytes of the characters once, and then throw that knowledge
away.  I think BYTEEQ would make more sense in the backslash path.
  


Probably, although the use of CHAREQ is in the present code.

Is it legal to follow escape by anything other than _ % or escape?



So the actual optimization here is that we do bytewise comparison and
advancing, but only when we are either at the start of a character
(on both sides, and the pattern char is not wildcard) or we are in the
middle of a character (on both sides) and we've already proven that both
sides matched for the previous byte(s) of the character.
  


I think that's correct.


On the strength of this closer reading, I would say that the patch isn't
relying on UTF8's first-byte-vs-not-first-byte property after all.
All that it's relying on is that no MB character is a prefix of another
one, which seems like a necessary property for any sane encoding; plus
that characters are considered equal only if they're bytewise equal.
So are we sure it doesn't work for non-UTF8 encodings?  Maybe that
earlier conclusion was based on a misunderstanding of what the patch
really does.


  



Indeed.

One more thing - I'm thinking of rolling up the bytea matching routines 
as well as the text routines to eliminate all the duplication of logic. 
I can do it by a little type casting from bytea* to text* and back 
again, or if that's not acceptable by some preprocessor magic. I think 
the casting is likely to be safe enough in this case - I don't think a 
null byte will hurt us anywhere in this code - and presumably the 
varlena stuff is all the same. Does that sound reasonable?



cheers

andrew

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Is it legal to follow escape by anything other than _ % or escape?

Certainly, but once you've compared the first byte you can handle any
remaining bytes via the main loop.  And in fact the code is already
depending on being able to do that --- the use of CHAREQ rather than
BYTEEQ is just wasted cycles.

 One more thing - I'm thinking of rolling up the bytea matching routines 
 as well as the text routines to eliminate all the duplication of logic. 

+1, I think, but I wonder why we had the duplication in the first
place.  Is there any likelihood that bytea's semantics should diverge
from the text case?

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread ITAGAKI Takahiro

Tom Lane [EMAIL PROTECTED] wrote:

 On the strength of this closer reading, I would say that the patch isn't
 relying on UTF8's first-byte-vs-not-first-byte property after all.
 All that it's relying on is that no MB character is a prefix of another
 one, which seems like a necessary property for any sane encoding; plus
 that characters are considered equal only if they're bytewise equal.
 So are we sure it doesn't work for non-UTF8 encodings?  Maybe that
 earlier conclusion was based on a misunderstanding of what the patch
 really does.

Yes, I only used the 'disjoint representations for first-bytes and
not-first-bytes of MB characters' feature in UTF8. Other encodings
allow both [AB] and [BA] as MB character patterns. UTF8Match() does
not cope with those encodings; if we have '[AB][AB]' in a table and
search it with LIKE '%[BA]%', we judge that they match by mistake.

I've also misunderstood this before, and Dennis Bjorklund pointed it out:
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01377.php

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
ITAGAKI Takahiro [EMAIL PROTECTED] writes:
 Yes, I only used the 'disjoint representations for first-bytes and
 not-first-bytes of MB characters' feature in UTF8. Other encodings
 allow both [AB] and [BA] as MB character patterns. UTF8Match() does
 not cope with those encodings; if we have '[AB][AB]' in a table and
 search it with LIKE '%[BA]%', we judge that they match by mistake.

AFAICS, the patch does *not* make that mistake because % will not
advance over a fractional character.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan



Tom Lane wrote:

ITAGAKI Takahiro [EMAIL PROTECTED] writes:
  

Yes, I only used the 'disjoint representations for first-bytes and
not-first-bytes of MB characters' feature in UTF8. Other encodings
allow both [AB] and [BA] as MB character patterns. UTF8Match() does
not cope with those encodings; if we have '[AB][AB]' in a table and
search it with LIKE '%[BA]%', we judge that they match by mistake.



AFAICS, the patch does *not* make that mistake because % will not
advance over a fractional character.

  



Yeah, I think that's right.

Attached is my current WIP patch. If we decide that this optimisation 
can in fact be applied to all backend encodings, that will be easily 
incorporated. It will simplify the code further. Note that all the 
common code in the MatchText and do_like_escape functions has been 
factored - and the bytea functions just call the single-byte text 
versions - AFAICS the effect will be identical to having the specialised 
versions. (I'm always happy when code volume can be reduced.)


cheers

andrew

Index: src/backend/utils/adt/like.c
===
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.68
diff -c -r1.68 like.c
*** src/backend/utils/adt/like.c	27 Feb 2007 23:48:08 -	1.68
--- src/backend/utils/adt/like.c	18 May 2007 02:47:41 -
***
*** 28,48 
  #define LIKE_ABORT		(-1)
  
  
! static int	MatchText(char *t, int tlen, char *p, int plen);
! static int	MatchTextIC(char *t, int tlen, char *p, int plen);
! static int	MatchBytea(char *t, int tlen, char *p, int plen);
! static text *do_like_escape(text *, text *);
  
! static int	MBMatchText(char *t, int tlen, char *p, int plen);
! static int	MBMatchTextIC(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);
  
  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
--- 28,50 
  #define LIKE_ABORT		(-1)
  
  
! static int	SB_MatchText(char *t, int tlen, char *p, int plen);
! static int	SB_MatchTextIC(char *t, int tlen, char *p, int plen);
! static text *SB_do_like_escape(text *, text *);
  
! static int	MB_MatchText(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);
  
+ static int	UTF8_MatchText(char *t, int tlen, char *p, int plen);
+ static int	GenericMatchText(char *s, int slen, char* p, int plen);
+ static int	mbtexticlike(text *str, text *pat);
+ 
  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static __inline__ int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
***
*** 72,86 
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.	This should be revisited when we install better locale support.
-  *
-  * Note that MBMatchText and MBMatchTextIC do exactly the same thing now.
-  * Is it worth refactoring to avoid duplicated code?  They might become
-  * different again in the future.
   */
  
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
- #define ICHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
--- 74,86 
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.	This should be revisited when we install better locale support.
   */
  
+ #define NextByte(p, plen)	((p)++, (plen)--)
+ #define BYTEEQ(p1, p2)		(*(p1) == *(p2))
+ 
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
***
*** 90,122 
  			 *(dst)++ = *(src)++; \
  	   } while (0)
  
! #define MatchText	MBMatchText
! #define MatchTextIC MBMatchTextIC
  #define do_like_escape	MB_do_like_escape
  
  #include "like_match.c"
  
- #undef CHAREQ
- #undef ICHAREQ
- #undef NextChar
- #undef CopyAdvChar
- #undef MatchText
- #undef MatchTextIC
- #undef do_like_escape
- 
  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) (*(p1) == *(p2))
! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2)))
! #define NextChar(p, plen) ((p)++, (plen)--)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)
  
  

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Attached is my current WIP patch.

A few quick eyeball comments:

 ! static __inline__ int

Under *no* circumstances use __inline__, as it will certainly break
every non-gcc compiler.  Use inline, which we #define appropriately
at need.

 !  * UTF8 has disjoint representations for first-bytes and 
 !  * not-first-bytes of MB characters, and thus it is
 !  * impossible to make a false match in which an MB pattern 
 !  * character is matched to the end of one data character 
 !  * plus the start of another.
 !  * In character sets without that property, we have to use the 
 !  * slow way to ensure we don't make out-of-sync matches.

I thought we'd concluded that this explanation is pseudo-science?

 !  * Branch for non-utf8 multi-byte charsets and also for 
 single-byte
 !  * charsets which don't gain any benefir from the above 
 optimisation.

spellcheck...

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan



Tom Lane wrote:

Under *no* circumstances use __inline__, as it will certainly break
every non-gcc compiler.  Use inline, which we #define appropriately
at need.
  


OK. (this was from upstream patch.)
  
I thought we'd concluded that this explanation is pseudo-science?


  

[...]


spellcheck...


  


Right. I'm waiting on a consensus about the UTF8-ness of the solution 
before I adjust comments.


cheers

andrew

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] UTF8MatchText

2007-04-09 Thread Bruce Momjian

Patch removed, updated version submitted.

---

ITAGAKI Takahiro wrote:
 Andrew - Supernews [EMAIL PROTECTED] wrote:
 
   ITAGAKI I think all safe ASCII-supersets encodings are comparable
   ITAGAKI by bytes, not only UTF-8.
  
  This is false, particularly for EUC.
 
 Umm, I see. I updated the optimization to be used only for UTF8 case.
 I also added some inlining hints that are useful on my machine (Pentium 4).
 
 
 x1000 of LIKE '%foo%' on 1 rows tables [ms]
  encoding  | HEAD  |  P1   |  P2   |  P3  
 ---+---+---+---+---
  SQL_ASCII |  7094 |  7120 |  7063 |  7031
  LATIN1|  7083 |  7130 |  7057 |  7031
  UTF8  | 17974 | 10859 | 10839 |  9682
  EUC_JP| 17032 | 17557 | 17599 | 15240
 
 - P1: UTF8MatchText()
 - P2: P1 + __inline__ GenericMatchText()
 - P3: P2 + __inline__ wchareq()
   (The attached patch is P3.)
 
 Regards,
 ---
 ITAGAKI Takahiro
 NTT Open Source Software Center
 

[ Attachment, skipping... ]

 
 ---(end of broadcast)---
 TIP 7: You can help support the PostgreSQL project by donating at
 
 http://www.postgresql.org/about/donate

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] UTF8MatchText

2007-04-09 Thread Bruce Momjian

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---


ITAGAKI Takahiro wrote:
 Bruce Momjian [EMAIL PROTECTED] wrote:
 
   I do not understand this patch.  You have defined two functions,
   UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
   is that one calls CHAREQ and the other calls ICHAREQ, but just above
   those two functions you define the macros identically:
  
  Why are there two functions?  Also, can't you use one function and just
  pass a boolean to indicate whether case should be ignored?
 
 The same is true of MBMatchText() and MBMatchTextIC().
 Now, I'll split the patch into two changes.
 
 1. DropMBMatchTextIC.patch
 Drop MBMatchTextIC() and use MBMatchText() instead.
 
 2. UTF8MatchText.patch
 Add UTF8MatchText() as a specialized version of MBMatchText().
 
 
 As a future work, it might be good to research the performance of rewriting
 col ILIKE 'pattern' to lower(col) LIKE lower('pattern') in planner so that
 we can avoid to call lower() for constant pattern in the right-hand side and
 can use functional indexes (lower(col)). I think we never need MBMatchTextIC()
 in the future unless we move to wide-character server encoding :)
 
 Regards,
 ---
 ITAGAKI Takahiro
 NTT Open Source Software Center
 

[ Attachment, skipping... ]

[ Attachment, skipping... ]

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] UTF8MatchText

2007-04-08 Thread ITAGAKI Takahiro
Bruce Momjian [EMAIL PROTECTED] wrote:

  I do not understand this patch.  You have defined two functions,
  UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
  is that one calls CHAREQ and the other calls ICHAREQ, but just above
  those two functions you define the macros identically:
 
 Why are there two functions?  Also, can't you use one function and just
 pass a boolean to indicate whether case should be ignored?

The same is true of MBMatchText() and MBMatchTextIC().
Now, I'll split the patch into two changes.

1. DropMBMatchTextIC.patch
Drop MBMatchTextIC() and use MBMatchText() instead.

2. UTF8MatchText.patch
Add UTF8MatchText() as a specialized version of MBMatchText().


As a future work, it might be good to research the performance of rewriting
col ILIKE 'pattern' to lower(col) LIKE lower('pattern') in planner so that
we can avoid to call lower() for constant pattern in the right-hand side and
can use functional indexes (lower(col)). I think we never need MBMatchTextIC()
in the future unless we move to wide-character server encoding :)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



DropMBMatchTextIC.patch
Description: Binary data


UTF8MatchText.patch
Description: Binary data

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PATCHES] UTF8MatchText

2007-04-06 Thread Bruce Momjian

I do not understand this patch.  You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros identically:

#define CHAREQ(p1, p2)	wchareq(p1, p2)
#define ICHAREQ(p1, p2)	wchareq(p1, p2)

Why are there two functions?  Also, can't you use one function and just
pass a boolean to indicate whether case it be ignored?

---

ITAGAKI Takahiro wrote:
 Andrew - Supernews [EMAIL PROTECTED] wrote:
 
   ITAGAKI I think all safe ASCII-supersets encodings are comparable
   ITAGAKI by bytes, not only UTF-8.
  
  This is false, particularly for EUC.
 
 Umm, I see. I updated the optimization to be used only for UTF8 case.
 I also added some inlining hints that are useful on my machine (Pentium 4).
 
 
 x1000 of LIKE '%foo%' on 1 rows tables [ms]
  encoding  | HEAD  |  P1   |  P2   |  P3  
 ---+---+---+---+---
  SQL_ASCII |  7094 |  7120 |  7063 |  7031
  LATIN1|  7083 |  7130 |  7057 |  7031
  UTF8  | 17974 | 10859 | 10839 |  9682
  EUC_JP| 17032 | 17557 | 17599 | 15240
 
 - P1: UTF8MatchText()
 - P2: P1 + __inline__ GenericMatchText()
 - P3: P2 + __inline__ wchareq()
   (The attached patch is P3.)
 
 Regards,
 ---
 ITAGAKI Takahiro
 NTT Open Source Software Center
 

[ Attachment, skipping... ]

 
 ---(end of broadcast)---
 TIP 7: You can help support the PostgreSQL project by donating at
 
 http://www.postgresql.org/about/donate

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] UTF8MatchText

2007-04-06 Thread Bruce Momjian
Bruce Momjian wrote:
 
 I do not understand this patch.  You have defined two functions,
 UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
 is that one calls CHAREQ and the other calls ICHAREQ, but just above
 those two functions you define the macros identically:
 
  #define CHAREQ(p1, p2)	wchareq(p1, p2)
  #define ICHAREQ(p1, p2)	wchareq(p1, p2)
 
 Why are there two functions?  Also, can't you use one function and just
 pass a boolean to indicate whether case it be ignored?

Sorry, typo:

Why are there two functions?  Also, can't you use one function and just
pass a boolean to indicate whether case should be ignored?
--

 
 ---
 
 ITAGAKI Takahiro wrote:
  Andrew - Supernews [EMAIL PROTECTED] wrote:
  
ITAGAKI I think all safe ASCII-supersets encodings are comparable
ITAGAKI by bytes, not only UTF-8.
   
   This is false, particularly for EUC.
  
  Umm, I see. I updated the optimization to be used only for UTF8 case.
  I also added some inlining hints that are useful on my machine (Pentium 4).
  
  
  x1000 of LIKE '%foo%' on 1 rows tables [ms]
   encoding  | HEAD  |  P1   |  P2   |  P3  
  ---+---+---+---+---
   SQL_ASCII |  7094 |  7120 |  7063 |  7031
   LATIN1|  7083 |  7130 |  7057 |  7031
   UTF8  | 17974 | 10859 | 10839 |  9682
   EUC_JP| 17032 | 17557 | 17599 | 15240
  
  - P1: UTF8MatchText()
  - P2: P1 + __inline__ GenericMatchText()
  - P3: P2 + __inline__ wchareq()
(The attached patch is P3.)
  
  Regards,
  ---
  ITAGAKI Takahiro
  NTT Open Source Software Center
  
 
 [ Attachment, skipping... ]
 
  
  ---(end of broadcast)---
  TIP 7: You can help support the PostgreSQL project by donating at
  
  http://www.postgresql.org/about/donate
 
 -- 
   Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
   EnterpriseDB   http://www.enterprisedb.com
 
   + If your life is a hard drive, Christ can be your backup. +
 
 ---(end of broadcast)---
 TIP 5: don't forget to increase your free space map settings

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PATCHES] UTF8MatchText

2007-04-02 Thread Bruce Momjian

I assume this replaces all your earlier multi-byte LIKE patches.

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---


ITAGAKI Takahiro wrote:
 Andrew - Supernews [EMAIL PROTECTED] wrote:
 
   ITAGAKI I think all safe ASCII-supersets encodings are comparable
   ITAGAKI by bytes, not only UTF-8.
  
  This is false, particularly for EUC.
 
 Umm, I see. I updated the optimization to be used only for UTF8 case.
 I also added some inlining hints that are useful on my machine (Pentium 4).
 
 
 x1000 of LIKE '%foo%' on 1 rows tables [ms]
  encoding  | HEAD  |  P1   |  P2   |  P3  
 ---+---+---+---+---
  SQL_ASCII |  7094 |  7120 |  7063 |  7031
  LATIN1|  7083 |  7130 |  7057 |  7031
  UTF8  | 17974 | 10859 | 10839 |  9682
  EUC_JP| 17032 | 17557 | 17599 | 15240
 
 - P1: UTF8MatchText()
 - P2: P1 + __inline__ GenericMatchText()
 - P3: P2 + __inline__ wchareq()
   (The attached patch is P3.)
 
 Regards,
 ---
 ITAGAKI Takahiro
 NTT Open Source Software Center
 

[ Attachment, skipping... ]

 
 ---(end of broadcast)---
 TIP 7: You can help support the PostgreSQL project by donating at
 
 http://www.postgresql.org/about/donate

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings