Re: [HACKERS] LIKE optimization in UTF-8 and locale-C

2007-03-25 Thread Hannu Krosing
Ühel kenal päeval, R, 2007-03-23 kell 06:10, kirjutas Andrew -
Supernews:
 On 2007-03-23, ITAGAKI Takahiro [EMAIL PROTECTED] wrote:
  Thanks, it all made sense to me. My proposal was completely wrong.
 
 Actually, I think your proposal is fundamentally correct, merely incomplete.
 
 Doing octet-based rather than character-based matching of strings is a
 _design goal_ of UTF8. Treating UTF8 like any other multibyte charset and
 converting everything to wide-chars is, in my opinion, always going to
 result in suboptimal performance.

Yes, that was what I meant by proposing a utf8 specific UTF8MatchText(),
which should not convert everything to wide char, but instead do
byte-by-byte comparison and just be aware of UTF encoding, where it is
easy to know how wide (how maby bytes/octets) each encoded character
takes.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] LIKE optimization in UTF-8 and locale-C

2007-03-22 Thread Tom Lane
ITAGAKI Takahiro [EMAIL PROTECTED] writes:
 I found LIKE operators are slower on multi-byte encoding databases
 than single-byte encoding ones. It comes from difference between
 MatchText() and MBMatchText().

 We've had an optimization for single-byte encodings using 
 pg_database_encoding_max_length() == 1 test. I'll propose to extend it
 in UTF-8 with locale-C case.

If this works for UTF8, won't it work for all the backend-legal
encodings?

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] LIKE optimization in UTF-8 and locale-C

2007-03-22 Thread Hannu Krosing
Ühel kenal päeval, N, 2007-03-22 kell 11:08, kirjutas Tom Lane:
 ITAGAKI Takahiro [EMAIL PROTECTED] writes:
  I found LIKE operators are slower on multi-byte encoding databases
  than single-byte encoding ones. It comes from difference between
  MatchText() and MBMatchText().
 
  We've had an optimization for single-byte encodings using 
  pg_database_encoding_max_length() == 1 test. I'll propose to extend it
  in UTF-8 with locale-C case.
 
 If this works for UTF8, won't it work for all the backend-legal
 encodings?

I guess it works well for % but not for _ , the latter has to know, how
many bytes the current (multibyte) character covers.

The length is still easy to find out for UTF8 encoding, so it may be
feasible to write UTF8MatchText() that is still faster than
MBMatchText().

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] LIKE optimization in UTF-8 and locale-C

2007-03-22 Thread ITAGAKI Takahiro
Hannu Krosing [EMAIL PROTECTED] wrote:

   We've had an optimization for single-byte encodings using 
   pg_database_encoding_max_length() == 1 test. I'll propose to extend it
   in UTF-8 with locale-C case.
  
  If this works for UTF8, won't it work for all the backend-legal
  encodings?
 
 I guess it works well for % but not for _ , the latter has to know, how
 many bytes the current (multibyte) character covers.

Yes, % is not used in trailing bytes for all encodings, but _ is
used in some of them. I think we can use the optimization for all
of the server encodings except JOHAB. 

Also, I took notice that locale settings are not used in LIKE matching,
so the following is enough for checking availability of byte-wise matching
functions. or am I missing something?

#define sb_match_available()(GetDatabaseEncoding() == PG_JOHAB))



Multi-byte encodings supported by a server encoding.

  | % 0x25 | _ 0x5f | \ 0x5c |
--++++-
EUC_JP| unused | unused | unused |
EUC_CN| unused | unused | unused |
EUC_KR| unused | unused | unused |
EUC_TW| unused | unused | unused |
JOHAB | unused | *used* | *used* |
UTF8  | unused | unused | unused |
MULE_INTERNAL | unused | unused | unused |

Just for reference, encodings only supported as a client encoding.

  | % 0x25 | _ 0x5f | \ 0x5c |
--++++-
SJIS  | unused | *used* | *used* |
BIG5  | unused | *used* | *used* |
GBK   | unused | *used* | *used* |
UHC   | unused | unused | unused |
GB18030   | unused | *used* | *used* |

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] LIKE optimization in UTF-8 and locale-C

2007-03-22 Thread ITAGAKI Takahiro

Dennis Bjorklund [EMAIL PROTECTED] wrote:

 The problem with the like pattern _ is that it has to know how long the 
 single caracter is that it should pass over. Say you have a UTF-8 string 
 with 2 characters encoded in 3 bytes ('ÖA'). Where the first character 
 is 2 bytes:
 
 0xC3 0x96 'A'
 
 and now you want to match that with the LIKE pattern:
 
 '_A'

Thanks, it all made sense to me. My proposal was completely wrong.
The optimization of MBMatchText() seems to be the right way...

 Maybe one should simply write a special version of LIKE for the UTF-8 
 encoding since it's probably the most used encoding today. But I don't 
 think you can use the C locale and that it would work for UTF-8.

But then, present LIKE matching is not locale aware. we treat multi-byte
characters properly, but always perform a char-by-char comparison.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] LIKE optimization in UTF-8 and locale-C

2007-03-22 Thread Dennis Bjorklund

ITAGAKI Takahiro skrev:


I guess it works well for % but not for _ , the latter has to know, how
many bytes the current (multibyte) character covers.


Yes, % is not used in trailing bytes for all encodings, but _ is
used in some of them. I think we can use the optimization for all
of the server encodings except JOHAB. 


The problem with the like pattern _ is that it has to know how long the 
single caracter is that it should pass over. Say you have a UTF-8 string 
with 2 characters encoded in 3 bytes ('ÖA'). Where the first character 
is 2 bytes:


0xC3 0x96 'A'

and now you want to match that with the LIKE pattern:

'_A'

How would that work in the C locale?

Maybe one should simply write a special version of LIKE for the UTF-8 
encoding since it's probably the most used encoding today. But I don't 
think you can use the C locale and that it would work for UTF-8.


/Dennis

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] LIKE optimization in UTF-8 and locale-C

2007-03-22 Thread Andrew - Supernews
On 2007-03-22, Tom Lane [EMAIL PROTECTED] wrote:
 ITAGAKI Takahiro [EMAIL PROTECTED] writes:
 I found LIKE operators are slower on multi-byte encoding databases
 than single-byte encoding ones. It comes from difference between
 MatchText() and MBMatchText().

 We've had an optimization for single-byte encodings using 
 pg_database_encoding_max_length() == 1 test. I'll propose to extend it
 in UTF-8 with locale-C case.

 If this works for UTF8, won't it work for all the backend-legal
 encodings?

It works for UTF8 only because UTF8 has special properties which are not
shared by, for example, EUC_*. Specifically, in UTF8 the octet sequence
for a multibyte character will never appear as a subsequence of the octet
sequence of a string of other multibyte characters. i.e. given a string
of two two-octet characters AB, the second octet of A plus the first octet
of B is not a valid UTF8 character (and likewise for longer characters).

(And while I haven't tested it, it looks like the patch posted doesn't
account properly for the use of _, so it needs a bit more work.)

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] LIKE optimization in UTF-8 and locale-C

2007-03-22 Thread Andrew - Supernews
On 2007-03-23, ITAGAKI Takahiro [EMAIL PROTECTED] wrote:
 Thanks, it all made sense to me. My proposal was completely wrong.

Actually, I think your proposal is fundamentally correct, merely incomplete.

Doing octet-based rather than character-based matching of strings is a
_design goal_ of UTF8. Treating UTF8 like any other multibyte charset and
converting everything to wide-chars is, in my opinion, always going to
result in suboptimal performance.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match