Re: [HACKERS] LIKE optimization in UTF-8 and locale-C
On Fri, 2007-03-23 at 06:10, Andrew - Supernews wrote:
> On 2007-03-23, ITAGAKI Takahiro <[EMAIL PROTECTED]> wrote:
>> Thanks, it all made sense to me. My proposal was completely wrong.
>
> Actually, I think your proposal is fundamentally correct, merely
> incomplete. Doing octet-based rather than character-based matching of
> strings is a _design goal_ of UTF8. Treating UTF8 like any other
> multibyte charset and converting everything to wide-chars is, in my
> opinion, always going to result in suboptimal performance.

Yes, that was what I meant by proposing a UTF8-specific UTF8MatchText(),
which should not convert everything to wide chars, but instead do
byte-by-byte comparison and just be aware of the UTF-8 encoding, where it
is easy to know how wide (how many bytes/octets) each encoded character is.

-- 
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com
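For illustration: the byte width Hannu refers to can be read directly off the
lead byte, which is what makes a byte-wise UTF8MatchText() cheap. The sketch
below is illustration only; the helper name utf8_char_len is made up here and
does not appear in the posted patch.

#include <stddef.h>

/*
 * Number of octets in the UTF-8 character whose lead byte is c:
 * 1 for ASCII, 2-4 for multibyte lead bytes, 0 for a continuation
 * byte or an invalid lead byte.
 */
static size_t
utf8_char_len(unsigned char c)
{
    if (c < 0x80)
        return 1;               /* 0xxxxxxx: ASCII */
    if (c >= 0xC2 && c <= 0xDF)
        return 2;               /* 110xxxxx */
    if (c >= 0xE0 && c <= 0xEF)
        return 3;               /* 1110xxxx */
    if (c >= 0xF0 && c <= 0xF4)
        return 4;               /* 11110xxx */
    return 0;                   /* 10xxxxxx continuation, or invalid */
}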
Re: [HACKERS] LIKE optimization in UTF-8 and locale-C
ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
> I found LIKE operators are slower on multi-byte encoding databases than
> single-byte encoding ones. It comes from the difference between
> MatchText() and MBMatchText(). We've had an optimization for single-byte
> encodings using the pg_database_encoding_max_length() == 1 test. I
> propose to extend it to the UTF-8 with locale-C case.

If this works for UTF8, won't it work for all the backend-legal encodings?

			regards, tom lane
Re: [HACKERS] LIKE optimization in UTF-8 and locale-C
On Thu, 2007-03-22 at 11:08, Tom Lane wrote:
> ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
>> I found LIKE operators are slower on multi-byte encoding databases than
>> single-byte encoding ones. It comes from the difference between
>> MatchText() and MBMatchText(). We've had an optimization for single-byte
>> encodings using the pg_database_encoding_max_length() == 1 test. I
>> propose to extend it to the UTF-8 with locale-C case.
>
> If this works for UTF8, won't it work for all the backend-legal encodings?

I guess it works well for % but not for _ ; the latter has to know how many
bytes the current (multibyte) character covers.

The length is still easy to find out for the UTF-8 encoding, so it may be
feasible to write a UTF8MatchText() that is still faster than MBMatchText().

-- 
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com
Re: [HACKERS] LIKE optimization in UTF-8 and locale-C
Hannu Krosing <[EMAIL PROTECTED]> wrote:
>>> We've had an optimization for single-byte encodings using the
>>> pg_database_encoding_max_length() == 1 test. I propose to extend it
>>> to the UTF-8 with locale-C case.
>>
>> If this works for UTF8, won't it work for all the backend-legal
>> encodings?
>
> I guess it works well for % but not for _ ; the latter has to know how
> many bytes the current (multibyte) character covers.

Yes, % is not used in the trailing bytes of any encoding, but _ is used in
some of them. I think we can use the optimization for all of the server
encodings except JOHAB.

Also, I noticed that locale settings are not used in LIKE matching, so the
following is enough for checking the availability of the byte-wise matching
functions -- or am I missing something?

#define sb_match_available()	(GetDatabaseEncoding() != PG_JOHAB)

Multi-byte encodings supported as a server encoding:

               | % 0x25 | _ 0x5f | \ 0x5c |
---------------+--------+--------+--------+
 EUC_JP        | unused | unused | unused |
 EUC_CN        | unused | unused | unused |
 EUC_KR        | unused | unused | unused |
 EUC_TW        | unused | unused | unused |
 JOHAB         | unused | *used* | *used* |
 UTF8          | unused | unused | unused |
 MULE_INTERNAL | unused | unused | unused |

Just for reference, encodings only supported as a client encoding:

               | % 0x25 | _ 0x5f | \ 0x5c |
---------------+--------+--------+--------+
 SJIS          | unused | *used* | *used* |
 BIG5          | unused | *used* | *used* |
 GBK           | unused | *used* | *used* |
 UHC           | unused | unused | unused |
 GB18030       | unused | *used* | *used* |

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
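As a side note, the UTF8 row of the table can be sanity-checked without any
server machinery: every trailing byte of a UTF-8 multibyte character is a
continuation byte in 0x80..0xBF, while %, _ and \ are plain ASCII. A
standalone sketch of that check, for illustration only (not part of the
posted patch):

#include <assert.h>

int
main(void)
{
    int b;

    /*
     * Every trailing byte of a UTF-8 multibyte character lies in the
     * continuation range 0x80..0xBF, so it can never be mistaken for
     * the ASCII metacharacters that LIKE cares about.
     */
    for (b = 0x80; b <= 0xBF; b++)
    {
        assert(b != '%');       /* 0x25 */
        assert(b != '_');       /* 0x5F */
        assert(b != '\\');      /* 0x5C */
    }
    return 0;
}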
Re: [HACKERS] LIKE optimization in UTF-8 and locale-C
Dennis Bjorklund <[EMAIL PROTECTED]> wrote:
> The problem with the LIKE pattern _ is that it has to know how long the
> single character is that it should pass over. Say you have a UTF-8
> string with 2 characters encoded in 3 bytes ('ÖA'), where the first
> character is 2 bytes:
>
> 0xC3 0x96 'A'
>
> and now you want to match that with the LIKE pattern: '_A'

Thanks, it all made sense to me. My proposal was completely wrong.
The optimization of MBMatchText() seems to be the right way...

> Maybe one should simply write a special version of LIKE for the UTF-8
> encoding, since it's probably the most used encoding today. But I don't
> think you can use the C locale and have it work for UTF-8.

But then, the present LIKE matching is not locale-aware; we treat multi-byte
characters properly, but always perform a char-by-char comparison.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Re: [HACKERS] LIKE optimization in UTF-8 and locale-C
ITAGAKI Takahiro wrote:
>> I guess it works well for % but not for _ ; the latter has to know how
>> many bytes the current (multibyte) character covers.
>
> Yes, % is not used in the trailing bytes of any encoding, but _ is used
> in some of them. I think we can use the optimization for all of the
> server encodings except JOHAB.

The problem with the LIKE pattern _ is that it has to know how long the
single character is that it should pass over. Say you have a UTF-8 string
with 2 characters encoded in 3 bytes ('ÖA'), where the first character is
2 bytes:

0xC3 0x96 'A'

and now you want to match that with the LIKE pattern: '_A'

How would that work in the C locale?

Maybe one should simply write a special version of LIKE for the UTF-8
encoding, since it's probably the most used encoding today. But I don't
think you can use the C locale and have it work for UTF-8.

/Dennis
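To make the example concrete: a byte-wise matcher can still handle _ if it
advances by the character length taken from the lead byte rather than by one
octet. The toy matcher below is a sketch only -- the names utf8_len and
utf8_like are made up here, escapes and error handling are ignored, and it is
not the MatchText()/MBMatchText() code from the patch.

#include <stdio.h>

/* Octets in the UTF-8 character whose lead byte is c (0 if invalid). */
static int
utf8_len(unsigned char c)
{
    if (c < 0x80)               return 1;
    if (c >= 0xC2 && c <= 0xDF) return 2;
    if (c >= 0xE0 && c <= 0xEF) return 3;
    if (c >= 0xF0 && c <= 0xF4) return 4;
    return 0;
}

/*
 * Toy LIKE matcher: byte-wise comparison, except that '_' consumes one
 * whole UTF-8 character.  Assumes valid UTF-8 input; no escape handling.
 */
static int
utf8_like(const char *s, const char *p)
{
    while (*p)
    {
        if (*p == '%')
        {
            p++;
            for (;;)
            {
                int n;

                if (utf8_like(s, p))
                    return 1;
                if (*s == '\0' || (n = utf8_len((unsigned char) *s)) == 0)
                    return 0;
                s += n;         /* retry the match one character later */
            }
        }
        else if (*p == '_')
        {
            int n = utf8_len((unsigned char) *s);

            if (*s == '\0' || n == 0)
                return 0;
            s += n;             /* '_' skips 1..4 octets: one character */
            p++;
        }
        else if (*s++ != *p++)
            return 0;
    }
    return *s == '\0';
}

int
main(void)
{
    const char *str = "\xC3\x96" "A";       /* 'ÖA': 2 characters, 3 octets */

    printf("%d\n", utf8_like(str, "_A"));   /* prints 1: '_' skips 2 octets */
    printf("%d\n", utf8_like(str, "_"));    /* prints 0: one character left */
    return 0;
}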
Re: [HACKERS] LIKE optimization in UTF-8 and locale-C
On 2007-03-22, Tom Lane <[EMAIL PROTECTED]> wrote:
> ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
>> I found LIKE operators are slower on multi-byte encoding databases than
>> single-byte encoding ones. It comes from the difference between
>> MatchText() and MBMatchText(). We've had an optimization for single-byte
>> encodings using the pg_database_encoding_max_length() == 1 test. I
>> propose to extend it to the UTF-8 with locale-C case.
>
> If this works for UTF8, won't it work for all the backend-legal encodings?

It works for UTF8 only because UTF8 has special properties which are not
shared by, for example, EUC_*. Specifically, in UTF8 the octet sequence for
a multibyte character will never appear as a subsequence of the octet
sequence of a string of other multibyte characters; i.e., given a string of
two two-octet characters AB, the second octet of A plus the first octet of B
is not a valid UTF8 character (and likewise for longer characters).

(And while I haven't tested it, it looks like the patch posted doesn't
account properly for the use of _, so it needs a bit more work.)

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
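The subsequence property Andrew describes follows from UTF-8's byte ranges:
continuation bytes (0x80..0xBF) can never be lead bytes, so a scan that lands
in the middle of a character cannot re-parse as a different valid character.
A small illustrative check, with a made-up helper name:

#include <stdio.h>

/*
 * A byte can begin a UTF-8 character only if it is ASCII or a valid
 * multibyte lead byte; continuation bytes (0x80..0xBF) never can.
 */
static int
is_utf8_char_start(unsigned char c)
{
    return c < 0x80 || (c >= 0xC2 && c <= 0xF4);
}

int
main(void)
{
    /* Two two-octet characters A and B: 'Ö' (C3 96) followed by 'Ä' (C3 84). */
    unsigned char ab[] = {0xC3, 0x96, 0xC3, 0x84};

    /*
     * The "shifted" octet pair -- second octet of A plus first octet of B --
     * cannot begin a UTF-8 character, because 0x96 is a continuation byte.
     * This is why byte-wise scanning cannot produce a false match in UTF-8.
     */
    printf("shifted pair starts a character? %s\n",
           is_utf8_char_start(ab[1]) ? "yes" : "no");   /* prints "no" */
    return 0;
}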
Re: [HACKERS] LIKE optimization in UTF-8 and locale-C
On 2007-03-23, ITAGAKI Takahiro <[EMAIL PROTECTED]> wrote:
> Thanks, it all made sense to me. My proposal was completely wrong.

Actually, I think your proposal is fundamentally correct, merely incomplete.
Doing octet-based rather than character-based matching of strings is a
_design goal_ of UTF8. Treating UTF8 like any other multibyte charset and
converting everything to wide-chars is, in my opinion, always going to
result in suboptimal performance.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
[HACKERS] LIKE optimization in UTF-8 and locale-C
Hello,

I found LIKE operators are slower on multi-byte encoding databases than
single-byte encoding ones. It comes from the difference between MatchText()
and MBMatchText(). We've had an optimization for single-byte encodings using
the pg_database_encoding_max_length() == 1 test. I propose to extend it to
the UTF-8 with locale-C case.

All trailing bytes are different from first bytes in UTF-8 multi-byte
characters, so we can use the single-byte functions and byte-wise comparison
in that case. With the attached patch, the performance of UTF-8 LIKE
operators is pushed up to near that of the single-byte encodings.

Databases initialized with locale C are widely used in Japan, because the
Japanese locales are broken on almost all platforms. Japanese users can
choose EUC-JP or UTF-8 as a server encoding, but I think UTF-8 will be more
and more used in the future.

[test]
initdb --no-locale --encoding=<encoding>

[HEAD]
SQL_ASCII : 7171 ms / 7203 ms / 7187 ms
LATIN1    : 7172 ms / 7156 ms / 7141 ms
UTF8      : 16235 ms / 16281 ms / 16281 ms
EUC_JP    : 17454 ms / 17453 ms / 17438 ms

[with patch]
SQL_ASCII : 7062 ms / 7125 ms / 7125 ms
LATIN1    : 7047 ms / 7063 ms / 7047 ms
UTF8      : 7188 ms / 7234 ms / 7235 ms
EUC_JP    : 17468 ms / 17453 ms / 17453 ms

CREATE OR REPLACE FUNCTION test() RETURNS INTEGER AS $$
DECLARE
    cnt integer;
BEGIN
    FOR i IN 1..1000 LOOP
        SELECT count(*) INTO cnt FROM item
            WHERE i_title LIKE '%BABABABABARIBA%' LIMIT 50;
    END LOOP;
    RETURN cnt;
END;
$$ LANGUAGE plpgsql;

SELECT count(*) FROM item;  -- borrowed from DBT-1 (TPC-W)
 count
-------
     1
(1 row)

[patch]
Index: src/backend/utils/adt/like.c
===================================================================
--- src/backend/utils/adt/like.c	(head)
+++ src/backend/utils/adt/like.c	(working copy)
@@ -21,6 +21,7 @@
 
 #include "mb/pg_wchar.h"
 #include "utils/builtins.h"
+#include "utils/pg_locale.h"
 
 
 #define LIKE_TRUE						1
@@ -119,6 +120,13 @@
 
 
 /*
+ * true iff match functions for single-byte characters are available.
+ */
+#define sb_match_available() \
+	(pg_database_encoding_max_length() == 1 || \
+	 (lc_collate_is_c() && GetDatabaseEncoding() == PG_UTF8))
+
+/*
  * interface routines called by the function manager
  */
 
@@ -138,7 +146,7 @@
 	p = VARDATA(pat);
 	plen = (VARSIZE(pat) - VARHDRSZ);
 
-	if (pg_database_encoding_max_length() == 1)
+	if (sb_match_available())
 		result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
 	else
 		result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);
@@ -162,7 +170,7 @@
 	p = VARDATA(pat);
 	plen = (VARSIZE(pat) - VARHDRSZ);
 
-	if (pg_database_encoding_max_length() == 1)
+	if (sb_match_available())
 		result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
 	else
 		result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);
@@ -186,7 +194,7 @@
 	p = VARDATA(pat);
 	plen = (VARSIZE(pat) - VARHDRSZ);
 
-	if (pg_database_encoding_max_length() == 1)
+	if (sb_match_available())
 		result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
 	else
 		result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);
@@ -210,7 +218,7 @@
 	p = VARDATA(pat);
 	plen = (VARSIZE(pat) - VARHDRSZ);
 
-	if (pg_database_encoding_max_length() == 1)
+	if (sb_match_available())
 		result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
 	else
 		result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);
@@ -275,7 +283,7 @@
 	int			slen,
 				plen;
 
-	if (pg_database_encoding_max_length() == 1)
+	if (sb_match_available())
 	{
 		s = NameStr(*str);
 		slen = strlen(s);
@@ -316,7 +324,7 @@
 	int			slen,
 				plen;
 
-	if (pg_database_encoding_max_length() == 1)
+	if (sb_match_available())
 	{
 		s = NameStr(*str);
 		slen = strlen(s);
@@ -357,7 +365,7 @@
 	int			slen,
 				plen;
 
-	if (pg_database_encoding_max_length() == 1)
+	if (sb_match_available())
 	{
 		s = VARDATA(str);
 		slen = (VARSIZE(str) - VARHDRSZ);
@@ -393,7 +401,7 @@
 	int			slen,
 				plen;
 
-	if (pg_database_encoding_max_length() == 1)
+	if (sb_match_available())
 	{
 		s = VARDATA(str);
 		slen = (VARSIZE(str) - VARHDRSZ);
@@ -429,7 +437,7 @@
 	text	   *esc = PG_GETARG_TEXT_P(1);
 	text	   *result;
 
-	if (pg_database_encoding_max_length() == 1)
+	if (sb_match_available())
 		result = do_like_escape(pat, esc);
 	else