Re: [HACKERS] like/ilike improvements

2007-09-22 Thread Guillaume Smet
On 9/21/07, Andrew Dunstan [EMAIL PROTECTED] wrote: It applied cleanly for me. Yes, it seems something was screwed in my tree. I didn't notice you commited the patch I applied before Greg's patch. Anyway, I'm starting with a clean tree containing your fix and what Tom commited but I have to

Re: [HACKERS] like/ilike improvements

2007-09-22 Thread Guillaume Smet
On 9/22/07, Guillaume Smet [EMAIL PROTECTED] wrote: Anyway, I'm starting with a clean tree containing your fix and what Tom commited but I have to import the data again due to the catalog version bump :). I have some good news. After Andrew's and Greg's patches, CVS HEAD is as fast as 8.2 with

Re: [HACKERS] like/ilike improvements

2007-09-21 Thread Guillaume Smet
Andrew, On 9/20/07, Andrew Dunstan [EMAIL PROTECTED] wrote: Please try the attached patch, which goes back to using a special case for single-byte ILIKE. I want to make sure that at the very least we don't cause a performance regression with the code done this release. I can't see an obvious

Re: [HACKERS] like/ilike improvements

2007-09-21 Thread ITAGAKI Takahiro
Guillaume Smet [EMAIL PROTECTED] wrote: It's better but still slower than 8.2. It probablly comes from 'var-varlena' feature in 8.3. Now we store text fields in a compact format on disks and extract them on access. It consumes some CPU cycles. If all of data are in buffer cache and the

Re: [HACKERS] like/ilike improvements

2007-09-21 Thread Gregory Stark
ITAGAKI Takahiro [EMAIL PROTECTED] writes: Guillaume Smet [EMAIL PROTECTED] wrote: It's better but still slower than 8.2. It probablly comes from 'var-varlena' feature in 8.3. Now we store text fields in a compact format on disks and extract them on access. It consumes some CPU cycles. If

Re: [HACKERS] like/ilike improvements

2007-09-21 Thread Guillaume Smet
Gregory, On 9/21/07, Gregory Stark [EMAIL PROTECTED] wrote: Hm, it does seem I missed like.c when I converted all the text operators to avoid detoasting packed varlenas. I'll send a patch in a few minutes to do that. I'm surprised it would have such a large effect though. The patch doesn't

Re: [HACKERS] like/ilike improvements

2007-09-21 Thread Gregory Stark
Guillaume Smet [EMAIL PROTECTED] writes: Gregory, On 9/21/07, Gregory Stark [EMAIL PROTECTED] wrote: Hm, it does seem I missed like.c when I converted all the text operators to avoid detoasting packed varlenas. I'll send a patch in a few minutes to do that. I'm surprised it would have such

Re: [HACKERS] like/ilike improvements

2007-09-21 Thread Andrew Dunstan
Guillaume Smet wrote: Gregory, On 9/21/07, Gregory Stark [EMAIL PROTECTED] wrote: Hm, it does seem I missed like.c when I converted all the text operators to avoid detoasting packed varlenas. I'll send a patch in a few minutes to do that. I'm surprised it would have such a large effect

Re: [HACKERS] like/ilike improvements

2007-09-20 Thread Guillaume Smet
On 9/20/07, Andrew Dunstan [EMAIL PROTECTED] wrote: Can you retry both sets of tests but this time in C locale? The lower() code works differently in C locale, and it might be that we need to look at tweaking just one case. Here we go with SQL_ASCII: ** 8.1 ** cityvox_c=# SELECT e.numeve

Re: [HACKERS] like/ilike improvements

2007-09-20 Thread Andrew Dunstan
Guillaume Smet wrote: app_hls On 9/20/07, Andrew Dunstan [EMAIL PROTECTED] wrote: Can you retry both sets of tests but this time in C locale? The lower() code works differently in C locale, and it might be that we need to look at tweaking just one case. Please try the

Re: [HACKERS] like/ilike improvements

2007-09-20 Thread Andrew Dunstan
I wrote: I can't see an obvious way around the problem for multi-byte case - lower() then requires converting to and from wchar, and I don't see a way of avoiding calling lower(). There is one way we could reduce the use of lower() by up to (almost) 50% in the common case where the

Re: [HACKERS] like/ilike improvements

2007-09-19 Thread Guillaume Smet
Andrew, All, On 5/22/07, Andrew Dunstan [EMAIL PROTECTED] wrote: But before I commit this I'd appreciate seeing some more testing, both for correctness and performance. I finally found some time to test this patch on our data. As our production database is still using 8.1, I made my tests

Re: [HACKERS] like/ilike improvements

2007-09-19 Thread Andrew Dunstan
Guillaume Smet wrote: Andrew, All, On 5/22/07, Andrew Dunstan [EMAIL PROTECTED] wrote: But before I commit this I'd appreciate seeing some more testing, both for correctness and performance. I finally found some time to test this patch on our data. As our production

Re: [HACKERS] like/ilike improvements

2007-09-19 Thread Guillaume Smet
On 9/19/07, Andrew Dunstan [EMAIL PROTECTED] wrote: It's at least good to see that the LIKE case has some useful speedup in 8.3. It can be due to your patch or to the varlena header patch. Seqscan is a bit faster too. Can you run the same set of tests in a single byte encoding like latin1?

Re: [HACKERS] like/ilike improvements

2007-09-19 Thread Guillaume Smet
On 9/19/07, Andrew Dunstan [EMAIL PROTECTED] wrote: Can you run the same set of tests in a single byte encoding like latin1? Here are the results (each query was executed several times before this result): ** 8.1 ** cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve ILIKE

Re: [HACKERS] like/ilike improvements

2007-09-19 Thread Andrew Dunstan
Guillaume Smet wrote: On 9/19/07, Andrew Dunstan [EMAIL PROTECTED] wrote: Can you run the same set of tests in a single byte encoding like latin1? Here are the results (each query was executed several times before this result): ** 8.1 ** cityvox_latin1=# SELECT e.numeve FROM

Re: [HACKERS] like/ilike improvements

2007-05-25 Thread Zeugswetter Andreas ADI SD
However, I have just about convinced myself that we don't need IsFirstByte for matching _ for UTF8, either preceded by % or not, as it should always be true. Can anyone come up with a counter example? You have to be on a first byte before you can meaningfully apply NextChar, and you

Re: [HACKERS] like/ilike improvements

2007-05-25 Thread Andrew Dunstan
[EMAIL PROTECTED] wrote: Is it worth the effort to pre-process the pattern? For example: %% - % This is already done, required by spec. %_ - _% If applied recursively, this would automatically cover: %_% - _% _%_ - __% The 'benefit' would be that the pattern

Re: [HACKERS] like/ilike improvements

2007-05-25 Thread Andrew Dunstan
Zeugswetter Andreas ADI SD wrote: You have to be on a first byte before you can meaningfully apply NextChar, and you have to use NextChar or else you don't count characters correctly (eg __ must match 2 chars not 2 bytes). Well, for utf8 NextChar could advance to the next char even

Re: [HACKERS] like/ilike improvements

2007-05-25 Thread Tom Lane
Zeugswetter Andreas ADI SD [EMAIL PROTECTED] writes: You have to be on a first byte before you can meaningfully apply NextChar, and you have to use NextChar or else you don't count characters correctly (eg __ must match 2 chars not 2 bytes). Well, for utf8 NextChar could advance to the

Re: [HACKERS] like/ilike improvements

2007-05-25 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: do { (t)++; (tlen)--} while ((*(t) 0xC0) == 0x80 tlen 0) The while *must* test those two conditions in the other order. (Don't laugh --- we've had reproducible bugs before in which the backend dumped core because of running off the end of memory due

Re: [HACKERS] like/ilike improvements

2007-05-24 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: We should only be able to get out of step from the %_ case, I believe, so we should only need to do the first-byte test in that case (which is in a different code path from the normal _ case. Does that seem right? At least

Re: [HACKERS] like/ilike improvements

2007-05-24 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: OK, Here is a patch that I am fairly confident does what's been discussed, as summarised by Tom. ! #define CHAREQ(p1, p2) (*p1 == *p2) ... + #define IsFirstByte(c) ((*c 0xC0) != 0x80) These macros are bugs waiting to happen. Please parenthesize

Re: [HACKERS] like/ilike improvements

2007-05-24 Thread Andrew Dunstan
Tom Lane wrote: I'm not sure I believe the new coding for %-matching at all, and I certainly don't like the 100% lack of comments explaining why the different cases are necessary and just how they differ. In particular, once we've advanced more than one character, why does it still matter

Re: [HACKERS] like/ilike improvements

2007-05-24 Thread Andrew Dunstan
Tom Lane wrote: There should somewhere be a block comment explaining all the reasoning we've so painfully gone through about why the three cases (SB, MB, UTF8) are needed and how they must differ. I'm working on a detailed description/rationale. However, I have

Re: [HACKERS] like/ilike improvements

2007-05-24 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: However, I have just about convinced myself that we don't need IsFirstByte for matching _ for UTF8, either preceded by % or not, as it should always be true. Can anyone come up with a counter example? You have to be on a first byte before you can

Re: [HACKERS] like/ilike improvements

2007-05-24 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: However, I have just about convinced myself that we don't need IsFirstByte for matching _ for UTF8, either preceded by % or not, as it should always be true. Can anyone come up with a counter example? You have to be on a

Re: [HACKERS] like/ilike improvements

2007-05-24 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: You have to be on a first byte before you can meaningfully apply NextChar, and you have to use NextChar or else you don't count characters correctly (eg __ must match 2 chars not 2 bytes). Yes, I agree completely. However it looks to

Re: [HACKERS] like/ilike improvements

2007-05-24 Thread Tom Lane
I wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Yes, I agree completely. However it looks to me like IsFirstByte will in fact always be true when we get to call NextChar for matching _ for UTF8. If that's true, the patch is failing to achieve its goal of treating % bytewise ... OK, I

Re: [HACKERS] like/ilike improvements

2007-05-24 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: You have to be on a first byte before you can meaningfully apply NextChar, and you have to use NextChar or else you don't count characters correctly (eg __ must match 2 chars not 2 bytes). Yes, I

Re: [HACKERS] like/ilike improvements

2007-05-24 Thread Andrew Dunstan
Tom Lane wrote: OK, I studied it a bit more and now see what you're driving at: in this form of the patch, we treat % bytewise unless it is followed by _, in which case we treat it char-wise. That seems a good tradeoff, considering that such a pattern is probably pretty uncommon --- we should

Re: [HACKERS] like/ilike improvements

2007-05-24 Thread mark
On Thu, May 24, 2007 at 11:20:51PM -0400, Tom Lane wrote: I wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Yes, I agree completely. However it looks to me like IsFirstByte will in fact always be true when we get to call NextChar for matching _ for UTF8. If that's true, the patch is

Re: [HACKERS] like/ilike improvements

2007-05-23 Thread db
And Dennis said: It is only when you have a pattern like '%_' when this is a problem and we could detect this and do byte by byte when it's not. Now we check (*p == '\\') || (*p == '_') in each iteration when we scan over characters for '%', and we could do it once and have different loops

Re: [HACKERS] like/ilike improvements

2007-05-23 Thread Andrew Dunstan
Tom Lane wrote: 3. UTF8: % can advance bytewise. _ must check it is on a first byte (else return match failure) and if so do NextChar. So primitives are NextChar, NextByte, ByteEq, IsFirstByte. We should only be able to get out of step from the %_ case, I believe, so we should only

Re: [HACKERS] like/ilike improvements

2007-05-23 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: We should only be able to get out of step from the %_ case, I believe, so we should only need to do the first-byte test in that case (which is in a different code path from the normal _ case. Does that seem right? At least put Assert(IsFirstByte()) in

Re: [HACKERS] like/ilike improvements

2007-05-23 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: We should only be able to get out of step from the %_ case, I believe, so we should only need to do the first-byte test in that case (which is in a different code path from the normal _ case. Does that seem right? At least

Re: [HACKERS] like/ilike improvements

2007-05-23 Thread Andrew Dunstan
Andrew Dunstan wrote: Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: We should only be able to get out of step from the %_ case, I believe, so we should only need to do the first-byte test in that case (which is in a different code path from the normal _ case. Does that seem

Re: [HACKERS] like/ilike improvements

2007-05-23 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: I am also wondering if it might be sensible to make this choice once at backend startup and store a function pointer, instead of doing it for every string processed by like/ilike: if (pg_database_encoding_max_length() == 1) return

Re: [HACKERS] like/ilike improvements

2007-05-23 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: We should only be able to get out of step from the %_ case, I believe, so we should only need to do the first-byte test in that case (which is in a different code path from the normal _ case. Does that seem right? At least

Re: [HACKERS] like/ilike improvements

2007-05-22 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: ... It turns out (according to the analysis) that the only time we actually need to use NextChar is when we are matching an _ in a like/ilike pattern. I thought we'd determined that advancing bytewise for % was also risky, in two cases: 1. Multibyte

Re: [HACKERS] like/ilike improvements

2007-05-22 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: ... It turns out (according to the analysis) that the only time we actually need to use NextChar is when we are matching an _ in a like/ilike pattern. I thought we'd determined that advancing bytewise for % was also risky, in

Re: [HACKERS] like/ilike improvements

2007-05-22 Thread Andrew Dunstan
Andrew Dunstan wrote: Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: ... It turns out (according to the analysis) that the only time we actually need to use NextChar is when we are matching an _ in a like/ilike pattern. I thought we'd determined that advancing bytewise

Re: [HACKERS] like/ilike improvements

2007-05-22 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: I thought we'd determined that advancing bytewise for % was also risky, in two cases: 1. Multibyte character set that is not UTF8 (more specifically, does not have a guarantee that first bytes and not-first bytes are distinct) I

Re: [HACKERS] like/ilike improvements

2007-05-22 Thread Guillaume Smet
On 5/22/07, Andrew Dunstan [EMAIL PROTECTED] wrote: But before I commit this I'd appreciate seeing some more testing, both for correctness and performance. Any chance the patch applies cleanly on a 8.2 code base? I can test it on a real life 8.2 db but I won't have the time to load the data in

Re: [HACKERS] like/ilike improvements

2007-05-22 Thread Andrew - Supernews
On 2007-05-22, Tom Lane [EMAIL PROTECTED] wrote: If % advances by bytes then this will find a spurious match. The only thing that prevents it is if B can't be both a leading and a trailing byte of validly-encoded MB characters. Which is (by design) true in UTF8, but is not true of most other

Re: [HACKERS] like/ilike improvements

2007-05-22 Thread Tom Lane
Andrew - Supernews [EMAIL PROTECTED] writes: On 2007-05-22, Tom Lane [EMAIL PROTECTED] wrote: If % advances by bytes then this will find a spurious match. The only thing that prevents it is if B can't be both a leading and a trailing byte of validly-encoded MB characters. Which is (by

Re: [HACKERS] like/ilike improvements

2007-05-22 Thread mark
On Tue, May 22, 2007 at 12:12:51PM -0400, Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: ... It turns out (according to the analysis) that the only time we actually need to use NextChar is when we are matching an _ in a like/ilike pattern. I thought we'd determined that

Re: [HACKERS] like/ilike improvements

2007-05-22 Thread Andrew Dunstan
Tom Lane wrote: Yeah. It seems we need three comparison functions after all: Yeah, that was my confusion. I thought we had concluded that we didn't, but clearly we do. 1. Single-byte character set: needs NextByte and ByteEq only. 2. Generic multi-byte character set: both % and _