Re: [PATCHES] UTF8MatchText

2007-05-21 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: But why are we doing that CHAREQ? To avoid the cost of the recursive call, just like it says. If it succeeds we'll just do it again when we recurse, I think. If you move the other two cases then you could advance

Re: [PATCHES] UTF8MatchText

2007-05-21 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > But why are we doing that CHAREQ? To avoid the cost of the recursive call, just like it says. > If it succeeds we'll > just do it again when we recurse, I think. If you move the other two cases then you could advance t and p before entering the recur

Re: [PATCHES] UTF8MatchText

2007-05-21 Thread Andrew Dunstan
[EMAIL PROTECTED] wrote: Doh, you're right ... but on third thought, what happens with a pattern containing "%_"? If % tries to advance bytewise then we'll be trying to apply NextChar in the middle of a data character, and bad things ensue. Right, when you have '_' after a '%' you need

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread db
> Doh, you're right ... but on third thought, what happens with a pattern > containing "%_"? If % tries to advance bytewise then we'll be trying to > apply NextChar in the middle of a data character, and bad things ensue. Right, when you have '_' after a '%' you need to make sure the '%' advances

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Tom Lane wrote: On the strength of this analysis, shouldn't we drop the separate UTF8 match function and just use SB_MatchText for UTF8? We still call NextChar() after "_", and I think we probably need to, don't w

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> On the strength of this analysis, shouldn't we drop the separate >> UTF8 match function and just use SB_MatchText for UTF8? > We still call NextChar() after "_", and I think we probably need to, > don't we? If so we can't just marry

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
Tom Lane wrote: On the strength of this analysis, shouldn't we drop the separate UTF8 match function and just use SB_MatchText for UTF8? We still call NextChar() after "_", and I think we probably need to, don't we? If so we can't just marry the cases. cheers andrew

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Andrew Dunstan <[EMAIL PROTECTED]> writes: >>> Yeah, quite possibly. I'm also wondering if we are wasting effort >>> downcasing what will in most cases be the same pattern over and over >>> again. Maybe we need to look at memoizing t

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Yeah, quite possibly. I'm also wondering if we are wasting effort downcasing what will in most cases be the same pattern over and over again. Maybe we need to look at memoizing that somehow, or at least test to see if that would b

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Yeah, quite possibly. I'm also wondering if we are wasting effort > downcasing what will in most cases be the same pattern over and over > again. Maybe we need to look at memoizing that somehow, or at least test > to see if that would be a gain. Some

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
Tom Lane wrote: On the strength of this analysis, shouldn't we drop the separate UTF8 match function and just use SB_MatchText for UTF8? Possibly - IIRC I looked at that and there was some reason I didn't, but I'll look again. It strikes me that we may be overcomplicating matters in an

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Are you sure? The big remaining char-matching bottleneck will surely > be in the code that scans for a place to start matching a %. But > that's exactly where we can't use byte matching for cases where the > charset might include AB and BA as characte

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
oops. patch attached this time Andrew Dunstan wrote: I wrote: It is only when you have a pattern like '%_' when this is a problem and we could detect this and do byte by byte when it's not. Now we check (*p == '\\') || (*p == '_') in each iteration when we scan over characters for '%

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
I wrote: It is only when you have a pattern like '%_' when this is a problem and we could detect this and do byte by byte when it's not. Now we check (*p == '\\') || (*p == '_') in each iteration when we scan over characters for '%', and we could do it once and have different loops for

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
Dennis Bjorklund wrote: Tom Lane skrev: You could imagine trying to do % a byte at a time (and indeed that's what I'd been thinking it did) but that gets you out of sync which breaks the _ case. It is only when you have a pattern like '%_' when this is a problem and we could detect this and

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Dennis Bjorklund
Tom Lane skrev: You could imagine trying to do % a byte at a time (and indeed that's what I'd been thinking it did) but that gets you out of sync which breaks the _ case. It is only when you have a pattern like '%_' when this is a problem and we could detect this and do byte by byte when it's

Re: [PATCHES] UTF8MatchText

2007-05-18 Thread Andrew Dunstan
Tom Lane wrote: ITAGAKI Takahiro <[EMAIL PROTECTED]> writes: Yes, I only used the 'disjoint representations for first-bytes and not-first-bytes of MB characters' feature in UTF8. Other encodings allow both [AB] and [BA] for MB character patterns. UTF8Match() does not cope with those encodi

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: Under *no* circumstances use __inline__, as it will certainly break every non-gcc compiler. Use "inline", which we #define appropriately at need. OK. (this was from upstream patch.) I thought we'd concluded that this explanation is pseudo-science? [...] spellche

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Attached is my current WIP patch. A few quick eyeball comments: > ! static __inline__ int Under *no* circumstances use __inline__, as it will certainly break every non-gcc compiler. Use "inline", which we #define appropriately at need. > !

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: ITAGAKI Takahiro <[EMAIL PROTECTED]> writes: Yes, I only used the 'disjoint representations for first-bytes and not-first-bytes of MB characters' feature in UTF8. Other encodings allow both [AB] and [BA] for MB character patterns. UTF8Match() does not cope with those encodi

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
ITAGAKI Takahiro <[EMAIL PROTECTED]> writes: > Yes, I only used the 'disjoint representations for first-bytes and > not-first-bytes of MB characters' feature in UTF8. Other encodings > allow both [AB] and [BA] for MB character patterns. UTF8Match() does > not cope with those encodings; If we have

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread ITAGAKI Takahiro
Tom Lane <[EMAIL PROTECTED]> wrote: > On the strength of this closer reading, I would say that the patch isn't > relying on UTF8's first-byte-vs-not-first-byte property after all. > All that it's relying on is that no MB character is a prefix of another > one, which seems like a necessary propert

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Is it legal to follow escape by anything other than _ % or escape? Certainly, but once you've compared the first byte you can handle any remaining bytes via the main loop. And in fact the code is already depending on being able to do that --- the use o

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: * At a pattern backslash, it applies CHAREQ() but then advances byte-by-byte over the matched characters (implicitly assuming that none of these bytes will look like the magic characters). While that works for backend-safe encodings, it seems a bit strange; you've already paid

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > From my WIP patch, here's where the difference appears to be - note > that UTF8 branch has two NextByte calls at the bottom, unlike the other > branch: Oh, I see: NextChar is still "real" but the patch is willing to have t and p pointing into the midd

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Tom Lane wrote: Except that the entire point of this patch is to dumb down NextChar to be the same as NextByte for UTF8 strings. That's not what I see in (what I think is) the latest submission, which includes thi

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Except that the entire point of this patch is to dumb down NextChar to >> be the same as NextByte for UTF8 strings. > That's not what I see in (what I think is) the latest submission, which > includes this snippet: [ scratches head.

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Tom Lane wrote: Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Wait a second ... I just thought of a counterexample that destroys the >> entire concept. Consider the pattern 'A__B', which clearly is supposed >> to match strings of four *characters*. With the proposed patch in >> place, it would

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in place, it would match strings of four *bytes*. Which is not the corr

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: UTF8 has disjoint representations for first-bytes and not-first-bytes of MB characters, and thus it is impossible to make a false match in which an MB pattern character is matched to the end of one data character plus the start of another. In character sets without that property
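Editorial illustration of the "disjoint representations" property quoted above. This is only a sketch, not code from the patch: `is_utf8_continuation` and `count_misaligned_hits` are hypothetical names, and the input is assumed to be valid UTF-8. Because continuation bytes always have the form 10xxxxxx and lead bytes never do, a bytewise search for a whole character cannot begin in the middle of another character.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: true for UTF-8 continuation bytes (10xxxxxx). */
static int is_utf8_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: count bytewise occurrences of needle in haystack
 * that start in the middle of a character.  Both strings are assumed to
 * be valid UTF-8, and needle is assumed to start on a character boundary.
 */
static int count_misaligned_hits(const char *haystack, const char *needle)
{
    int misaligned = 0;
    const char *hit = haystack;

    assert(haystack != NULL && needle != NULL && *needle != '\0');

    while ((hit = strstr(hit, needle)) != NULL)
    {
        if (is_utf8_continuation((unsigned char) *hit))
            misaligned++;
        hit++;              /* continue searching after this hit */
    }
    return misaligned;
}
```

In an encoding where the same byte value can be either a first or a second byte (the [AB] and [BA] case raised elsewhere in the thread), the same bytewise search could report hits that straddle two characters, which is why the shortcut is discussed for UTF8 only.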

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in place, it would match strings of four *bytes*. Which is not the correct behavior.
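Editorial illustration of the counterexample above. This is a minimal sketch, not the PostgreSQL implementation: `utf8_char_len` and `match_underscore` are hypothetical names, only literals and '_' are handled (no '%', no escape), and the text is assumed to be valid UTF-8. The `bytewise` flag mimics a matcher whose NextChar has been reduced to NextByte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 character led by c. */
static size_t utf8_char_len(unsigned char c)
{
    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    if ((c & 0xF8) == 0xF0)
        return 4;
    return 1;               /* continuation or invalid byte: step one byte */
}

/*
 * Match text t against pattern p, where p contains only literal bytes
 * and '_'.  If bytewise is nonzero, '_' consumes one byte; otherwise it
 * consumes one whole UTF-8 character.
 */
static int match_underscore(const char *t, const char *p, int bytewise)
{
    assert(t != NULL && p != NULL);

    while (*p != '\0')
    {
        if (*t == '\0')
            return 0;       /* pattern left over, text exhausted */

        if (*p == '_')
        {
            size_t step = bytewise ? 1 : utf8_char_len((unsigned char) *t);
            size_t remaining = strlen(t);

            /* never step past the terminator on truncated input */
            t += (step < remaining) ? step : remaining;
            p++;
        }
        else
        {
            if (*t != *p)
                return 0;
            t++;
            p++;
        }
    }
    return *t == '\0';
}
```

With the four-character, six-byte text "A\xC3\xA9\xC3\xA9" "B", the pattern 'A__B' matches when '_' advances by character and fails when it advances by byte. The three-character, four-byte text "A\xC3\xA9" "B" shows the opposite error: the bytewise variant accepts it, which is the incorrect four-bytes behavior described in the message.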

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Ok, I have studied some more and I think I understand what's going on. > AIUI, we are switching from some expensive char-wise comparisons to > cheap byte-wise comparisons in the UTF8 case because we know that in > UTF8 the magic characters ('_', '%' a

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
I wrote: ISTM we should generate all these match functions from one body of code plus some #define magic. As I understand it, we have three possible encoding switches: Single Byte, UTF8 and other Multi Byte Charsets, and two possible case settings: case Sensitive and Case Insensitive. Th

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Itagaki, I find this still fairly unclean. It certainly took me some time to get my head around what's going on. ISTM we should generate all these match functions from one body of code plus some #define magic. As I understand it, we have three possible encoding switches: Single Byte, UTF

Re: [PATCHES] UTF8MatchText

2007-04-09 Thread Bruce Momjian
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --- IT

Re: [PATCHES] UTF8MatchText

2007-04-09 Thread Bruce Momjian
Patch removed, updated version submitted. --- ITAGAKI Takahiro wrote: > "Andrew - Supernews" <[EMAIL PROTECTED]> wrote: > > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > > ITAGAKI> by bytes, not

Re: [PATCHES] UTF8MatchText

2007-04-08 Thread ITAGAKI Takahiro
Bruce Momjian <[EMAIL PROTECTED]> wrote: > > I do not understand this patch. You have defined two functions, > > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them > > is that one calls CHAREQ and the other calls ICHAREQ, but just above > > those two functions you define the m

Re: [PATCHES] UTF8MatchText

2007-04-06 Thread Bruce Momjian
Bruce Momjian wrote: > > I do not understand this patch. You have defined two functions, > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them > is that one calls CHAREQ and the other calls ICHAREQ, but just above > those two functions you define the macros identically: > >

Re: [PATCHES] UTF8MatchText

2007-04-06 Thread Bruce Momjian
I do not understand this patch. You have defined two functions, UTF8MatchText() and UTF8MatchTextIC(), and the difference between them is that one calls CHAREQ and the other calls ICHAREQ, but just above those two functions you define the macros identically: #define CHAREQ(p1, p2)wch

Re: [PATCHES] UTF8MatchText

2007-04-02 Thread Bruce Momjian
I assume this replaces all your earlier multi-byte LIKE patches. Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --

[PATCHES] UTF8MatchText

2007-04-01 Thread ITAGAKI Takahiro
"Andrew - Supernews" <[EMAIL PROTECTED]> wrote: > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > ITAGAKI> by bytes, not only UTF-8. > > This is false, particularly for EUC. Umm, I see. I updated the optimization to be used only for UTF8 case. I also added some inlining