Re: [PATCHES] UTF8MatchText

2007-05-21 Thread db
Doh, you're right ... but on third thought, what happens with a pattern containing %_? If % tries to advance bytewise then we'll be trying to apply NextChar in the middle of a data character, and bad things ensue. Right, when you have '_' after a '%' you need to make sure the '%' advances

Re: [PATCHES] UTF8MatchText

2007-05-21 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: But why are we doing that CHAREQ? To avoid the cost of the recursive call, just like it says. If it succeeds we'll just do it again when we recurse, I think. If you move the other two cases then you could advance t and p before entering the

Re: [PATCHES] UTF8MatchText

2007-05-21 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: But why are we doing that CHAREQ? To avoid the cost of the recursive call, just like it says. If it succeeds we'll just do it again when we recurse, I think. If you move the other two cases then you could advance t

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Dennis Bjorklund
Tom Lane skrev: You could imagine trying to do % a byte at a time (and indeed that's what I'd been thinking it did) but that gets you out of sync which breaks the _ case. It is only when you have a pattern like '%_' when this is a problem and we could detect this and do byte by byte when it's

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
Dennis Bjorklund wrote: Tom Lane skrev: You could imagine trying to do % a byte at a time (and indeed that's what I'd been thinking it did) but that gets you out of sync which breaks the _ case. It is only when you have a pattern like '%_' when this is a problem and we could detect this

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
oops. patch attached this time Andrew Dunstan wrote: I wrote: It is only when you have a pattern like '%_' when this is a problem and we could detect this and do byte by byte when it's not. Now we check (*p == '\\') || (*p == '_') in each iteration when we scan over characters for

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Are you sure? The big remaining char-matching bottleneck will surely be in the code that scans for a place to start matching a %. But that's exactly where we can't use byte matching for cases where the charset might include AB and BA as characters -

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
Tom Lane wrote: On the strength of this analysis, shouldn't we drop the separate UTF8 match function and just use SB_MatchText for UTF8? Possibly - IIRC I looked at that and there was some reason I didn't, but I'll look again. It strikes me that we may be overcomplicating matters in

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Yeah, quite possibly. I'm also wondering if we are wasting effort downcasing what will in most cases be the same pattern over and over again. Maybe we need to look at memoizing that somehow, or at least test to see if that would be a gain. Someone

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Yeah, quite possibly. I'm also wondering if we are wasting effort downcasing what will in most cases be the same pattern over and over again. Maybe we need to look at memoizing that somehow, or at least test to see if that would be

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Yeah, quite possibly. I'm also wondering if we are wasting effort downcasing what will in most cases be the same pattern over and over again. Maybe we need to look at memoizing that somehow, or

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
Tom Lane wrote: On the strength of this analysis, shouldn't we drop the separate UTF8 match function and just use SB_MatchText for UTF8? We still call NextChar() after _, and I think we probably need to, don't we? If so we can't just marry the cases. cheers andrew

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: On the strength of this analysis, shouldn't we drop the separate UTF8 match function and just use SB_MatchText for UTF8? We still call NextChar() after _, and I think we probably need to, don't we? If so we can't just marry the cases.

Re: [PATCHES] UTF8MatchText

2007-05-20 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: On the strength of this analysis, shouldn't we drop the separate UTF8 match function and just use SB_MatchText for UTF8? We still call NextChar() after _, and I think we probably need to, don't we?

Re: [PATCHES] UTF8MatchText

2007-05-18 Thread Andrew Dunstan
Tom Lane wrote: ITAGAKI Takahiro [EMAIL PROTECTED] writes: Yes, I only used the 'disjoint representations for first-bytes and not-first-bytes of MB characters' feature in UTF8. Other encodings allows both [AB] and [BA] for MB character patterns. UTF8Match() does not cope with those

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Itagaki, I find this still fairly unclean. It certainly took me some time to get me head around what's going on. ISTM we should generate all these match functions from one body of code plus some #define magic. As I understand it, we have three possible encoding switches: Single Byte,

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
I wrote: ISTM we should generate all these match functions from one body of code plus some #define magic. As I understand it, we have three possible encoding switches: Single Byte, UTF8 and other Multi Byte Charsets, and two possible case settings: case Sensitive and Case Insensitive.

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in place, it would match strings of four *bytes*. Which is not the correct behavior.

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: UTF8 has disjoint representations for first-bytes and not-first-bytes of MB characters, and thus it is impossible to make a false match in which an MB pattern character is matched to the end of one data character plus the start of another. In character sets without that

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in place, it would match strings of four *bytes*. Which is not the

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in place, it would match

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: From my WIP patch, here's where the difference appears to be - note that UTF8 branch has two NextByte calls at the bottom, unlike the other branch: Oh, I see: NextChar is still real but the patch is willing to have t and p pointing into the middle of

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: * At a pattern backslash, it applies CHAREQ() but then advances byte-by-byte over the matched characters (implicitly assuming that none of these bytes will look like the magic characters). While that works for backend-safe encodings, it seems a bit strange; you've already

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Is it legal to follow escape by anything other than _ % or escape? Certainly, but once you've compared the first byte you can handle any remaining bytes via the main loop. And in fact the code is already depending on being able to do that --- the use of

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread ITAGAKI Takahiro
Tom Lane [EMAIL PROTECTED] wrote: On the strength of this closer reading, I would say that the patch isn't relying on UTF8's first-byte-vs-not-first-byte property after all. All that it's relying on is that no MB character is a prefix of another one, which seems like a necessary property for

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Attached is my current WIP patch. A few quick eyeball comments: ! static __inline__ int Under *no* circumstances use __inline__, as it will certainly break every non-gcc compiler. Use inline, which we #define appropriately at need. ! *

Re: [PATCHES] UTF8MatchText

2007-05-17 Thread Andrew Dunstan
Tom Lane wrote: Under *no* circumstances use __inline__, as it will certainly break every non-gcc compiler. Use inline, which we #define appropriately at need. OK. (this was from upstream patch.) I thought we'd concluded that this explanation is pseudo-science? [...]

Re: [PATCHES] UTF8MatchText

2007-04-09 Thread Bruce Momjian
Patch removed, updated version submitted. --- ITAGAKI Takahiro wrote: Andrew - Supernews [EMAIL PROTECTED] wrote: ITAGAKI I think all safe ASCII-supersets encodings are comparable ITAGAKI by bytes, not only UTF-8.

Re: [PATCHES] UTF8MatchText

2007-04-09 Thread Bruce Momjian
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. ---

Re: [PATCHES] UTF8MatchText

2007-04-08 Thread ITAGAKI Takahiro
Bruce Momjian [EMAIL PROTECTED] wrote: I do not understand this patch. You have defined two functions, UTF8MatchText() and UTF8MatchTextIC(), and the difference between them is that one calls CHAREQ and the other calls ICHAREQ, but just above those two functions you define the macros

Re: [PATCHES] UTF8MatchText

2007-04-06 Thread Bruce Momjian
I do not understand this patch. You have defined two functions, UTF8MatchText() and UTF8MatchTextIC(), and the difference between them is that one calls CHAREQ and the other calls ICHAREQ, but just above those two functions you define the macros identically: #define CHAREQ(p1, p2)

Re: [PATCHES] UTF8MatchText

2007-04-06 Thread Bruce Momjian
Bruce Momjian wrote: I do not understand this patch. You have defined two functions, UTF8MatchText() and UTF8MatchTextIC(), and the difference between them is that one calls CHAREQ and the other calls ICHAREQ, but just above those two functions you define the macros identically:

Re: [PATCHES] UTF8MatchText

2007-04-02 Thread Bruce Momjian
I assume this replaces all your earlier multi-byte LIKE patches. Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it.