Doh, you're right ... but on third thought, what happens with a pattern
containing %_? If % tries to advance bytewise then we'll be trying to
apply NextChar in the middle of a data character, and bad things ensue.
Right, when you have '_' after a '%' you need to make sure the '%'
advances
Andrew Dunstan [EMAIL PROTECTED] writes:
But why are we doing that CHAREQ?
To avoid the cost of the recursive call, just like it says.
If it succeeds we'll
just do it again when we recurse, I think.
If you move the other two cases then you could advance t and p before
entering the
Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
But why are we doing that CHAREQ?
To avoid the cost of the recursive call, just like it says.
If it succeeds we'll
just do it again when we recurse, I think.
If you move the other two cases then you could advance t
Tom Lane skrev:
You could imagine trying to do
% a byte at a time (and indeed that's what I'd been thinking it did)
but that gets you out of sync which breaks the _ case.
It is only when you have a pattern like '%_' when this is a problem and
we could detect this and do byte by byte when it's
Dennis Bjorklund wrote:
Tom Lane skrev:
You could imagine trying to do
% a byte at a time (and indeed that's what I'd been thinking it did)
but that gets you out of sync which breaks the _ case.
It is only when you have a pattern like '%_' when this is a problem
and we could detect this
oops. patch attached this time
Andrew Dunstan wrote:
I wrote:
It is only when you have a pattern like '%_' when this is a problem
and we could detect this and do byte by byte when it's not. Now we
check (*p == '\\') || (*p == '_') in each iteration when we scan
over characters for
Andrew Dunstan [EMAIL PROTECTED] writes:
Are you sure? The big remaining char-matching bottleneck will surely
be in the code that scans for a place to start matching a %. But
that's exactly where we can't use byte matching for cases where the
charset might include AB and BA as characters -
Tom Lane wrote:
On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?
Possibly - IIRC I looked at that and there was some reason I didn't, but
I'll look again.
It strikes me that we may be overcomplicating matters in
Andrew Dunstan [EMAIL PROTECTED] writes:
Yeah, quite possibly. I'm also wondering if we are wasting effort
downcasing what will in most cases be the same pattern over and over
again. Maybe we need to look at memoizing that somehow, or at least test
to see if that would be a gain.
Someone
Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Yeah, quite possibly. I'm also wondering if we are wasting effort
downcasing what will in most cases be the same pattern over and over
again. Maybe we need to look at memoizing that somehow, or at least test
to see if that would be
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Yeah, quite possibly. I'm also wondering if we are wasting effort
downcasing what will in most cases be the same pattern over and over
again. Maybe we need to look at memoizing that somehow, or
Tom Lane wrote:
On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?
We still call NextChar() after _, and I think we probably need to,
don't we? If so we can't just marry the cases.
cheers
andrew
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?
We still call NextChar() after _, and I think we probably need to,
don't we? If so we can't just marry the cases.
Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?
We still call NextChar() after _, and I think we probably need to,
don't we?
Tom Lane wrote:
ITAGAKI Takahiro [EMAIL PROTECTED] writes:
Yes, I only used the 'disjoint representations for first-bytes and
not-first-bytes of MB characters' feature in UTF8. Other encodings
allows both [AB] and [BA] for MB character patterns. UTF8Match() does
not cope with those
Itagaki,
I find this still fairly unclean. It certainly took me some time to get
me head around what's going on.
ISTM we should generate all these match functions from one body of code
plus some #define magic.
As I understand it, we have three possible encoding switches: Single
Byte,
I wrote:
ISTM we should generate all these match functions from one body of
code plus some #define magic.
As I understand it, we have three possible encoding switches: Single
Byte, UTF8 and other Multi Byte Charsets, and two possible case
settings: case Sensitive and Case Insensitive.
Wait a second ... I just thought of a counterexample that destroys the
entire concept. Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*. With the proposed patch in
place, it would match strings of four *bytes*. Which is not the correct
behavior.
Tom Lane wrote:
UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that
Tom Lane wrote:
Wait a second ... I just thought of a counterexample that destroys the
entire concept. Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*. With the proposed patch in
place, it would match strings of four *bytes*. Which is not the
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
Wait a second ... I just thought of a counterexample that destroys the
entire concept. Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*. With the proposed patch in
place, it would match
Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
Wait a second ... I just thought of a counterexample that destroys the
entire concept. Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*. With the proposed patch in
Andrew Dunstan [EMAIL PROTECTED] writes:
From my WIP patch, here's where the difference appears to be - note
that UTF8 branch has two NextByte calls at the bottom, unlike the other
branch:
Oh, I see: NextChar is still real but the patch is willing to have t
and p pointing into the middle of
Tom Lane wrote:
* At a pattern backslash, it applies CHAREQ() but then advances
byte-by-byte over the matched characters (implicitly assuming that none
of these bytes will look like the magic characters). While that works
for backend-safe encodings, it seems a bit strange; you've already
Andrew Dunstan [EMAIL PROTECTED] writes:
Is it legal to follow escape by anything other than _ % or escape?
Certainly, but once you've compared the first byte you can handle any
remaining bytes via the main loop. And in fact the code is already
depending on being able to do that --- the use of
Tom Lane [EMAIL PROTECTED] wrote:
On the strength of this closer reading, I would say that the patch isn't
relying on UTF8's first-byte-vs-not-first-byte property after all.
All that it's relying on is that no MB character is a prefix of another
one, which seems like a necessary property for
Andrew Dunstan [EMAIL PROTECTED] writes:
Attached is my current WIP patch.
A few quick eyeball comments:
! static __inline__ int
Under *no* circumstances use __inline__, as it will certainly break
every non-gcc compiler. Use inline, which we #define appropriately
at need.
! *
Tom Lane wrote:
Under *no* circumstances use __inline__, as it will certainly break
every non-gcc compiler. Use inline, which we #define appropriately
at need.
OK. (this was from upstream patch.)
I thought we'd concluded that this explanation is pseudo-science?
[...]
Patch removed, updated version submitted.
---
ITAGAKI Takahiro wrote:
Andrew - Supernews [EMAIL PROTECTED] wrote:
ITAGAKI I think all safe ASCII-supersets encodings are comparable
ITAGAKI by bytes, not only UTF-8.
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches
It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.
---
Bruce Momjian [EMAIL PROTECTED] wrote:
I do not understand this patch. You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros
I do not understand this patch. You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros identically:
#define CHAREQ(p1, p2)
Bruce Momjian wrote:
I do not understand this patch. You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros identically:
I assume this replaces all your earlier multi-byte LIKE patches.
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches
It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.
34 matches
Mail list logo