Tom Lane wrote:
Andrew Dunstan <[EMAIL PROTECTED]> writes:
But why are we doing that CHAREQ?
To avoid the cost of the recursive call, just like it says.
If it succeeds we'll
just do it again when we recurse, I think.
If you move the other two cases then you could advance
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> But why are we doing that CHAREQ?
To avoid the cost of the recursive call, just like it says.
> If it succeeds we'll
> just do it again when we recurse, I think.
If you move the other two cases then you could advance t and p before
entering the recur
[EMAIL PROTECTED] wrote:
Doh, you're right ... but on third thought, what happens with a pattern
containing "%_"? If % tries to advance bytewise then we'll be trying to
apply NextChar in the middle of a data character, and bad things ensue.
Right, when you have '_' after a '%' you need
> Doh, you're right ... but on third thought, what happens with a pattern
> containing "%_"? If % tries to advance bytewise then we'll be trying to
> apply NextChar in the middle of a data character, and bad things ensue.
Right, when you have '_' after a '%' you need to make sure the '%'
advances
Tom Lane wrote:
Andrew Dunstan <[EMAIL PROTECTED]> writes:
Tom Lane wrote:
On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?
We still call NextChar() after "_", and I think we probably need to,
don't w
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> On the strength of this analysis, shouldn't we drop the separate
>> UTF8 match function and just use SB_MatchText for UTF8?
> We still call NextChar() after "_", and I think we probably need to,
> don't we? If so we can't just marry
Tom Lane wrote:
On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?
We still call NextChar() after "_", and I think we probably need to,
don't we? If so we can't just marry the cases.
cheers
andrew
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>>> Yeah, quite possibly. I'm also wondering if we are wasting effort
>>> downcasing what will in most cases be the same pattern over and over
>>> again. Maybe we need to look at memoizing t
Tom Lane wrote:
Andrew Dunstan <[EMAIL PROTECTED]> writes:
Yeah, quite possibly. I'm also wondering if we are wasting effort
downcasing what will in most cases be the same pattern over and over
again. Maybe we need to look at memoizing that somehow, or at least test
to see if that would b
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Yeah, quite possibly. I'm also wondering if we are wasting effort
> downcasing what will in most cases be the same pattern over and over
> again. Maybe we need to look at memoizing that somehow, or at least test
> to see if that would be a gain.
Some
Tom Lane wrote:
On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?
Possibly - IIRC I looked at that and there was some reason I didn't, but
I'll look again.
It strikes me that we may be overcomplicating matters in an
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Are you sure? The big remaining char-matching bottleneck will surely
> be in the code that scans for a place to start matching a %. But
> that's exactly where we can't use byte matching for cases where the
> charset might include AB and BA as characte
oops. patch attached this time
Andrew Dunstan wrote:
I wrote:
It is only when you have a pattern like '%_' when this is a problem
and we could detect this and do byte by byte when it's not. Now we
check (*p == '\\') || (*p == '_') in each iteration when we scan
over characters for '%
I wrote:
It is only when you have a pattern like '%_' when this is a problem
and we could detect this and do byte by byte when it's not. Now we
check (*p == '\\') || (*p == '_') in each iteration when we scan over
characters for '%', and we could do it once and have different loops
for
Dennis Bjorklund wrote:
Tom Lane skrev:
You could imagine trying to do
% a byte at a time (and indeed that's what I'd been thinking it did)
but that gets you out of sync which breaks the _ case.
It is only when you have a pattern like '%_' when this is a problem
and we could detect this and
Tom Lane skrev:
You could imagine trying to do
% a byte at a time (and indeed that's what I'd been thinking it did)
but that gets you out of sync which breaks the _ case.
It is only when you have a pattern like '%_' when this is a problem and
we could detect this and do byte by byte when it's
Tom Lane wrote:
ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
Yes, I only used the 'disjoint representations for first-bytes and
not-first-bytes of MB characters' feature in UTF8. Other encodings
allows both [AB] and [BA] for MB character patterns. UTF8Match() does
not cope with those encodi
Tom Lane wrote:
Under *no* circumstances use __inline__, as it will certainly break
every non-gcc compiler. Use "inline", which we #define appropriately
at need.
OK. (this was from upstream patch.)
I thought we'd concluded that this explanation is pseudo-science?
[...]
spellche
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Attached is my current WIP patch.
A few quick eyeball comments:
> ! static __inline__ int
Under *no* circumstances use __inline__, as it will certainly break
every non-gcc compiler. Use "inline", which we #define appropriately
at need.
> !
Tom Lane wrote:
ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
Yes, I only used the 'disjoint representations for first-bytes and
not-first-bytes of MB characters' feature in UTF8. Other encodings
allows both [AB] and [BA] for MB character patterns. UTF8Match() does
not cope with those encodi
ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
> Yes, I only used the 'disjoint representations for first-bytes and
> not-first-bytes of MB characters' feature in UTF8. Other encodings
> allows both [AB] and [BA] for MB character patterns. UTF8Match() does
> not cope with those encodings; If we have
Tom Lane <[EMAIL PROTECTED]> wrote:
> On the strength of this closer reading, I would say that the patch isn't
> relying on UTF8's first-byte-vs-not-first-byte property after all.
> All that it's relying on is that no MB character is a prefix of another
> one, which seems like a necessary propert
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Is it legal to follow escape by anything other than _ % or escape?
Certainly, but once you've compared the first byte you can handle any
remaining bytes via the main loop. And in fact the code is already
depending on being able to do that --- the use o
Tom Lane wrote:
* At a pattern backslash, it applies CHAREQ() but then advances
byte-by-byte over the matched characters (implicitly assuming that none
of these bytes will look like the magic characters). While that works
for backend-safe encodings, it seems a bit strange; you've already paid
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> From my WIP patch, here's where the difference appears to be - note
> that UTF8 branch has two NextByte calls at the bottom, unlike the other
> branch:
Oh, I see: NextChar is still "real" but the patch is willing to have t
and p pointing into the midd
Tom Lane wrote:
Andrew Dunstan <[EMAIL PROTECTED]> writes:
Tom Lane wrote:
Except that the entire point of this patch is to dumb down NextChar to
be the same as NextByte for UTF8 strings.
That's not what I see in (what I think is) the latest submission, which
includes thi
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Except that the entire point of this patch is to dumb down NextChar to
>> be the same as NextByte for UTF8 strings.
> That's not what I see in (what I think is) the latest submission, which
> includes this snippet:
[ scratches head.
Tom Lane wrote:
Andrew Dunstan <[EMAIL PROTECTED]> writes:
Tom Lane wrote:
Wait a second ... I just thought of a counterexample that destroys the
entire concept. Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*. With the proposed patch in
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Wait a second ... I just thought of a counterexample that destroys the
>> entire concept. Consider the pattern 'A__B', which clearly is supposed
>> to match strings of four *characters*. With the proposed patch in
>> place, it would
Tom Lane wrote:
Wait a second ... I just thought of a counterexample that destroys the
entire concept. Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*. With the proposed patch in
place, it would match strings of four *bytes*. Which is not the corr
Tom Lane wrote:
UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that property
Wait a second ... I just thought of a counterexample that destroys the
entire concept. Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*. With the proposed patch in
place, it would match strings of four *bytes*. Which is not the correct
behavior.
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Ok, I have studied some more and I think I understand what's going on.
> AIUI, we are switching from some expensive char-wise comparisons to
> cheap byte-wise comparisons in the UTF8 case because we know that in
> UTF8 the magic characters ('_', '%' a
I wrote:
ISTM we should generate all these match functions from one body of
code plus some #define magic.
As I understand it, we have three possible encoding switches: Single
Byte, UTF8 and other Multi Byte Charsets, and two possible case
settings: case Sensitive and Case Insensitive. Th
Itagaki,
I find this still fairly unclean. It certainly took me some time to get
me head around what's going on.
ISTM we should generate all these match functions from one body of code
plus some #define magic.
As I understand it, we have three possible encoding switches: Single
Byte, UTF
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches
It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.
---
IT
Patch removed, updated version submitted.
---
ITAGAKI Takahiro wrote:
> "Andrew - Supernews" <[EMAIL PROTECTED]> wrote:
>
> > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> > ITAGAKI> by bytes, not
Bruce Momjian <[EMAIL PROTECTED]> wrote:
> > I do not understand this patch. You have defined two functions,
> > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
> > is that one calls CHAREQ and the other calls ICHAREQ, but just above
> > those two functions you define the m
Bruce Momjian wrote:
>
> I do not understand this patch. You have defined two functions,
> UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
> is that one calls CHAREQ and the other calls ICHAREQ, but just above
> those two functions you define the macros identically:
>
>
I do not understand this patch. You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros identically:
#define CHAREQ(p1, p2)wch
I assume this replaces all your earlier multi-byte LIKE patches.
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches
It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.
--
"Andrew - Supernews" <[EMAIL PROTECTED]> wrote:
> ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> ITAGAKI> by bytes, not only UTF-8.
>
> This is false, particularly for EUC.
Umm, I see. I updated the optimization to be used only for UTF8 case.
I also added some inlining
42 matches
Mail list logo