I have committed part of Patrice's patches with minor fixes.
Uncommitted changes are related to the backend side; the reason
can be found in the previous discussions (basically this is due to
the fact that the current regex code does not support UTF-8 chars =
0x1). Instead pg_verifymbstr()
* Bruce Momjian [EMAIL PROTECTED] [011011 22:49]:
Can I ask about the status of this?
I sent a patch a few days ago solving the client-side issue (on
the pgsql-patches mailing list) for review. I think Tatsuo said it
looked OK, but he should confirm or deny this.
There is still the
I've been
OK, I saw the
Hi all,
while working on a new project involving PostgreSQL and making some
tests, I have come up with the following output from psql :
 lang | length | length | text  | text
------+--------+--------+-------+------
 isl  |      7 |
Maybe something like this: declare a plpgsql function that takes two
text parameters and has a body like
for i in 1 .. 1000000 loop
    boolvar := $1 like $2;
end loop;
Then call it with strings of different lengths and see how the runtime
varies. You need to apply the LIKE to
- shell script ---
for i in 32 64 128 256 512 1024 2048 4096 8192
do
    psql -c "explain analyze select liketest(a,'aaa') from
        (select substring('very_long_text' from 0 for $i) as a) as t"
done
- shell script ---
I don't think your search string is sufficient for a test.
With 'aaa' it actually knows that it only needs to look at the
first three characters of a. Imho you need to try something
like liketest(a,'%aaa%').
Ok. I ran the modified test (now the iteration is reduced to 10 in
liketest()). As you can see, there's a huge difference: MB seems up to
~8 times slower. There seem to be some problems in the
implementation. Considering REGEX is not so slow, maybe we should
employ the same design as
Tatsuo Ishii [EMAIL PROTECTED] writes:
To accomplish this, I moved MatchText etc. to a separate file and now
like.c includes it *twice* (similar technique used in regexec()). This
makes like.o a little bit larger, but I believe it is worth it for the
optimization.
That sounds great.
What's your feeling now about the original question: whether to enable
multibyte by default now, or not? I'm still thinking that Peter's
counsel is the wisest: plan to do it in 7.3, not today. But this fix
seems to eliminate the only hard reason we have not to do it today ...
If SQL99's
Tatsuo Ishii [EMAIL PROTECTED] writes:
What do you think?
I think that we were supposed to go beta a month ago, and so this is
no time to start adding new features to this release. Let's plan to
make this happen (one way or the other) in 7.3, instead.
regards, tom lane
Tatsuo Ishii [EMAIL PROTECTED] writes:
... There seem to be some problems in the
implementation. Considering REGEX is not so slow, maybe we should
employ the same design as REGEX, i.e. using wide characters, not
multibyte streams...
Seems like a good thing to put on the to-do list. In
Tom Lane writes:
In the meantime, we still have the question of whether to enable
multibyte in the default configuration.
This would make more sense if all of multibyte, locale, and NLS became
defaults in one release. I haven't quite sold people in the second item
yet, although I have a
Peter Eisentraut [EMAIL PROTECTED] writes:
Tom Lane writes:
In the meantime, we still have the question of whether to enable
multibyte in the default configuration.
Perhaps we could make it a release goal for 7.3
Yeah, that's probably the best way to proceed... it's awfully late
in the 7.2
Bruce Momjian [EMAIL PROTECTED] writes:
Added to TODO:
* Use wide characters to evaluate regular expressions, for performance
(Tatsuo)
Regexes are fine; it's LIKE that's slow.
regards, tom lane
I think that we were supposed to go beta a month ago, and so this is
no time to start adding new features to this release. Let's plan to
make this happen (one way or the other) in 7.3, instead.
Agreed.
--
Tatsuo Ishii
Bruce Momjian [EMAIL PROTECTED] writes:
If no one can find a case where multibyte is slower, I think we should
enable it by default. Comments?
Well, he just did point out such a case:
         no MB       with MB
LIKE     0.09 msec   0.08 msec
REGEX    0.09 msec
Bruce Momjian [EMAIL PROTECTED] writes:
But the strange thing is that LIKE is faster, perhaps meaning his
measurements can't even see the difference,
Yeah, I suspect there's 10% or more noise in these numbers. But then
one could read the results as saying we can't reliably measure any
Tatsuo Ishii writes:
LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a
little bit slow. Probably this is due to the wide character conversion
overhead.
Could this conversion be optimized to recognize when it's dealing with a
single-byte character encoding?
--
Peter
Yeah, I suspect there's 10% or more noise in these numbers. But then
one could read the results as saying we can't reliably measure any
difference at all ...
I'd feel more confident if the measurements were done using operators
repeated enough times to yield multiple-second runtimes. I
LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a
little bit slow. Probably this is due to the wide character conversion
overhead.
Could this conversion be optimized to recognize when it's dealing with a
single-byte character encoding?
Not sure, will look into it...
--
Tatsuo Ishii [EMAIL PROTECTED] writes:
I'd feel more confident if the measurements were done using operators
repeated enough times to yield multiple-second runtimes.
Any idea to do that?
Maybe something like this: declare a plpgsql function that takes two
text parameters and has a body like
Also, have we decided if multibyte should be the configure default now?
Not sure.
Anyway I have tested LIKE/REGEX query test using current. The query
executed is:
explain analyze select '000 5089 474e...' (a 16475-byte text
containing only 0-9a-z chars) like 'aaa';
and
explain
Can someone give me TODO items for this discussion?
What about:
Improve Unicode combined character handling
--
Tatsuo Ishii
Can someone give me TODO items for this discussion?
What about:
Improve Unicode combined character handling
Done. I can't update the web version because I don't have permission.
Also, have we decided if multibyte should be the configure default now?
--
Bruce Momjian
I would like to see SQL99's charset, collate functionality for 7.3 (or
later). If this happens, current multibyte implementation would be
dramatically changed...
I'm *still* interested in working on this (an old story I know). I'm
working on date/time stuff for 7.2, but hopefully 7.3
Hi,
* Tatsuo Ishii [EMAIL PROTECTED] [010925 18:18]:
So, this shows two problems :
- length() on the server side doesn't handle Unicode correctly [I
have the same result with char_length()], and returns the number
of chars (as it is however advertised to do), rather than the length
of the string.
This is a known limitation.
To solve this, we could use
Looks like a good project for 7.3.
Probably the best starting point would be to develop contrib/unicode
with a smooth transition to core.
Oleg
On Mon, 24 Sep 2001, Patrice Hédé wrote:
Hi all,
while working on a new project involving PostgreSQL and making some
tests, I
Hi all,
while working on a new project involving PostgreSQL and making some
tests, I have come up with the following output from psql :
 lang | length | length | text  | text
------+--------+--------+-------+-------
 isl  |      7 |      6 | álíta | áleit
 isl  |      7 |      7 |
So, this shows two problems :
- length() on the server side doesn't handle Unicode correctly [I have
the same result with char_length()], and returns the number of chars
(as it is however advertised to do), rather than the length of the
string.
This is a known limitation.
- the psql