Re: [PATCHES] [HACKERS] Unicode combining characters

2001-10-14 Thread Tatsuo Ishii
I have committed part of Patrice's patches with minor fixes. The uncommitted changes are related to the backend side; the reason can be found in the previous discussions (basically, this is because the current regex code does not support UTF-8 chars = 0x1). Instead pg_verifymbstr()

Re: [HACKERS] Unicode combining characters

2001-10-12 Thread Patrice Hédé
* Bruce Momjian [EMAIL PROTECTED] [011011 22:49]: Can I ask about the status of this? I have sent a patch a few days ago solving the client-side issue (on the pgsql-patches mailing list) for review. I think Tatsuo said it looked OK, however he should confirm/infirm this. There is still the

Re: [HACKERS] Unicode combining characters

2001-10-12 Thread Tatsuo Ishii
* Bruce Momjian [EMAIL PROTECTED] [011011 22:49]: Can I ask about the status of this? I have sent a patch a few days ago solving the client-side issue (on the pgsql-patches mailing list) for review. I think Tatsuo said it looked OK, however he should confirm/infirm this. I've been

Re: [HACKERS] Unicode combining characters

2001-10-12 Thread Bruce Momjian
* Bruce Momjian [EMAIL PROTECTED] [011011 22:49]: Can I ask about the status of this? I have sent a patch a few days ago solving the client-side issue (on the pgsql-patches mailing list) for review. I think Tatsuo said it looked OK, however he should confirm/infirm this. OK, I saw the

Re: [HACKERS] Unicode combining characters

2001-10-11 Thread Bruce Momjian
Can I ask about the status of this? Hi all, while working on a new project involving PostgreSQL and making some tests, I have come up with the following output from psql:

 lang | length | length | text  | text
------+--------+--------+-------+-------
 isl  |      7 |

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Tatsuo Ishii
Maybe something like this: declare a plpgsql function that takes two text parameters and has a body like for (i = 0 to a million) boolvar := $1 like $2; Then call it with strings of different lengths and see how the runtime varies. You need to apply the LIKE to
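The benchmarking idea above (repeat one LIKE comparison many times, then vary the input length) can be sketched outside the database. This is only a rough Python stand-in under stated assumptions: `like_to_regex` and `time_like` are hypothetical helper names, the LIKE semantics are approximated with a regular expression (`%` as any run, `_` as any single character), and nothing here measures PostgreSQL's own LIKE code.

```python
import re
import time


def like_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simple LIKE pattern into a compiled regex (no escape support)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")          # % matches any run of characters
        elif ch == "_":
            parts.append(".")           # _ matches exactly one character
        else:
            parts.append(re.escape(ch)) # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def time_like(text: str, pattern: str, iterations: int = 100_000) -> float:
    """Return the seconds spent evaluating the same match `iterations` times."""
    rx = like_to_regex(pattern)
    start = time.perf_counter()
    for _ in range(iterations):
        rx.fullmatch(text)              # LIKE must match the whole string
    return time.perf_counter() - start


if __name__ == "__main__":
    # Vary the text length and watch how the runtime scales.
    for size in (32, 64, 128, 256, 512, 1024):
        elapsed = time_like("x" * size, "%aaa%", iterations=10_000)
        print(f"{size:5d} chars: {elapsed:.4f} s")
```

Raising `iterations` until each run takes seconds rather than milliseconds is what makes differences stand out from timer noise.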

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Zeugswetter Andreas SB SD
----- shell script -----
for i in 32 64 128 256 512 1024 2048 4096 8192
do
psql -c "explain analyze select liketest(a,'aaa') from (select substring('very_long_text' from 0 for $i) as a) as a" test
done
----- shell script -----
I don't think your

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Tatsuo Ishii
I don't think your search string is sufficient for a test. With 'aaa' it actually knows that it only needs to look at the first three characters of a. Imho you need to try something like liketest(a,'%aaa%'). Ok. I ran the modified test (now the iteration is reduced to 10 in

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Tatsuo Ishii
OK. I ran the modified test (the iteration count is now reduced to 10 in liketest()). As you can see, there's a huge difference. MB seems up to ~8 times slower. There seem to be some problems in the implementation. Considering REGEX is not so slow, maybe we should employ the same design as

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Tom Lane
Tatsuo Ishii [EMAIL PROTECTED] writes: To accomplish this, I moved MatchText etc. to a separate file and now like.c includes it *twice* (a similar technique is used in regexec()). This makes like.o a little bit larger, but I believe this is worth it for the optimization. That sounds great. What's

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Tatsuo Ishii
What's your feeling now about the original question: whether to enable multibyte by default now, or not? I'm still thinking that Peter's counsel is the wisest: plan to do it in 7.3, not today. But this fix seems to eliminate the only hard reason we have not to do it today ... If SQL99's

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Tom Lane
Tatsuo Ishii [EMAIL PROTECTED] writes: What do you think? I think that we were supposed to go beta a month ago, and so this is no time to start adding new features to this release. Let's plan to make this happen (one way or the other) in 7.3, instead. regards, tom lane

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Tom Lane
Tatsuo Ishii [EMAIL PROTECTED] writes: ... There seem to be some problems in the implementation. Considering REGEX is not so slow, maybe we should employ the same design as REGEX, i.e. using wide characters, not multibyte streams... Seems like a good thing to put on the to-do list. In

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Zeugswetter Andreas SB SD
Tatsuo Ishii [EMAIL PROTECTED] writes: ... There seem to be some problems in the implementation. Considering REGEX is not so slow, maybe we should employ the same design as REGEX, i.e. using wide characters, not multibyte streams... Seems like a good thing to put on the to-do

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Bruce Momjian
Tatsuo Ishii [EMAIL PROTECTED] writes: ... There seem to be some problems in the implementation. Considering REGEX is not so slow, maybe we should employ the same design as REGEX, i.e. using wide characters, not multibyte streams... Seems like a good thing to put on the to-do list.

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Peter Eisentraut
Tom Lane writes: In the meantime, we still have the question of whether to enable multibyte in the default configuration. This would make more sense if all of multibyte, locale, and NLS became defaults in one release. I haven't quite sold people on the second item yet, although I have a

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes: Tom Lane writes: In the meantime, we still have the question of whether to enable multibyte in the default configuration. Perhaps we could make it a release goal for 7.3 Yeah, that's probably the best way to proceed... it's awfully late in the 7.2

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes: Added to TODO: * Use wide characters to evaluate regular expressions, for performance (Tatsuo) Regexes are fine; it's LIKE that's slow. regards, tom lane

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Tatsuo Ishii
I think that we were supposed to go beta a month ago, and so this is no time to start adding new features to this release. Let's plan to make this happen (one way or the other) in 7.3, instead. Agreed. -- Tatsuo Ishii

Re: [HACKERS] Unicode combining characters

2001-10-03 Thread Bruce Momjian
OK. I ran the modified test (the iteration count is now reduced to 10 in liketest()). As you can see, there's a huge difference. MB seems up to ~8 times slower. There seem to be some problems in the implementation. Considering REGEX is not so slow, maybe we should employ the same

Re: [HACKERS] Unicode combining characters

2001-10-02 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes: If no one can find a case where multibyte is slower, I think we should enable it by default. Comments? Well, he just did point out such a case:

        no MB      with MB
LIKE    0.09 msec  0.08 msec
REGEX   0.09 msec

Re: [HACKERS] Unicode combining characters

2001-10-02 Thread Bruce Momjian
Bruce Momjian [EMAIL PROTECTED] writes: If no one can find a case where multibyte is slower, I think we should enable it by default. Comments? Well, he just did point out such a case:

        no MB      with MB
LIKE    0.09 msec  0.08 msec
REGEX   0.09 msec

Re: [HACKERS] Unicode combining characters

2001-10-02 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes: But the strange thing is that LIKE is faster, perhaps meaning his measurements can't even see the difference, Yeah, I suspect there's 10% or more noise in these numbers. But then one could read the results as saying we can't reliably measure any

Re: [HACKERS] Unicode combining characters

2001-10-02 Thread Peter Eisentraut
Tatsuo Ishii writes: LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a little bit slow. Probably this is due to the wide character conversion overhead. Could this conversion be optimized to recognize when it's dealing with a single-byte character encoding? -- Peter

Re: [HACKERS] Unicode combining characters

2001-10-02 Thread Tatsuo Ishii
Yeah, I suspect there's 10% or more noise in these numbers. But then one could read the results as saying we can't reliably measure any difference at all ... I'd feel more confident if the measurements were done using operators repeated enough times to yield multiple-second runtimes. I

Re: [HACKERS] Unicode combining characters

2001-10-02 Thread Tatsuo Ishii
LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a little bit slow. Probably this is due to the wide character conversion overhead. Could this conversion be optimized to recognize when it's dealing with a single-byte character encoding? Not sure, will look into... --

Re: [HACKERS] Unicode combining characters

2001-10-02 Thread Tom Lane
Tatsuo Ishii [EMAIL PROTECTED] writes: I'd feel more confident if the measurements were done using operators repeated enough times to yield multiple-second runtimes. Any idea to do that? Maybe something like this: declare a plpgsql function that takes two text parameters and has a body like

Re: [HACKERS] Unicode combining characters

2001-10-02 Thread Tatsuo Ishii
Also, have we decided if multibyte should be the configure default now? Not sure. Anyway I have tested LIKE/REGEX query test using current. The query executed is: explain analyze select '000 5089 474e...( 16475 bytes long text containing only 0-9a-z chars) like 'aaa'; and explain

Re: [HACKERS] Unicode combining characters

2001-10-02 Thread Bruce Momjian
If no one can find a case where multibyte is slower, I think we should enable it by default. Comments? Also, have we decided if multibyte should be the configure default now? Not sure. Anyway I have tested LIKE/REGEX query test using current. The query executed is: explain analyze

Re: [HACKERS] Unicode combining characters

2001-10-01 Thread Tatsuo Ishii
Can someone give me TODO items for this discussion? What about: Improve Unicode combined character handling -- Tatsuo Ishii So, this shows two problems: - length() on the server side doesn't handle Unicode correctly [I have the same result with char_length()], and returns the

Re: [HACKERS] Unicode combining characters

2001-10-01 Thread Bruce Momjian
Can someone give me TODO items for this discussion? What about: Improve Unicode combined character handling Done. I can't update the web version because I don't have permission. Also, have we decided if multibyte should be the configure default now? -- Bruce Momjian

Re: [HACKERS] Unicode combining characters

2001-10-01 Thread Bruce Momjian
Can someone give me TODO items for this discussion? So, this shows two problems: - length() on the server side doesn't handle Unicode correctly [I have the same result with char_length()], and returns the number of chars (as it is however advertised to do), rather than the length of

Re: [HACKERS] Unicode combining characters

2001-09-26 Thread Tatsuo Ishii
I would like to see SQL99's charset, collate functionality for 7.3 (or later). If this happens, current multibyte implementation would be dramatically changed... I'm *still* interested in working on this (an old story I know). I'm working on date/time stuff for 7.2, but hopefully 7.3

Re: [HACKERS] Unicode combining characters

2001-09-25 Thread Patrice Hédé
Hi, * Tatsuo Ishii [EMAIL PROTECTED] [010925 18:18]: So, this shows two problems: - length() on the server side doesn't handle Unicode correctly [I have the same result with char_length()], and returns the number of chars (as it is however advertised to do), rather than the length

Re: [HACKERS] Unicode combining characters

2001-09-25 Thread Tatsuo Ishii
- length() on the server side doesn't handle Unicode correctly [I have the same result with char_length()], and returns the number of chars (as it is however advertised to do), rather than the length of the string. This is a known limitation. To solve this, we could use

Re: [HACKERS] Unicode combining characters

2001-09-25 Thread Oleg Bartunov
Looks like a good project for 7.3. Probably the best starting point would be to develop contrib/unicode with a smooth transition to core. Oleg On Mon, 24 Sep 2001, Patrice Hédé wrote: Hi all, while working on a new project involving PostgreSQL and making some tests, I

Re: [HACKERS] Unicode combining characters

2001-09-25 Thread Thomas Lockhart
I would like to see SQL99's charset, collate functionality for 7.3 (or later). If this happens, current multibyte implementation would be dramatically changed... I'm *still* interested in working on this (an old story I know). I'm working on date/time stuff for 7.2, but hopefully 7.3 will see

[HACKERS] Unicode combining characters

2001-09-24 Thread Patrice Hédé
Hi all, while working on a new project involving PostgreSQL and making some tests, I have come up with the following output from psql:

 lang | length | length | text  | text
------+--------+--------+-------+-------
 isl  |      7 |      6 | álíta | áleit
 isl  |      7 |      7 |
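The lengths 7 and 6 reported for the five-letter words álíta and áleit are what one would expect if the accents are stored as separate combining characters. The following Python sketch illustrates that reading; it assumes the decomposed form (base letter followed by U+0301 COMBINING ACUTE ACCENT), and the helper names are made up for illustration rather than taken from PostgreSQL or psql.

```python
import unicodedata


def code_point_length(s: str) -> int:
    """Count code points, i.e. what a per-character length() would report."""
    return len(s)


def visible_length(s: str) -> int:
    """Count code points that are not combining marks (an approximation
    of how many characters a reader actually sees)."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# "álíta" written with combining acute accents (assumed decomposed form).
alita = "a\u0301li\u0301ta"
# "áleit" written the same way.
aleit = "a\u0301leit"

print(code_point_length(alita))   # 7
print(code_point_length(aleit))   # 6
print(visible_length(alita))      # 5
print(visible_length(aleit))      # 5
print(len(alita.encode("utf-8"))) # 9 bytes in UTF-8

# Normalizing to the precomposed form (NFC) makes both counts agree.
print(code_point_length(unicodedata.normalize("NFC", alita)))  # 5
```

A display routine that pads columns by code-point count would misalign such text, since combining marks occupy no column of their own.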

Re: [HACKERS] Unicode combining characters

2001-09-24 Thread Tatsuo Ishii
So, this shows two problems: - length() on the server side doesn't handle Unicode correctly [I have the same result with char_length()], and returns the number of chars (as it is however advertised to do), rather than the length of the string. This is a known limitation. - the psql