Re: [HACKERS] Status report: regex replacement

2003-02-11 Thread Peter Eisentraut
Tatsuo Ishii writes:

> > UTF-8 seems to be the most popular, but even the XML standard requires all
> > compliant implementations to deal with at least both UTF-8 and UTF-16.
>
> I don't think PostgreSQL is going to natively support UTF-16.

At FOSDEM it was claimed that Windows natively uses UCS-2, and there are
also continuing rumours that the Java Unicode encoding is not quite UTF-8,
so there is going to be a certain pressure to support other Unicode
encodings besides UTF-8.

As for the names, the SQL standard defines most of those.

-- 
Peter Eisentraut   [EMAIL PROTECTED]
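
To make the UCS-2 versus UTF-16 distinction concrete, here is a minimal Python sketch (an illustration added to the thread, not code from any poster). The helper names utf16_units and fits_ucs2 are hypothetical. It assumes UCS-2 means "exactly one 16-bit unit per character", so anything above U+FFFF is out of reach, while UTF-16 covers such characters with a pair of units.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    # "utf-16-be" is big-endian UTF-16 without a byte-order mark.
    return len(text.encode("utf-16-be")) // 2


def fits_ucs2(text: str) -> bool:
    """True if every code point is at most U+FFFF (one 16-bit unit each)."""
    return all(ord(c) <= 0xFFFF for c in text)


bmp = "\u00fc"         # U+00FC, inside the Basic Multilingual Plane
astral = "\U00010348"  # U+10348 GOTHIC LETTER HWAIR, outside the BMP

print(utf16_units(bmp), fits_ucs2(bmp))        # 1 True
print(utf16_units(astral), fits_ucs2(astral))  # 2 False
print(len(astral.encode("utf-8")))             # 4
```

A client that really speaks UCS-2 could therefore exchange only part of what a UTF-8 database can store, which is one reason the choice of supported encodings is not purely a naming question.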


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Status report: regex replacement

2003-02-10 Thread Peter Eisentraut
Tom Lane writes:

> code is concerned: the regex library actually offers three regex
> flavors, advanced, extended, and basic, where extended matches
> what we had before (extended and basic correspond to different
> levels of the POSIX 1003.2 standard).  We just need a way to expose
> that knob to the user.  I am thinking about inventing yet another GUC
> parameter, say

Perhaps it should be exposed through different operators.  If someone uses
packages (especially functions) provided externally, they might have a
hard time coordinating which flavor is required by which part of what they
are using.

By analogy, imagine there was an environment variable that switched all
grep's to egrep's.  That would be a complete mess.

-- 
Peter Eisentraut   [EMAIL PROTECTED]





Re: [HACKERS] Status report: regex replacement

2003-02-10 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes:
> Tom Lane writes:
> > code is concerned: the regex library actually offers three regex
> > flavors, advanced, extended, and basic, where extended matches
> > what we had before (extended and basic correspond to different
> > levels of the POSIX 1003.2 standard).  We just need a way to expose
> > that knob to the user.  I am thinking about inventing yet another GUC
> > parameter, say
>
> Perhaps it should be exposed through different operators.  If someone uses
> packages (especially functions) provided externally, they might have a
> hard time coordinating which flavor is required by which part of what they
> are using.

But one could argue the contrary, too: if you've got an
externally-provided package there may be no convenient way to get it to
use, say, ~!#@ in place of ~.  GUC variables can come in awfully handy
in scenarios like that.

Also, if one *can* alter the SQL context in which a regexp is used, there
is a solution already provided by Spencer's regex metasyntax hack --- see
http://developer.postgresql.org/docs/postgres/functions-matching.html#POSIX-METASYNTAX
That is, one could write something like

foo ~ ('(?b)' || basic_regex_expression)

to force basic_regex_expression to be taken as a BRE and not the
extended syntax.  This is a tad uglier than changing the operator name,
perhaps, but it has some advantages too --- for one, the option could be
plugged into the string further upstream than where the SQL syntax is
determined.

Basically I think the flavor-as-GUC-variable approach is orthogonal to
inventing some new operator names.  We could do the latter too, but
I don't really see a need for it given the metasyntax feature.

regards, tom lane
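
As a rough sketch of the "plug the option in further upstream" idea: the snippet below is an illustration in Python, not code from PostgreSQL or from any poster. The helper name with_flavor and the FLAVOR_PREFIX table are hypothetical. The (?b) and (?e) prefixes are the embedded options described in the metasyntax documentation linked above, and the sketch assumes the pattern is otherwise parsed as an advanced RE, since that is where such prefixes are recognized.

```python
# Embedded-option prefixes; only the two older flavors from this
# thread are listed.
FLAVOR_PREFIX = {
    "basic": "(?b)",     # rest of the pattern is a BRE
    "extended": "(?e)",  # rest of the pattern is an ERE
}


def with_flavor(pattern: str, flavor: str) -> str:
    """Return pattern prefixed with the embedded option for flavor."""
    try:
        return FLAVOR_PREFIX[flavor] + pattern
    except KeyError:
        raise ValueError(f"unsupported flavor: {flavor!r}") from None


# A package author can tag the pattern where it is defined ...
legacy_pattern = with_flavor(r"a\{2,3\}", "basic")

# ... and hand it to the database later as an ordinary string value,
# so neither the operator nor a session-wide setting has to change.
print(legacy_pattern)  # (?b)a\{2,3\}
```

The point of the design is that the flavor travels with the pattern string itself, so code that only sees the string (and never the SQL text) can still decide how it is interpreted.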




Re: [HACKERS] Status report: regex replacement

2003-02-07 Thread Hannu Krosing
Tatsuo Ishii wrote on Fri, 07.02.2003 at 04:03:

> > UTF-8 seems to be the most popular, but even the XML standard requires all
> > compliant implementations to deal with at least both UTF-8 and UTF-16.
>
> I don't think PostgreSQL is going to natively support UTF-16.

By natively, do you mean as a backend storage format or as a supported
client encoding?

-- 
Hannu Krosing [EMAIL PROTECTED]




Re: [HACKERS] Status report: regex replacement

2003-02-06 Thread Tatsuo Ishii
> I have just committed the latest version of Henry Spencer's regex
> package (lifted from Tcl 8.4.1) into CVS HEAD.  This code is natively
> able to handle wide characters efficiently, and so it avoids the
> multibyte performance problems recently exhibited by Wade Klaver.
> I have not done extensive performance testing, but the new code seems
> at least as fast as the old, and much faster in some cases.

I have tested the new regex with src/test/mb and it all passed. So the
new code looks safe at least for EUC_CN, EUC_JP, EUC_KR, EUC_TW,
MULE_INTERNAL, UNICODE, though the test does not include all possible
regex patterns.
--
Tatsuo Ishii




Re: [HACKERS] Status report: regex replacement

2003-02-06 Thread Tim Allen
On Fri, 7 Feb 2003 00:49, Hannu Krosing wrote:
> Tatsuo Ishii wrote on Thu, 06.02.2003 at 17:05:
> > > Perhaps we should not call the encoding UNICODE but UTF8 (which it
> > > really is). UNICODE is a character set which has half a dozen official
> > > encodings and calling one of them UNICODE does not make things very
> > > clear.
> >
> > Right. Also, perhaps we should name LATIN1 or ISO-8859-1 more
> > precisely, since ISO-8859-1 can be encoded in either 7 bits or 8 bits
> > (we use the 8-bit form). I don't know what that encoding is called, though.
>
> I don't think that calling 8-bit ISO-8859-1 ISO-8859-1 can confuse
> anybody, but UCS-2 (ISO-10646-1), UTF-8 and UTF-16 are all widely used.
>
> UTF-8 seems to be the most popular, but even the XML standard requires all
> compliant implementations to deal with at least both UTF-8 and UTF-16.

Strong agreement from me, for whatever value you wish to place on my opinion. 
UTF-8 is a preferable name to UNICODE. The case for distinguishing 7-bit from 
8-bit latin1 seems much weaker.

Tim

-- 
---
Tim Allen  [EMAIL PROTECTED]
Proximity Pty Ltd  http://www.proximity.com.au/
  http://www4.tpg.com.au/users/rita_tim/





Re: [HACKERS] Status report: regex replacement

2003-02-06 Thread Hannu Krosing
On Thu, 2003-02-06 at 13:25, Tatsuo Ishii wrote:
> > I have just committed the latest version of Henry Spencer's regex
> > package (lifted from Tcl 8.4.1) into CVS HEAD.  This code is natively
> > able to handle wide characters efficiently, and so it avoids the
> > multibyte performance problems recently exhibited by Wade Klaver.
> > I have not done extensive performance testing, but the new code seems
> > at least as fast as the old, and much faster in some cases.
>
> I have tested the new regex with src/test/mb and it all passed. So the
> new code looks safe at least for EUC_CN, EUC_JP, EUC_KR, EUC_TW,
> MULE_INTERNAL, UNICODE, though the test does not include all possible
> regex patterns.

Perhaps we should not call the encoding UNICODE but UTF8 (which it
really is). UNICODE is a character set which has half a dozen official
encodings and calling one of them UNICODE does not make things very
clear.

-- 
Hannu Krosing [EMAIL PROTECTED]
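
The character set versus encoding distinction can be seen with a minimal Python sketch (added as an illustration, not taken from any post). It takes a single character and shows the bytes several encodings produce for it. The codec names are Python's own, not PostgreSQL encoding names, and the "-be" variants are chosen only to get big-endian output without a byte-order mark.

```python
# One character from the Unicode character set ...
ch = "\u00e9"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# ... and the bytes produced by several encodings of it.
encodings = {
    name: ch.encode(name)
    for name in ("latin-1", "utf-8", "utf-16-be", "utf-32-be")
}

for name, data in encodings.items():
    # bytes.hex() with a separator needs Python 3.8 or later.
    print(f"{name:<10} {data.hex(' ')}")
```

The same code point comes out as e9, c3 a9, 00 e9 and 00 00 00 e9 respectively, so a label like UNICODE says which characters are allowed but not which of these byte sequences is actually stored.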




Re: [HACKERS] Status report: regex replacement

2003-02-06 Thread Tatsuo Ishii
> Perhaps we should not call the encoding UNICODE but UTF8 (which it
> really is). UNICODE is a character set which has half a dozen official
> encodings and calling one of them UNICODE does not make things very
> clear.

Right. Also, perhaps we should name LATIN1 or ISO-8859-1 more
precisely, since ISO-8859-1 can be encoded in either 7 bits or 8 bits
(we use the 8-bit form). I don't know what that encoding is called, though.
--
Tatsuo Ishii




Re: [HACKERS] Status report: regex replacement

2003-02-06 Thread Hannu Krosing
Tatsuo Ishii wrote on Thu, 06.02.2003 at 17:05:
> > Perhaps we should not call the encoding UNICODE but UTF8 (which it
> > really is). UNICODE is a character set which has half a dozen official
> > encodings and calling one of them UNICODE does not make things very
> > clear.
>
> Right. Also, perhaps we should name LATIN1 or ISO-8859-1 more
> precisely, since ISO-8859-1 can be encoded in either 7 bits or 8 bits
> (we use the 8-bit form). I don't know what that encoding is called, though.

I don't think that calling 8-bit ISO-8859-1 ISO-8859-1 can confuse
anybody, but UCS-2 (ISO-10646-1), UTF-8 and UTF-16 are all widely used. 

UTF-8 seems to be the most popular, but even the XML standard requires all
compliant implementations to deal with at least both UTF-8 and UTF-16.

-- 
Hannu Krosing [EMAIL PROTECTED]




Re: [HACKERS] Status report: regex replacement

2003-02-06 Thread Tatsuo Ishii
> > Right. Also, perhaps we should name LATIN1 or ISO-8859-1 more
> > precisely, since ISO-8859-1 can be encoded in either 7 bits or 8 bits
> > (we use the 8-bit form). I don't know what that encoding is called, though.
>
> I don't think that calling 8-bit ISO-8859-1 ISO-8859-1 can confuse
> anybody, but UCS-2 (ISO-10646-1), UTF-8 and UTF-16 are all widely used.

I just pointed out that ISO-8859-1 is *not* an encoding, but a
character set.

> UTF-8 seems to be the most popular, but even the XML standard requires all
> compliant implementations to deal with at least both UTF-8 and UTF-16.

I don't think PostgreSQL is going to natively support UTF-16.
--
Tatsuo Ishii




Re: [HACKERS] Status report: regex replacement

2003-02-05 Thread Jon Jensen
On Wed, 5 Feb 2003, Tom Lane wrote:

> 1. There are a couple of minor incompatibilities between the advanced
> regex syntax implemented by this package and the syntax handled by our
> old code; in particular, backslash is now a special character within
> bracket expressions.  It seems to me that we'd better offer a switch
> to allow backwards compatibility.  This is easily done as far as the
> code is concerned: the regex library actually offers three regex
> flavors, advanced, extended, and basic, where extended matches
> what we had before (extended and basic correspond to different
> levels of the POSIX 1003.2 standard).  We just need a way to expose
> that knob to the user.  I am thinking about inventing yet another GUC
> parameter, say
>
>   set regex_flavor = advanced
>   set regex_flavor = extended
>   set regex_flavor = basic
[snip]
> Any suggestions about the name of the parameter?

Actually I think 'regex_flavor' sounds fine.

Jon




Re: [HACKERS] Status report: regex replacement

2003-02-05 Thread Christopher Kings-Lynne
> >   set regex_flavor = advanced
> >   set regex_flavor = extended
> >   set regex_flavor = basic
> [snip]
> > Any suggestions about the name of the parameter?
>
> Actually I think 'regex_flavor' sounds fine.

Not more Americanisms in our config files!! :P

Chris





Re: [HACKERS] Status report: regex replacement

2003-02-05 Thread Hannu Krosing
Christopher Kings-Lynne wrote on Thu, 06.02.2003 at 03:56:
> > >   set regex_flavor = advanced
> > >   set regex_flavor = extended
> > >   set regex_flavor = basic
> > [snip]
> > > Any suggestions about the name of the parameter?
> >
> > Actually I think 'regex_flavor' sounds fine.
>
> Not more Americanisms in our config files!! :P

Maybe support both, as with ANALYZE/ANALYSE?

While we're at it, could we add another variant, ANALÜÜSI, which
would be native for me? ;)

-- 
Hannu Krosing [EMAIL PROTECTED]
