Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-22 Thread David E. Wheeler

On Jul 18, 2008, at 01:39, Michael Paesold wrote:

Calling regex functions with the case-insensitivity option would be  
great. It should also be possible to rewrite replace() into  
regexp_replace() by first escaping the regex meta characters.


Actually re-implementing those functions in a case insensitive way  
would still be an option, but of course some amount of work. The  
question is, how much use case there is.


I've figured out how to make all the functions work using SQL function  
workarounds, converting things and re-dispatching to the text versions  
as appropriate. They work quite well, and can be converted to C later  
if that becomes a requirement.


Meanwhile, on the question of whether or not regular expression and  
LIKE comparisons *should* match case-insensitively, I have a couple  
more observations:


* Thinking about how a true case-insensitive collation would work, I'm  
quite certain that it would match case-insensitively. Anything else  
would just be unexpected, because in a case-insensitive collation,  
lowercase characters are, in practice, identical to uppercase  
characters. As far as matching is concerned, there is no difference  
between them. So the matching operators and functions against CITEXT  
should follow that assumption.


* I tried a few matches on MySQL, where the collation is case- 
insensitive by default, and it confirms my impression:


mysql select 'Foo' regexp 'o$';
+---+
| 'Foo' regexp 'o$' |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

mysql select 'Foo' regexp 'O$';
+---+
| 'Foo' regexp 'O$' |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

mysql select 'Foo' like '%o';
+-+
| 'Foo' like '%o' |
+-+
|   1 |
+-+
1 row in set (0.00 sec)

mysql select 'Foo' like '%O';
+-+
| 'Foo' like '%O' |
+-+
|   1 |
+-+
1 row in set (0.00 sec)

I'll grant that MySQL may not be the best model for how things should  
work, but it's something, at least. Anyone else got access to another  
database with case-insensitive collations to see how LIKE and regular  
expressions work?


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-21 Thread David E. Wheeler

On Jul 18, 2008, at 09:53, David E. Wheeler wrote:

However, if someone with a lot more C and Pg core knowledge wanted  
to sit down with me for a couple hours next week and help me bang  
out these functions, that would be great. I'd love to have the  
implementation be that much more complete.


I've implemented fixes for the regexp_* functions and strpos() in pure  
SQL, like so:


CREATE OR REPLACE FUNCTION regexp_matches( citext, citext ) RETURNS  
TEXT[] AS '

SELECT regexp_matches( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_matches( citext, citext, text )  
RETURNS TEXT[] AS '
SELECT regexp_matches( $1::text, $2::text, CASE WHEN strpos($3,  
''c'') = 0 THEN  $3 || ''i'' ELSE $3 END );

' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text )  
returns TEXT AS '

SELECT regexp_replace( $1::text, $2::text, $3, ''i'');
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text,  
text ) returns TEXT AS '
SELECT regexp_replace( $1::text, $2::text, $3, CASE WHEN  
strpos($4, ''c'') = 0 THEN  $4 || ''i'' ELSE $4 END);

' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext )  
RETURNS TEXT[] AS '

SELECT regexp_split_to_array( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext,  
text ) RETURNS TEXT[] AS '
SELECT regexp_split_to_array( $1::text, $2::text, CASE WHEN  
strpos($3, ''c'') = 0 THEN  $3 || ''i'' ELSE $3 END );

' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext )  
RETURNS SETOF TEXT AS '

SELECT regexp_split_to_table( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext,  
text ) RETURNS SETOF TEXT AS '
SELECT regexp_split_to_table( $1::text, $2::text, CASE WHEN  
strpos($3, ''c'') = 0 THEN  $3 || ''i'' ELSE $3 END );

' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION strpos( citext, citext ) RETURNS INT AS '
SELECT strpos( LOWER( $1::text ), LOWER( $2::text ) );
' LANGUAGE SQL IMMUTABLE STRICT;

Not so bad, though it'd be nice to have C functions that just did  
these things. Still not case-insensitive are:


-- replace()
-- split_part()
-- translate()

So, anyone at OSCON this week want to help me with these? Or to  
convert the above functions to C? Greg? Bruce?


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-18 Thread Michael Paesold

David E. Wheeler writes:


On Jul 17, 2008, at 03:45, Michael Paesold wrote:

Wouldn't it be possible to create a variant of regexp_replace, i.e.  
regexp_replace(citext,citext,text), which would again lower-case  
the first two arguments before passing the input to  
regexp_replace(text,text,text)?


Sure, but then you end up with this:

template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o');
regexp_replace

foo
(1 row)


Yeah, you are right, I see. :-)


Which is just wrong. I'm going to look at the regex C functions  
today and see if there's an easy way to just always pass them the  
'i' flag, which would do the trick. That still won't help replace(),  
split_part(), or translate(), however.


Calling regex functions with the case-insensitivity option would be  
great. It should also be possible to rewrite replace() into  
regexp_replace() by first escaping the regex meta characters.


Actually re-implementing those functions in a case insensitive way  
would still be an option, but of course some amount of work. The  
question is, how much use case there is.


Best Regards
Michael Paesold

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-18 Thread David E. Wheeler

On Jul 18, 2008, at 01:39, Michael Paesold wrote:

Calling regex functions with the case-insensitivity option would be  
great. It should also be possible to rewrite replace() into  
regexp_replace() by first escaping the regex meta characters.


Actually re-implementing those functions in a case insensitive way  
would still be an option, but of course some amount of work. The  
question is, how much use case there is.


Not much for me. I might use the regex functions, but would be happy  
to manually pass the i flag.


However, if someone with a lot more C and Pg core knowledge wanted to  
sit down with me for a couple hours next week and help me bang out  
these functions, that would be great. I'd love to have the  
implementation be that much more complete.


I do believe that, as it stands now in the v4 patch, citext is pretty  
close to ready, and certainly commit-able.


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-17 Thread Michael Paesold


Am 16.07.2008 um 20:38 schrieb David E. Wheeler:


The trouble is that, right now:

template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o');
regexp_replace

fxx
(1 row)

So there's an inconsistency there. I don't know how to make that  
work case-insensitively.


Wouldn't it be possible to create a variant of regexp_replace, i.e.  
regexp_replace(citext,citext,text), which would again lower-case the  
first two arguments before passing the input to  
regexp_replace(text,text,text)?


Best Regards
Michael Paesold

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-17 Thread David E. Wheeler

On Jul 17, 2008, at 03:45, Michael Paesold wrote:

Wouldn't it be possible to create a variant of regexp_replace, i.e.  
regexp_replace(citext,citext,text), which would again lower-case the  
first two arguments before passing the input to  
regexp_replace(text,text,text)?


Sure, but then you end up with this:

template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o');
regexp_replace

foo
(1 row)

Which is just wrong. I'm going to look at the regex C functions today  
and see if there's an easy way to just always pass them the 'i' flag,  
which would do the trick. That still won't help replace(),  
split_part(), or translate(), however.


Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PATCH: CITEXT 2.0 v4

2008-07-16 Thread David E. Wheeler

Howdy,

I've attached a new patch with the latest revisions of for the citext  
contrib module patch. The changes include:


* Using strlen() to pass string lengths to the comparison function,
  since lowercasing the value can change the length. Per Tom Lane.
* Made citextcmp consistently return int32, per Tom Lane.
* Made the hash index function return the proper value, per Tom Lane.
* Removed the COMMENTs and GRANTs from citext.sql.in.
* Added a cast function from bpchar to citext, as suggested by Tom Lane.
* Set the storage type for CITEXT to extended, to ensure that it will
  be toastable. Per Tom Lane.
* Fixed the COMMUTATOR of =.
* Changed the cast from citext to bpchar from implicit to assignment.
  This eliminates ambiguous function resolutions.
* Eliminated superflous functions, per Tom Lane.
* Removed unnecessary `OPERATOR()` calls in NEGATORs and the like.
* Added binary in/out functions. Per Tom Lane
* Added an explicit shell type to make the output a bit quieter.
* Converted tests to pure SQL and omitted multibyte tests (though a
  few remain commented-out).
* Reorganized and expanded the documentation a bit.

This version is far better than I started with, and I'm very grateful  
for the feedback.


Now, I have a few remaining questions to ask, mostly just to get your  
opinions:


* The README for citext 1.0 on pgFoundry says:

I had to make a decision on casting between types for regular  
expressions and
decided that if any parameter is of citext type then case  
insensitive applies.
For example applying regular expressions with a varchar and a citext  
will

produce a case-insensitive result.

Having thought about this afterwards I realised that since we have  
the option
to use case-insensitive results with regular expressions I should  
have left the
behaviour exactly as text and then you have the best of both  
worlds... oh well

not hard to change for any of you perfectionists!


I followed the original and made all the regex and LIKE comparisons  
case-insensitive. But maybe I should not have? Especially since the  
regular expression functions (e.g., regexp_replace()) and a few non- 
regex functions (e.g., replace()) still don't behave case-insensitively?


* If the answer is no, how can I make those functions behave case- 
insensitively? (See the TODO tests.)


* Should there be any other casts? To and from name, perhaps?

Thanks!

David


citext4.patch.gz
Description: GNU Zip compressed data



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-16 Thread David E. Wheeler

On Jul 15, 2008, at 22:23, David E. Wheeler wrote:


* The README for citext 1.0 on pgFoundry says:

I had to make a decision on casting between types for regular  
expressions and
decided that if any parameter is of citext type then case  
insensitive applies.
For example applying regular expressions with a varchar and a  
citext will

produce a case-insensitive result.

Having thought about this afterwards I realised that since we have  
the option
to use case-insensitive results with regular expressions I should  
have left the
behaviour exactly as text and then you have the best of both  
worlds... oh well

not hard to change for any of you perfectionists!


I followed the original and made all the regex and LIKE comparisons  
case-insensitive. But maybe I should not have? Especially since the  
regular expression functions (e.g., regexp_replace()) and a few non- 
regex functions (e.g., replace()) still don't behave case- 
insensitively?


I was thinking about this a bit last night and wanted to fill things  
out a bit.


As a programmer, I find Donald Fraser's hindsight to be more  
appealing, because at least that way I have the option to do matching  
against CITEXT strings case-sensitively when I want to.


OTOH, if what we want is to have CITEXT work more like a case- 
insensitive collation, then the expectation from the matching  
operators and functions might be different. Does anyone have any idea  
whether regex and LIKE matching against a string in a case-insensitive  
collation would match case-insensitively or not? If so, then maybe the  
regex and LIKE ops and funcs *should* match case-insensitively? If  
not, or if only for some collations, then I would think not.


Either way, I know of no way, currently, to allow functions like  
replace(), split_part(), strpos(), and translate() to match case- 
insensitiely, even if we wanted to. Anyone have any ideas?


* If the answer is no, how can I make those functions behave case- 
insensitively? (See the TODO tests.)


* Should there be any other casts? To and from name, perhaps?


Thanks,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-16 Thread Robert Treat
On Wednesday 16 July 2008 13:54:25 David E. Wheeler wrote:
 On Jul 15, 2008, at 22:23, David E. Wheeler wrote:
  * The README for citext 1.0 on pgFoundry says:
  I had to make a decision on casting between types for regular
  expressions and
  decided that if any parameter is of citext type then case
  insensitive applies.
  For example applying regular expressions with a varchar and a
  citext will
  produce a case-insensitive result.
 
  Having thought about this afterwards I realised that since we have
  the option
  to use case-insensitive results with regular expressions I should
  have left the
  behaviour exactly as text and then you have the best of both
  worlds... oh well
  not hard to change for any of you perfectionists!
 
  I followed the original and made all the regex and LIKE comparisons
  case-insensitive. But maybe I should not have? Especially since the
  regular expression functions (e.g., regexp_replace()) and a few non-
  regex functions (e.g., replace()) still don't behave case-
  insensitively?

 I was thinking about this a bit last night and wanted to fill things
 out a bit.

 As a programmer, I find Donald Fraser's hindsight to be more
 appealing, because at least that way I have the option to do matching
 against CITEXT strings case-sensitively when I want to.

 OTOH, if what we want is to have CITEXT work more like a case-
 insensitive collation, then the expectation from the matching
 operators and functions might be different. Does anyone have any idea
 whether regex and LIKE matching against a string in a case-insensitive
 collation would match case-insensitively or not? If so, then maybe the
 regex and LIKE ops and funcs *should* match case-insensitively? If
 not, or if only for some collations, then I would think not.

 Either way, I know of no way, currently, to allow functions like
 replace(), split_part(), strpos(), and translate() to match case-
 insensitiely, even if we wanted to. Anyone have any ideas?

  * If the answer is no, how can I make those functions behave case-
  insensitively? (See the TODO tests.)
 
  * Should there be any other casts? To and from name, perhaps?


AIUI, your propsing the following:

select 'x'::citext = 'X'::citext;
 ?column?
--
 t
(1 row)

select 'x'::citext ~ 'X'::citext;
 ?column?
--
 f
(1 row)

I understand the desire for flexibility, but the above seems wierd to me. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-16 Thread David E. Wheeler

On Jul 16, 2008, at 11:20, Robert Treat wrote:


I was thinking about this a bit last night and wanted to fill things
out a bit.

As a programmer, I find Donald Fraser's hindsight to be more
appealing, because at least that way I have the option to do matching
against CITEXT strings case-sensitively when I want to.

OTOH, if what we want is to have CITEXT work more like a case-
insensitive collation, then the expectation from the matching
operators and functions might be different. Does anyone have any idea
whether regex and LIKE matching against a string in a case- 
insensitive
collation would match case-insensitively or not? If so, then maybe  
the

regex and LIKE ops and funcs *should* match case-insensitively? If
not, or if only for some collations, then I would think not.

Either way, I know of no way, currently, to allow functions like
replace(), split_part(), strpos(), and translate() to match case-
insensitiely, even if we wanted to. Anyone have any ideas?


* If the answer is no, how can I make those functions behave case-
insensitively? (See the TODO tests.)

* Should there be any other casts? To and from name, perhaps?




AIUI, your propsing the following:

select 'x'::citext = 'X'::citext;
?column?
--
t
(1 row)

select 'x'::citext ~ 'X'::citext;
?column?
--
f
(1 row)

I understand the desire for flexibility, but the above seems wierd  
to me.


That's what Donald Fraser suggested, and I see some value in that, but  
wanted to get some other opinions. And you're right, that does seem a  
bit weird.


The trouble is that, right now:

template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o');
 regexp_replace

 fxx
(1 row)

So there's an inconsistency there. I don't know how to make that work  
case-insensitively.


Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers