Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-09 Thread David E. Wheeler

On Jul 7, 2008, at 12:06, David E. Wheeler wrote:

I understand it but there is parallel project which should solve  
this problem completely I guess in close future (2-3years).  
Afterward this module will be obsolete and it will takes times to  
remove it from contrib. It seems to me that have citext in contrib  
only for two releases is not optimal solution.


I guess that'd be the reason to keep it on pgFoundry, but I have two  
comments:


* 2-3 years is a *long* time in Internet time.

* There is on guarantee that it will be finished in that time or,  
indeed ever (no disrespect to Radek Strnad, it's just there are  
always unforeseen issues).


One other thing that occurred to me yesterday: Given that the feature  
will ultimately support column-level collations, I suspect that it  
will be much easier to migrate CITEXT to a case-insensitive collation  
(perhaps using an updated CITEXT contrib module that just does so  
transparently) than to migrate application code from using LOWER() all  
over the place to not using. One transition requires a change to the  
schema, the other requires a change to application code, of which  
there is generally a lot more.


Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-08 Thread Martijn van Oosterhout
On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote:
 I guess that'd be the reason to keep it on pgFoundry, but I have two  
 comments:
 
 * 2-3 years is a *long* time in Internet time.

There have been patches over the years, but they tend not to get looked
at. Recently someone pulled up the COLLATE patch from a couple of years
ago but it didn't get much feedback either. (I can't find the link
right now).

It's disappointing that the discussions get hung up on the ICU library
when it's not required or even needed for COLLATE support. My original
patch never even mentioned it.

I note that Firebird added COLLATE using ICU a few years back now. I
think PostgreSQL is the only large DBMS to not support it.

 * There is on guarantee that it will be finished in that time or,  
 indeed ever (no disrespect to Radek Strnad, it's just there are always  
 unforeseen issues).

I think that with concerted coding effort it could be done in 2-3
months (judging by how long it took to write the first version). The
problem is it needs some planner kung-fu which not many people have.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-08 Thread Zdenek Kotala

Martijn van Oosterhout napsal(a):

On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote:
I guess that'd be the reason to keep it on pgFoundry, but I have two  
comments:


* 2-3 years is a *long* time in Internet time.


There have been patches over the years, but they tend not to get looked
at. Recently someone pulled up the COLLATE patch from a couple of years
ago but it didn't get much feedback either. (I can't find the link
right now).


I know about it. I have printed your proposal on my desk. I think It is linked 
from TODO list.



It's disappointing that the discussions get hung up on the ICU library
when it's not required or even needed for COLLATE support. My original
patch never even mentioned it.

I note that Firebird added COLLATE using ICU a few years back now. I
think PostgreSQL is the only large DBMS to not support it.


Complete agree. Collation missing support is big problem for many users.

* There is on guarantee that it will be finished in that time or,  
indeed ever (no disrespect to Radek Strnad, it's just there are always  
unforeseen issues).


I think that with concerted coding effort it could be done in 2-3
months (judging by how long it took to write the first version). The
problem is it needs some planner kung-fu which not many people have.


I agree that 2-3 months on fulltime is good estimation, problem is that you need 
kung-fu master which has time to do it :(. What we currently have is student 
which works on it in free time.


Zdenek

--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread Zdenek Kotala

David E. Wheeler napsal(a):

On Jun 27, 2008, at 18:22, David E. Wheeler wrote:

Please find attached a patch adding a locale-aware, case-insensitive 
text type, called citext, as a contrib module.


Here is a new version of the patch, with the following changes:

* Fixed formatting to be more like core.
* Added appropriate NEGATORs to operators.
* Removed NEGATOR from the || operator.
* Added hash index function and operator class.
* The = operator now supports HASHES and MERGES.
* citext_cmp and citextcmp both return int32.
* Changed // comments to /* comments */.
* Added test confirming láska'::citext  'laská'::citext.
* A few other organizational, formatting, and pasto fixes.
* Updated the FAQ entry on case-insensitive queries to recommend citext 
(it would, of course, need to be translated).


Stuff I was asked about but didn't change:

* citext_cmp() still uses varstr_cmp() instead of strncmp(). When I 
tried the latter, everything seemed to be equivalent.


citext_eq (operator =) should use strncmp and citext_cmp() should use 
varstr_cmp().


* citext_smaller() and citext_larger() don't have memory leaks, says 
Tom, so I added no calls to PG_FREE_IF_COPY().


Yeah, it is correct. FGMR does clean job for you.


However, It seems to me that code is ok now (exclude citex_eq). I think there 
two open problems/questions:


1) regression test -

  a) I think that regresion is not correct. It depends on en_US locale, but it 
uses characters which is not in related character repertoire. It means comparing 
is not defined and I guess it could generate different result on different OS - 
depends on locale implementation.


  b) pgTap is something new. Need make a decision if this framework is 
acceptable or not.



2) contrib vs. pgFoundry

There is unresolved answer if we want to have this in contrib or not. Good to 
mention that citext type will be obsoleted with full collation implementation in 
a future. I personally prefer to keep it on pgFoundry because it is temporally 
workaround (by my opinion), but I can live with contrib module as well.



Zdenek



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread Andrew Dunstan



Zdenek Kotala wrote:



2) contrib vs. pgFoundry

There is unresolved answer if we want to have this in contrib or not. 
Good to mention that citext type will be obsoleted with full collation 
implementation in a future. I personally prefer to keep it on 
pgFoundry because it is temporally workaround (by my opinion), but I 
can live with contrib module as well.



  


Is there a concrete plan to get to full collation support (i.e. per 
column) in any known time frame? If not, then I think a citext module 
would be acceptable.


What does still bother me is its performance. I'd like to know if any 
measurement has been done of using citext vs. a functional index on 
lower(foo).


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler

On Jul 7, 2008, at 07:41, Zdenek Kotala wrote:

However, It seems to me that code is ok now (exclude citex_eq). I  
think there two open problems/questions:


1) regression test -

 a) I think that regresion is not correct. It depends on en_US  
locale, but it uses characters which is not in related character  
repertoire. It means comparing is not defined and I guess it could  
generate different result on different OS - depends on locale  
implementation.


That I don't know about. The test requires en_US.UTF-8, at least at  
this point. How are tests run on the build farm? And how else could I  
ensure that comparisons are case-insensitive for non-ASCII characters  
other than requiring a Unicode locale? Or is it just an issue for the  
sort order tests? For those, I could potentially remove accented  
characters, just as long as I'm verifying in other tests that  
comparisons are case-insensitive (without worrying about collation).


 b) pgTap is something new. Need make a decision if this framework  
is acceptable or not.


Well, from the point of view of `make installcheck`, it's invisible.  
I've submitted a talk proposal for pgDay.US on ptTAP. I'm happy to  
discuss it further here though, if folks are interested.



2) contrib vs. pgFoundry

There is unresolved answer if we want to have this in contrib or  
not. Good to mention that citext type will be obsoleted with full  
collation implementation in a future. I personally prefer to keep it  
on pgFoundry because it is temporally workaround (by my opinion),  
but I can live with contrib module as well.


I second what Andrew has said in reply to this issue. I'll also say  
that, since people *so* often end up using `WHERE LOWER(col) =  
LOWER(?)`, that it'd be very valuable to have citext in contrib,  
especially since so few folks seem to even know about pgFoundry, let  
alone be able to find it. I mean, look at these search results:


  http://www.google.com/search?q=PostgreSQL%20case-insensitive%20text

My blog entry about this patch is hit #3. pgFoundry (and CITEXT 1) is  
#7. Last time I did a query like this, it didn't turn up at all.


Belive me, I'd love for pgFoundry (or something like it) to become the  
CPAN for PostgreSQL. But without some love and SEO, I don't think  
that's gonna happen.


Besides, CITEXT 2 would be a PITA to maintain for both 8.3 and 8.4,  
given the changes in the string comparison API in CVS.


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler

On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

What does still bother me is its performance. I'd like to know if  
any measurement has been done of using citext vs. a functional index  
on lower(foo).


That's a good question. I can whip up a quick test by populating a  
column full of data and trying both, but I'm no performance testing  
expert. Anyone else know more about such performance testing who can  
help out?


Many Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread Zdenek Kotala

David E. Wheeler napsal(a):

On Jul 7, 2008, at 07:41, Zdenek Kotala wrote:

However, It seems to me that code is ok now (exclude citex_eq). I 
think there two open problems/questions:


1) regression test -

 a) I think that regresion is not correct. It depends on en_US locale, 
but it uses characters which is not in related character repertoire. 
It means comparing is not defined and I guess it could generate 
different result on different OS - depends on locale implementation.


That I don't know about. The test requires en_US.UTF-8, at least at this 
point. How are tests run on the build farm? And how else could I ensure 
that comparisons are case-insensitive for non-ASCII characters other 
than requiring a Unicode locale? Or is it just an issue for the sort 
order tests? For those, I could potentially remove accented characters, 
just as long as I'm verifying in other tests that comparisons are 
case-insensitive (without worrying about collation).


Hmm, it is complex area and I'm not sure if anybody on planet know correct 
answer :-). I think that best approach is to keep it as is and when a problem 
occur then it will be fixed.


 b) pgTap is something new. Need make a decision if this framework is 
acceptable or not.


Well, from the point of view of `make installcheck`, it's invisible. 
I've submitted a talk proposal for pgDay.US on ptTAP. I'm happy to 
discuss it further here though, if folks are interested.


Yeah, it is invisible, but question is why don't put it as a framework to common 
place. But it is for another discussion.



2) contrib vs. pgFoundry

There is unresolved answer if we want to have this in contrib or not. 
Good to mention that citext type will be obsoleted with full collation 
implementation in a future. I personally prefer to keep it on 
pgFoundry because it is temporally workaround (by my opinion), but I 
can live with contrib module as well.


I second what Andrew has said in reply to this issue. I'll also say 
that, since people *so* often end up using `WHERE LOWER(col) = 
LOWER(?)`, that it'd be very valuable to have citext in contrib, 
especially since so few folks seem to even know about pgFoundry, let 
alone be able to find it. I mean, look at these search results:


I understand it but there is parallel project which should solve this problem 
completely I guess in close future (2-3years). Afterward this module will be 
obsolete and it will takes times to remove it from contrib. It seems to me that 
have citext in contrib only for two releases is not optimal solution.


Zdenek



--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread Zdenek Kotala

Andrew Dunstan napsal(a):



Zdenek Kotala wrote:



2) contrib vs. pgFoundry

There is unresolved answer if we want to have this in contrib or not. 
Good to mention that citext type will be obsoleted with full collation 
implementation in a future. I personally prefer to keep it on 
pgFoundry because it is temporally workaround (by my opinion), but I 
can live with contrib module as well.



  


Is there a concrete plan to get to full collation support (i.e. per 
column) in any known time frame? If not, then I think a citext module 
would be acceptable.


Collation per database should be part of 8.4. Radek Strnad works on it as a SoC 
project. He plans to continue on this work and implement full support as his 
Master of Thesis in 2-3 years time frame.


Zdenek


--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler

On Jul 7, 2008, at 11:54, Zdenek Kotala wrote:

Hmm, it is complex area and I'm not sure if anybody on planet know  
correct answer :-). I think that best approach is to keep it as is  
and when a problem occur then it will be fixed.


Regression tests are really important, though.

b) pgTap is something new. Need make a decision if this framework  
is acceptable or not.
Well, from the point of view of `make installcheck`, it's  
invisible. I've submitted a talk proposal for pgDay.US on ptTAP.  
I'm happy to discuss it further here though, if folks are interested.


Yeah, it is invisible, but question is why don't put it as a  
framework to common place. But it is for another discussion.


It is in a common place as a project, on pgFoundry. Whether the core  
team wants to use it for testing other parts of PostgreSQL (core or  
contrib) and therefore put it in a central location is, as you say, a  
separate conversation. It'd be easy to move it in such a case, of  
course.


I understand it but there is parallel project which should solve  
this problem completely I guess in close future (2-3years).  
Afterward this module will be obsolete and it will takes times to  
remove it from contrib. It seems to me that have citext in contrib  
only for two releases is not optimal solution.


I guess that'd be the reason to keep it on pgFoundry, but I have two  
comments:


* 2-3 years is a *long* time in Internet time.

* There is on guarantee that it will be finished in that time or,  
indeed ever (no disrespect to Radek Strnad, it's just there are always  
unforeseen issues).


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread Pavel Stehule
2008/7/7 Zdenek Kotala [EMAIL PROTECTED]:
 David E. Wheeler napsal(a):

 On Jul 7, 2008, at 07:41, Zdenek Kotala wrote:

 However, It seems to me that code is ok now (exclude citex_eq). I think
 there two open problems/questions:

 1) regression test -

  a) I think that regresion is not correct. It depends on en_US locale,
 but it uses characters which is not in related character repertoire. It
 means comparing is not defined and I guess it could generate different
 result on different OS - depends on locale implementation.

 That I don't know about. The test requires en_US.UTF-8, at least at this
 point. How are tests run on the build farm? And how else could I ensure that
 comparisons are case-insensitive for non-ASCII characters other than
 requiring a Unicode locale? Or is it just an issue for the sort order tests?
 For those, I could potentially remove accented characters, just as long as
 I'm verifying in other tests that comparisons are case-insensitive (without
 worrying about collation).

 Hmm, it is complex area and I'm not sure if anybody on planet know correct
 answer :-). I think that best approach is to keep it as is and when a
 problem occur then it will be fixed.


Maybe we can have some locale test outside our regress tests -

  b) pgTap is something new. Need make a decision if this framework is
 acceptable or not.

 Well, from the point of view of `make installcheck`, it's invisible. I've
 submitted a talk proposal for pgDay.US on ptTAP. I'm happy to discuss it
 further here though, if folks are interested.

 Yeah, it is invisible, but question is why don't put it as a framework to
 common place. But it is for another discussion.

 2) contrib vs. pgFoundry

 There is unresolved answer if we want to have this in contrib or not.
 Good to mention that citext type will be obsoleted with full collation
 implementation in a future. I personally prefer to keep it on pgFoundry
 because it is temporally workaround (by my opinion), but I can live with
 contrib module as well.

 I second what Andrew has said in reply to this issue. I'll also say that,
 since people *so* often end up using `WHERE LOWER(col) = LOWER(?)`, that
 it'd be very valuable to have citext in contrib, especially since so few
 folks seem to even know about pgFoundry, let alone be able to find it. I
 mean, look at these search results:

 I understand it but there is parallel project which should solve this
 problem completely I guess in close future (2-3years). Afterward this
 module will be obsolete and it will takes times to remove it from contrib.
 It seems to me that have citext in contrib only for two releases is not
 optimal solution.

Zdenek


Regards
Pavel


 --
 Zdenek Kotala  Sun Microsystems
 Prague, Czech Republic http://sun.com/postgresql


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler

On Jul 7, 2008, at 12:10, Pavel Stehule wrote:


Maybe we can have some locale test outside our regress tests -


I think that would be useful.

Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler

On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

What does still bother me is its performance. I'd like to know if  
any measurement has been done of using citext vs. a functional index  
on lower(foo).


Okay, here's a start. The attached script inserts random strings of  
1-10 space-delimited words into text and citext columns, and then  
compares the performance of queries with and without indexes. The  
output for me is as follows:


Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 254.254 ms
SELECT * FROM try WHERE citext = 'food';
Time: 288.535 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.385 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 236.186 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 235.818 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 1.260 ms
SELECT * FROM try WHERE citext = 'food';
Time: 277.755 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.073 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 238.430 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 238.685 ms
benedict%

So for some reason, after adding the indexes, the queries against the  
CITEXT column aren't using them. Furthermore, the `lower(text) LIKE  
lower(?)` query isn't using *its* index. Huh?


So this leaves me with two questions:

1. For what reason would the query against the citext column *not* use  
the index?


2. Is there some way to get the CITEXT index to behave like a LOWER()  
index, that is, so that its value is stored using the result of the  
str_tolower() function, thus removing some of the overhead of  
converting the values for each row fetched from the index? (Does this  
question make any sense?)


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler

And here is the script. D'oh!

Thanks,

David



try.sql
Description: Binary data




On Jul 7, 2008, at 16:24, David E. Wheeler wrote:


On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

What does still bother me is its performance. I'd like to know if  
any measurement has been done of using citext vs. a functional  
index on lower(foo).


Okay, here's a start. The attached script inserts random strings of  
1-10 space-delimited words into text and citext columns, and then  
compares the performance of queries with and without indexes. The  
output for me is as follows:


Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 254.254 ms
SELECT * FROM try WHERE citext = 'food';
Time: 288.535 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.385 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 236.186 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 235.818 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 1.260 ms
SELECT * FROM try WHERE citext = 'food';
Time: 277.755 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.073 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 238.430 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 238.685 ms
benedict%

So for some reason, after adding the indexes, the queries against  
the CITEXT column aren't using them. Furthermore, the `lower(text)  
LIKE lower(?)` query isn't using *its* index. Huh?


So this leaves me with two questions:

1. For what reason would the query against the citext column *not*  
use the index?


2. Is there some way to get the CITEXT index to behave like a  
LOWER() index, that is, so that its value is stored using the result  
of the str_tolower() function, thus removing some of the overhead of  
converting the values for each row fetched from the index? (Does  
this question make any sense?)


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler

No, *really*

Sheesh, sorry.

David



try.sql
Description: Binary data




On Jul 7, 2008, at 16:26, David E. Wheeler wrote:


And here is the script. D'oh!

Thanks,

David

try.sql


On Jul 7, 2008, at 16:24, David E. Wheeler wrote:


On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

What does still bother me is its performance. I'd like to know if  
any measurement has been done of using citext vs. a functional  
index on lower(foo).


Okay, here's a start. The attached script inserts random strings of  
1-10 space-delimited words into text and citext columns, and then  
compares the performance of queries with and without indexes. The  
output for me is as follows:


Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 254.254 ms
SELECT * FROM try WHERE citext = 'food';
Time: 288.535 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.385 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 236.186 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 235.818 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 1.260 ms
SELECT * FROM try WHERE citext = 'food';
Time: 277.755 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.073 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 238.430 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 238.685 ms
benedict%

So for some reason, after adding the indexes, the queries against  
the CITEXT column aren't using them. Furthermore, the `lower(text)  
LIKE lower(?)` query isn't using *its* index. Huh?


So this leaves me with two questions:

1. For what reason would the query against the citext column *not*  
use the index?


2. Is there some way to get the CITEXT index to behave like a  
LOWER() index, that is, so that its value is stored using the  
result of the str_tolower() function, thus removing some of the  
overhead of converting the values for each row fetched from the  
index? (Does this question make any sense?)


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler

Thanks to help from RhodiumToad on IRC, I got some things improved here:

On Jul 7, 2008, at 16:24, David E. Wheeler wrote:

So for some reason, after adding the indexes, the queries against  
the CITEXT column aren't using them. Furthermore, the `lower(text)  
LIKE lower(?)` query isn't using *its* index. Huh?


I never knew what one needed to use the text_pattern_ops operator  
class to index a column for use with LIKE! I had no clue. Would that  
work for a citext column, too, since it's essentially the same as TEXT?



So this leaves me with two questions:

1. For what reason would the query against the citext column *not*  
use the index?


It turns out that it did use the index if I put `SET enable_seqscan =  
false;` into my script. So with RhodiumToad's direction, I added some  
`RESTRICT` and `JOIN` clauses to my comparison operators (copying them  
from ip4r). So now I have:


CREATE OPERATOR = (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
COMMUTATOR = =,
NEGATOR= ,
PROCEDURE  = citext_eq,
RESTRICT   = eqsel,
JOIN   = eqjoinsel,
HASHES,
MERGES
);

CREATE OPERATOR  (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
NEGATOR= =,
COMMUTATOR = ,
PROCEDURE  = citext_ne,
RESTRICT   = neqsel,
JOIN   = neqjoinsel
);

CREATE OPERATOR  (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
NEGATOR= =,
COMMUTATOR = ,
PROCEDURE  = citext_lt,
RESTRICT   = scalarltsel,
JOIN   = scalarltjoinsel
);

CREATE OPERATOR = (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
NEGATOR= ,
COMMUTATOR = =,
PROCEDURE  = citext_le,
RESTRICT   = scalarltsel,
JOIN   = scalarltjoinsel
);

CREATE OPERATOR = (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
NEGATOR= ,
COMMUTATOR = =,
PROCEDURE  = citext_ge,
RESTRICT   = scalargtsel,
JOIN   = scalargtjoinsel
);

CREATE OPERATOR  (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
NEGATOR= =,
COMMUTATOR = ,
PROCEDURE  = citext_gt,
RESTRICT   = scalargtsel,
JOIN   = scalargtjoinsel
);

With this change, the index was used:

Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 261.295 ms
SELECT * FROM try WHERE citext = 'food';
Time: 289.304 ms
Time: 1228.961 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 2.018 ms
SELECT * FROM try WHERE citext = 'food';
Time: 0.788 ms

Seems to be faster than the LOWER() version, too, which makes me  
happy. The output from EXPLAIN ANALYZE:


try=# EXPLAIN ANALYZE SELECT * FROM try WHERE citext = 'food';

  QUERY PLAN
--
 Index Scan using idx_try_citext on try  (cost=0.00..8.31 rows=1  
width=119) (actual time=0.324..0.324 rows=0 loops=1)

   Index Cond: (citext = 'food'::citext)
 Total runtime: 0.377 ms
(3 rows)

try=# EXPLAIN ANALYZE SELECT * FROM try WHERE LOWER(text) =  
LOWER('food');


   QUERY PLAN

 Bitmap Heap Scan on try  (cost=28.17..1336.10 rows=500 width=119)  
(actual time=0.170..0.170 rows=0 loops=1)

   Recheck Cond: (lower(text) = 'food'::text)
   -  Bitmap Index Scan on idx_try_text  (cost=0.00..28.04 rows=500  
width=0) (actual time=0.168..0.168 rows=0 loops=1)

 Index Cond: (lower(text) = 'food'::text)
 Total runtime: 0.211 ms
(5 rows)

So my only other question related to this is:

* Are the above RESTRICT and JOIN functions the ones to use, or is  
there some way to make use of those used by the TEXT type that would  
be more appropriate?


2. Is there some way to get the CITEXT index to behave like a  
LOWER() index, that is, so that its value is stored using the result  
of the str_tolower() function, thus removing some of the overhead of  
converting the values for each row fetched from the index? (Does  
this question make any sense?)


Given the performance with an index, I think that this is moot, yes?  
There is, of course, much more overhead for a table scan.


Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PATCH: CITEXT 2.0 v2

2008-07-06 Thread David E . Wheeler

On Jun 27, 2008, at 18:22, David E. Wheeler wrote:

Please find attached a patch adding a locale-aware, case-insensitive  
text type, called citext, as a contrib module.


Here is a new version of the patch, with the following changes:

* Fixed formatting to be more like core.
* Added appropriate NEGATORs to operators.
* Removed NEGATOR from the || operator.
* Added hash index function and operator class.
* The = operator now supports HASHES and MERGES.
* citext_cmp and citextcmp both return int32.
* Changed // comments to /* comments */.
* Added test confirming láska'::citext  'laská'::citext.
* A few other organizational, formatting, and pasto fixes.
* Updated the FAQ entry on case-insensitive queries to recommend  
citext (it would, of course, need to be translated).


Stuff I was asked about but didn't change:

* citext_cmp() still uses varstr_cmp() instead of strncmp(). When I  
tried the latter, everything seemed to be equivalent.
* citext_smaller() and citext_larger() don't have memory leaks, says  
Tom, so I added no calls to PG_FREE_IF_COPY().


Thank you everyone for your feedback and suggestions!

Best,

David



citext2.patch.gz
Description: GNU Zip compressed data





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers