Re: [HACKERS] Latest on CITEXT 2.0

2008-07-01 Thread Marko Kreen
On 6/26/08, Tom Lane [EMAIL PROTECTED] wrote:
 David E. Wheeler [EMAIL PROTECTED] writes:
   Datum citext_ne (PG_FUNCTION_ARGS) {
// Fast path for different-length inputs. Okay for canonical
   equivalence?
if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))
PG_RETURN_BOOL( 1 );
PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 );
   }

 BTW, I don't think you can use that same-length optimization for
  citext.  There's no reason to think that upper/lowercase pairs will
  have the same length all the time in multibyte encodings.

What about this code in current str_tolower():

/* Output workspace cannot have more codes than input bytes */
workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));

Bug?

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-07-01 Thread Tom Lane
Marko Kreen [EMAIL PROTECTED] writes:
 On 6/26/08, Tom Lane [EMAIL PROTECTED] wrote:
 BTW, I don't think you can use that same-length optimization for
 citext.  There's no reason to think that upper/lowercase pairs will
 have the same length all the time in multibyte encodings.

 What about this code in current str_tolower():

 /* Output workspace cannot have more codes than input bytes */
 workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));

That's working with wchars, not bytes.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-07-01 Thread Marko Kreen
On 7/1/08, Tom Lane [EMAIL PROTECTED] wrote:
 Marko Kreen [EMAIL PROTECTED] writes:
   On 6/26/08, Tom Lane [EMAIL PROTECTED] wrote:

  BTW, I don't think you can use that same-length optimization for
   citext.  There's no reason to think that upper/lowercase pairs will
   have the same length all the time in multibyte encodings.

   What about this code in current str_tolower():

   /* Output workspace cannot have more codes than input bytes */
   workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));


 That's working with wchars, not bytes.

Ah, I missed the point of char2wchar() line.

I'm rather unfamiliar with various MB API-s, sorry.

There's another thing I'm probably missing: does current code handle
multi-wchar codepoints?  Or is it guaranteed they don't happen?
(Wasn't wchar_t usually 16bit value?)

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-07-01 Thread Bruce Momjian
Marko Kreen wrote:
 On 7/1/08, Tom Lane [EMAIL PROTECTED] wrote:
  Marko Kreen [EMAIL PROTECTED] writes:
On 6/26/08, Tom Lane [EMAIL PROTECTED] wrote:
 
   BTW, I don't think you can use that same-length optimization for
citext.  There's no reason to think that upper/lowercase pairs will
have the same length all the time in multibyte encodings.
 
What about this code in current str_tolower():
 
/* Output workspace cannot have more codes than input bytes */
workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));
 
 
  That's working with wchars, not bytes.
 
 Ah, I missed the point of char2wchar() line.
 
 I'm rather unfamiliar with various MB API-s, sorry.
 
 There's another thing I'm probably missing: does current code handle
 multi-wchar codepoints?  Or is it guaranteed they don't happen?
 (Wasn't wchar_t usually 16bit value?)

If you want a simple example of wide character use look at
oracle_compat.c::upper() which calls str_toupper() in CVS HEAD.

-- 
  Bruce Momjian  [EMAIL PROTECTED]http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-07-01 Thread Marko Kreen
On 7/1/08, Bruce Momjian [EMAIL PROTECTED] wrote:
 If you want a simple example of wide character use look at
  oracle_compat.c::upper() which calls str_toupper() in CVS HEAD.

ATM I'm looking at str_tolower/upper internal implementation.

They do:

  workspace[curr_char] = towlower(workspace[curr_char]);

where workspace is wchar_t but towlower() operates on wint_t.

Is such inconsistency ok?

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-07-01 Thread Tom Lane
Marko Kreen [EMAIL PROTECTED] writes:
 ATM I'm looking at str_tolower/upper internal implementation.
 They do:
   workspace[curr_char] = towlower(workspace[curr_char]);
 where workspace is wchar_t but towlower() operates on wint_t.

IIRC this is exactly comparable to the type situation for the
traditional ctype.h macros.  The reason is that they are defined
to accept EOF in addition to actual char (or wchar) values.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-07-01 Thread Tom Lane
Marko Kreen [EMAIL PROTECTED] writes:
 There's another thing I'm probably missing: does current code handle
 multi-wchar codepoints?  Or is it guaranteed they don't happen?

AFAIK we disallow multi-wchar situations (by rejecting the UTF8
combining codes).

 (Wasn't wchar_t usually 16bit value?)

Hmm.  It's unsigned int on my ancient HPUX box.  I think we could have a
problem on any machines whose mbstowcs doesn't support 4-byte UTF8
codes, though ... in particular, what about Windows?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-07-01 Thread Marko Kreen
On 7/1/08, Tom Lane [EMAIL PROTECTED] wrote:
 Marko Kreen [EMAIL PROTECTED] writes:
  ATM I'm looking at str_tolower/upper internal implementation.
   They do:
 workspace[curr_char] = towlower(workspace[curr_char]);
   where workspace is wchar_t but towlower() operates on wint_t.

 IIRC this is exactly comparable to the type situation for the
  traditional ctype.h macros.  The reason is that they are defined
  to accept EOF in addition to actual char (or wchar) values.

I read SUS v3 and there is no hint on multi-wchar anything,
so for unix systems you are right, wint_t == wchar_t.

Seems stories how Windows and Java operate have affected me too much.

Then I browsed MSDN:

  http://msdn.microsoft.com/en-us/library/dtxesf6k.aspx

and they seem to strongly hint that wchar_t == 16 bits and
UTF-16 is used internally.

Probably some Windows developer should look into it
and decide if there is a #ifdef WIN32 branch needed.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-27 Thread David E. Wheeler

On Jun 26, 2008, at 13:59, Tom Lane wrote:


David E. Wheeler [EMAIL PROTECTED] writes:

So, are your certain about this?


See Turkish --- in that locale i and I are not an upper/lower pair,
instead they pair with some non-ASCII letters.  There are likely
other cases but that's the counterexample I remember.


Perfect, thank you. I was able to add a failing test and make it pass  
by removing the string length optimization.


For future reference of anyone reading the list, this page has a great  
description of the problem:


  http://www.i18nguy.com/unicode/turkish-i18n.html

Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-26 Thread Martijn van Oosterhout
On Wed, Jun 25, 2008 at 11:47:39AM -0700, David E. Wheeler wrote:
 * There seem to still be some implicit CASTS to text that I'd like to  
 duplicate. For example,  select '192.168.1.2'::cidr::text;` works, but  
 `select '192.168.1.2'::cidr::citext;` does not. Where can I find the C  
 functions that do these casts for TEXT so that I can put them to work  
 for citext, too? The internal cast functions used in the old citext  
 distribution don't exist at all on 8.3.

Hmm, casts to/from text are somewhat magic in postgres. They are
implemented by calling the usual type input/output function. I have no
idea how to extend that to other types.

 * There are casts from text that I'd also like to harness for use by  
 citext, like `cidr(text)`. Where can I find these C functions as well?  
 (The upshot of this and the previous points is that I'd like citext to  
 be as compatible with TEXT as possible, and I just need to figure out  
 how to fill in the gaps in that compatibility.)

As above, they're probably not as seperate functions but a special hack
inthe casting code.

 * Regular expression and LIKE comparisons using the the operators  
 properly work case-insensitively, but functions like replace() and  
 regexp_replace() do not. Should they? and if so, how can I make them  
 do so?

Regexes have case-insensetive modifiers, don't they? In which case I
don't think it'd be becessary.

 * As for my C programming, well, what's broken? I'm especially  
 concerned that I pfree variables appropriately, but I'm not at all  
 clear on what needs to be freed. Martijn mentioned before that btree  
 comparison functions free memory, but I'm such a C n00b that I don't  
 know what that actually means for my implementation. I'd actually  
 appreciate a bit of pedantry here. :-)

When creating an index, your comparison functions are going ot be
called O(N log N) times. If they leak into a context that isn't
regularly freed you may have a problem. I'd suggest loking at how the
text comparisons do it. PG_FREE_IF_COPY() is probably a good idea
because the incoming tuples may be detoasted.

 * Am I in fact getting an appropriate nul-terminated string in my  
 cilower() function using this code?
 
 char * str  = DatumGetCString(
 DirectFunctionCall1( textout, PointerGetDatum( arg ) )
 );

Yes.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-26 Thread David E. Wheeler

On Jun 26, 2008, at 03:28, Martijn van Oosterhout wrote:


Hmm, casts to/from text are somewhat magic in postgres. They are
implemented by calling the usual type input/output function. I have no
idea how to extend that to other types.


Oh. Okay. Perhaps I won't worry about it just now, then.

As above, they're probably not as seperate functions but a special  
hack

inthe casting code.


Okay.


Regexes have case-insensetive modifiers, don't they? In which case I
don't think it'd be becessary.


They do, but replace(), split_part(), strpos(), and translate() do not.


When creating an index, your comparison functions are going ot be
called O(N log N) times. If they leak into a context that isn't
regularly freed you may have a problem. I'd suggest loking at how the
text comparisons do it. PG_FREE_IF_COPY() is probably a good idea
because the incoming tuples may be detoasted.


Okay. I'll have a look at varlena.c, then.


* Am I in fact getting an appropriate nul-terminated string in my
cilower() function using this code?

   char * str  = DatumGetCString(
   DirectFunctionCall1( textout, PointerGetDatum( arg ) )
   );


Yes.


Great, I thought so (since it made the failures go away). Many thanks.

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-26 Thread David E. Wheeler
Thanks a million for your answers, Martijn. I just have some more  
stupid questions, if you could bear with me.


On Jun 26, 2008, at 03:28, Martijn van Oosterhout wrote:


When creating an index, your comparison functions are going ot be
called O(N log N) times. If they leak into a context that isn't
regularly freed you may have a problem. I'd suggest loking at how the
text comparisons do it. PG_FREE_IF_COPY() is probably a good idea
because the incoming tuples may be detoasted.


Okay, I see that text_cmp in varlena doesn't use PG_FREE_IF_COPY(),  
and neither do text_smaller nor text_larger (which just dispatch to  
text_cmp anyway).


The operator functions *do* use PG_FREE_IF_COPY(). So I'm guessing  
it's these functions you're talking about. However, my implementation  
just looks like this:


Datum citext_ne (PG_FUNCTION_ARGS) {
// Fast path for different-length inputs. Okay for canonical  
equivalence?

if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))
PG_RETURN_BOOL( 1 );
PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 );
}

I don't *thinkI any variables are copied there. citextcmp() is just  
this:


int citextcmp (PG_FUNCTION_ARGS) {
// XXX These are all just references to existing structures, right?
text * left  = PG_GETARG_TEXT_P(0);
text * right = PG_GETARG_TEXT_P(1);
return varstr_cmp(
cilower( left ),
VARSIZE_ANY_EXHDR(left),
cilower( right ),
VARSIZE_ANY_EXHDR(right)
);
}

Again, no copying. cilower() does copy:

intindex, len;
char * result;

index  = 0;
len= VARSIZE(arg) - VARHDRSZ;
result = (char *) palloc( strlen( str ) + 1 );

for (index = 0; index = len; index++) {
result[index] = tolower((unsigned char) str[index] );
}
// XXX I don't need to pfree result if I'm returning it, right?
return result;

But the copied value is returned. Hrm…it should probably be pfreed  
somewhere, yes?


So I'm wondering if I should change citextcmp to pfree values?  
Something like this:


text * left  = PG_GETARG_TEXT_P(0);
text * right = PG_GETARG_TEXT_P(1);
char * lcstr = cilower( left  );
char * rcstr = cilower( right );

int result = varstr_cmp(
cilower( left ),
VARSIZE_ANY_EXHDR(left),
cilower( right ),
VARSIZE_ANY_EXHDR(right)
);

pfree( lcstr );
pfree( rcstr );
return result;

This is the only function that calls cilower(). And I *think* it's the  
only place where values are copied or memory is allocated needing to  
be freed. Does that sound right to you?


On a side note, I've implemented this pretty differently from how the  
text functions are implemented in varlena.c, just to try to keep  
things succinct. But I'm wondering now if I shouldn't switch back to  
the style used by varlena.c, if only to keep the style the same, and  
thus perhaps to increase the chances that citext would be a welcome  
contrib addition. Thoughts?


Many thanks again. You're a great help to this C n00b.

Best,

David
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-26 Thread Alvaro Herrera
David E. Wheeler wrote:

 The operator functions *do* use PG_FREE_IF_COPY(). So I'm guessing it's 
 these functions you're talking about. However, my implementation just 
 looks like this:

 Datum citext_ne (PG_FUNCTION_ARGS) {
 // Fast path for different-length inputs. Okay for canonical  
 equivalence?
 if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))
 PG_RETURN_BOOL( 1 );
 PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 );
 }

PG_GETARG_TEXT_P can detoast the datum, which creates a copy.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-26 Thread David E. Wheeler

On Jun 26, 2008, at 09:19, Alvaro Herrera wrote:


PG_GETARG_TEXT_P can detoast the datum, which creates a copy.


Thanks. I've just completely refactored things to look more like the  
approach taken by varlena.c, both in terms of when stuff gets freed  
and in terms of coding style. It's more verbose, but I feel much more  
comfortable with memory management now that I'm following a known  
implementation more closely. :-)


So now I've changed citextcmp to this:

static int
citextcmp (text * left, text * right)
{
char * lcstr, * rcstr;
intresult;

lcstr = cilower( left  );
rcstr = cilower( right );

result = varstr_cmp(
cilower( left ),
VARSIZE_ANY_EXHDR(left),
cilower( right ),
VARSIZE_ANY_EXHDR(right)
);

pfree( lcstr );
pfree( rcstr );
return result;
}

And now all of the operator functions are freeing memory using  
PG_FREE_IF_COPY() like this:


Datum
citext_cmp(PG_FUNCTION_ARGS)
{
text * left  = PG_GETARG_TEXT_PP(0);
text * right = PG_GETARG_TEXT_PP(1);
int32  result;

result = citextcmp(left, right);

PG_FREE_IF_COPY(left, 0);
PG_FREE_IF_COPY(right, 1);

PG_RETURN_INT32( result );
}


The only functions that don't do that are citext_smaller() and  
citext_larger():


Datum
citext_smaller(PG_FUNCTION_ARGS)
{
text * left  = PG_GETARG_TEXT_PP(0);
text * right = PG_GETARG_TEXT_PP(1);
text * result;

result = citextcmp(left, right)  0 ? left : right;
PG_RETURN_TEXT_P(result);
}

This is just how varlena.c does it, but I am wondering if something  
*should* be freed there.


Thanks a bunch!

Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-26 Thread Tom Lane
David E. Wheeler [EMAIL PROTECTED] writes:
 Datum citext_ne (PG_FUNCTION_ARGS) {
  // Fast path for different-length inputs. Okay for canonical  
 equivalence?
  if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))
  PG_RETURN_BOOL( 1 );
  PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 );
 }

BTW, I don't think you can use that same-length optimization for
citext.  There's no reason to think that upper/lowercase pairs will
have the same length all the time in multibyte encodings.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-26 Thread David E. Wheeler

On Jun 26, 2008, at 10:02, Tom Lane wrote:


BTW, I don't think you can use that same-length optimization for
citext.  There's no reason to think that upper/lowercase pairs will
have the same length all the time in multibyte encodings.


I was wondering about that. I had been thinking of canonically- 
equivalent stings and combining marks. Doing a quick test it looks  
like combining marks are not equivalent. For example, this returns  
false:


  SELECT 'Ä'::text = 'Ä'::text;

At least with en_US.UTF-8. Hrm. It looks like my client makes them  
both canonical, so I've attached a script demonstrating this issue.


Anyway, I was aware of different byte counts for canonical  
equivalence, but not for differences between upper- and lowercase  
characters. I'd certainly defer to your knowledge of how these things  
truly work in PostgreSQL, Tom, and can of course easily remove that  
optimization. So, are your certain about this?


Many thanks,

David



try.sql
Description: Binary data



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-26 Thread Josh Berkus
David,

 Thanks. I've just completely refactored things to look more like the
 approach taken by varlena.c, both in terms of when stuff gets freed
 and in terms of coding style. It's more verbose, but I feel much more
 comfortable with memory management now that I'm following a known
 implementation more closely. :-)

Will this be ready for the July CommitFest?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-26 Thread David E. Wheeler

On Jun 26, 2008, at 12:03, Josh Berkus wrote:


Will this be ready for the July CommitFest?


When is it due, July 1? If so, yes, it should be. I could use a close  
review by someone who knows this shit a whole lot better than I do.


Thanks,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-26 Thread Josh Berkus
David,

 When is it due, July 1? If so, yes, it should be. I could use a close
 review by someone who knows this shit a whole lot better than I do.

Well, that's what the commitfest is for.  Go ahead and add yourself once you 
post the new patch.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latest on CITEXT 2.0

2008-06-26 Thread Tom Lane
David E. Wheeler [EMAIL PROTECTED] writes:
 So, are your certain about this?

See Turkish --- in that locale i and I are not an upper/lower pair,
instead they pair with some non-ASCII letters.  There are likely
other cases but that's the counterexample I remember.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Latest on CITEXT 2.0

2008-06-25 Thread David E. Wheeler

Howdy,

I just wanted to report the latest on my pet project: implementing a  
new case-insensitive text type, citext, to be locale-aware and to  
build and run on PostgreSQL 8.3. I'm not much of a C programmer (this  
is only the second time I've written *anything* in C), so I also have  
a few questions about my code, best practices, coverage, etc. You can  
grab the latest here:


  https://svn.kineticode.com/citext/trunk/

BTW, the tests in sql/citext.sql use the pgtap.sql file to run TAP  
regression tests. So you can run them using `make installcheck` or  
`make test`. The latter requires that pg_prove be installed; you can  
get it here:


  https://svn.kineticode.com/pgtap/trunk/

Anyway, I think I've got it pretty close to done. The tests cover a  
lot of stuff -- nearly everything I could figure out, anyway. But  
there are a few gaps.


As a result, I'd appreciate a little help with these questions, all in  
the name of making this a solid data type suitable for use on  
production systems:


* There seem to still be some implicit CASTS to text that I'd like to  
duplicate. For example,  select '192.168.1.2'::cidr::text;` works, but  
`select '192.168.1.2'::cidr::citext;` does not. Where can I find the C  
functions that do these casts for TEXT so that I can put them to work  
for citext, too? The internal cast functions used in the old citext  
distribution don't exist at all on 8.3.


* There are casts from text that I'd also like to harness for use by  
citext, like `cidr(text)`. Where can I find these C functions as well?  
(The upshot of this and the previous points is that I'd like citext to  
be as compatible with TEXT as possible, and I just need to figure out  
how to fill in the gaps in that compatibility.)


* Regular expression and LIKE comparisons using the the operators  
properly work case-insensitively, but functions like replace() and  
regexp_replace() do not. Should they? and if so, how can I make them  
do so?


* The tests assume that LC_COLLATE is set to en_US.UTF-8. Does that  
work well for standard PostgreSQL regression tests? How are locale- 
sensitive tests run in core regression tests?


* As for my C programming, well, what's broken? I'm especially  
concerned that I pfree variables appropriately, but I'm not at all  
clear on what needs to be freed. Martijn mentioned before that btree  
comparison functions free memory, but I'm such a C n00b that I don't  
know what that actually means for my implementation. I'd actually  
appreciate a bit of pedantry here. :-)


* Am I in fact getting an appropriate nul-terminated string in my  
cilower() function using this code?


char * str  = DatumGetCString(
DirectFunctionCall1( textout, PointerGetDatum( arg ) )
);

Those are all the questions I had about my implementation. I'd like to  
get this thing done and released soon, so that I can be done with this  
particular Yak and get back to what I'm *supposed* to be doing with my  
time.


BTW, would there be any interest in this code going into contrib/ in  
the distribution? I think that, if we can ensure that it works just  
like LOWER() = LOWER(), but without requiring that code, then it would  
be a great type to point people to to use instead of that SQL hack  
(with all the usual caveats about it being locale-sensitive and not  
canonically case-insensitive in the Unicode sense). If so, I'd be  
happy to make whatever changes are necessary to make it fit in with  
the coding and organization standards of the core and to submit it.


But please, don't expect a civarchar type from me anytime soon. ;-)

Many thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers