Re: [HACKERS] Latest on CITEXT 2.0
On 6/26/08, Tom Lane [EMAIL PROTECTED] wrote: David E. Wheeler [EMAIL PROTECTED] writes: Datum citext_ne (PG_FUNCTION_ARGS) { // Fast path for different-length inputs. Okay for canonical equivalence? if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1))) PG_RETURN_BOOL( 1 ); PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 ); } BTW, I don't think you can use that same-length optimization for citext. There's no reason to think that upper/lowercase pairs will have the same length all the time in multibyte encodings. What about this code in current str_tolower(): /* Output workspace cannot have more codes than input bytes */ workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t)); Bug? -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
Marko Kreen [EMAIL PROTECTED] writes: On 6/26/08, Tom Lane [EMAIL PROTECTED] wrote: BTW, I don't think you can use that same-length optimization for citext. There's no reason to think that upper/lowercase pairs will have the same length all the time in multibyte encodings. What about this code in current str_tolower(): /* Output workspace cannot have more codes than input bytes */ workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t)); That's working with wchars, not bytes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
On 7/1/08, Tom Lane [EMAIL PROTECTED] wrote: Marko Kreen [EMAIL PROTECTED] writes: On 6/26/08, Tom Lane [EMAIL PROTECTED] wrote: BTW, I don't think you can use that same-length optimization for citext. There's no reason to think that upper/lowercase pairs will have the same length all the time in multibyte encodings. What about this code in current str_tolower(): /* Output workspace cannot have more codes than input bytes */ workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t)); That's working with wchars, not bytes. Ah, I missed the point of char2wchar() line. I'm rather unfamiliar with various MB API-s, sorry. There's another thing I'm probably missing: does current code handle multi-wchar codepoints? Or is it guaranteed they don't happen? (Wasn't wchar_t usually 16bit value?) -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
Marko Kreen wrote: On 7/1/08, Tom Lane [EMAIL PROTECTED] wrote: Marko Kreen [EMAIL PROTECTED] writes: On 6/26/08, Tom Lane [EMAIL PROTECTED] wrote: BTW, I don't think you can use that same-length optimization for citext. There's no reason to think that upper/lowercase pairs will have the same length all the time in multibyte encodings. What about this code in current str_tolower(): /* Output workspace cannot have more codes than input bytes */ workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t)); That's working with wchars, not bytes. Ah, I missed the point of char2wchar() line. I'm rather unfamiliar with various MB API-s, sorry. There's another thing I'm probably missing: does current code handle multi-wchar codepoints? Or is it guaranteed they don't happen? (Wasn't wchar_t usually 16bit value?) If you want a simple example of wide character use look at oracle_compat.c::upper() which calls str_toupper() in CVS HEAD. -- Bruce Momjian [EMAIL PROTECTED]http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
On 7/1/08, Bruce Momjian [EMAIL PROTECTED] wrote: If you want a simple example of wide character use look at oracle_compat.c::upper() which calls str_toupper() in CVS HEAD. ATM I'm looking at str_tolower/upper internal implementation. They do: workspace[curr_char] = towlower(workspace[curr_char]); where workspace is wchar_t but towlower() operates on wint_t. Is such inconsistency ok? -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
Marko Kreen [EMAIL PROTECTED] writes: ATM I'm looking at str_tolower/upper internal implementation. They do: workspace[curr_char] = towlower(workspace[curr_char]); where workspace is wchar_t but towlower() operates on wint_t. IIRC this is exactly comparable to the type situation for the traditional ctype.h macros. The reason is that they are defined to accept EOF in addition to actual char (or wchar) values. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
Marko Kreen [EMAIL PROTECTED] writes: There's another thing I'm probably missing: does current code handle multi-wchar codepoints? Or is it guaranteed they don't happen? AFAIK we disallow multi-wchar situations (by rejecting the UTF8 combining codes). (Wasn't wchar_t usually 16bit value?) Hmm. It's unsigned int on my ancient HPUX box. I think we could have a problem on any machines whose mbstowcs doesn't support 4-byte UTF8 codes, though ... in particular, what about Windows? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
On 7/1/08, Tom Lane [EMAIL PROTECTED] wrote: Marko Kreen [EMAIL PROTECTED] writes: ATM I'm looking at str_tolower/upper internal implementation. They do: workspace[curr_char] = towlower(workspace[curr_char]); where workspace is wchar_t but towlower() operates on wint_t. IIRC this is exactly comparable to the type situation for the traditional ctype.h macros. The reason is that they are defined to accept EOF in addition to actual char (or wchar) values. I read SUS v3 and there is no hint on multi-wchar anything, so for unix systems you are right, wint_t == wchar_t. Seems stories how Windows and Java operate have affected me too much. Then I browsed MSDN: http://msdn.microsoft.com/en-us/library/dtxesf6k.aspx and they seem to strongly hint that wchar_t == 16 bits and UTF-16 is used internally. Probably some Windows developer should look into it and decide if there is a #ifdef WIN32 branch needed. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
On Jun 26, 2008, at 13:59, Tom Lane wrote: David E. Wheeler [EMAIL PROTECTED] writes: So, are your certain about this? See Turkish --- in that locale i and I are not an upper/lower pair, instead they pair with some non-ASCII letters. There are likely other cases but that's the counterexample I remember. Perfect, thank you. I was able to add a failing test and make it pass by removing the string length optimization. For future reference of anyone reading the list, this page has a great description of the problem: http://www.i18nguy.com/unicode/turkish-i18n.html Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
On Wed, Jun 25, 2008 at 11:47:39AM -0700, David E. Wheeler wrote: * There seem to still be some implicit CASTS to text that I'd like to duplicate. For example, select '192.168.1.2'::cidr::text;` works, but `select '192.168.1.2'::cidr::citext;` does not. Where can I find the C functions that do these casts for TEXT so that I can put them to work for citext, too? The internal cast functions used in the old citext distribution don't exist at all on 8.3. Hmm, casts to/from text are somewhat magic in postgres. They are implemented by calling the usual type input/output function. I have no idea how to extend that to other types. * There are casts from text that I'd also like to harness for use by citext, like `cidr(text)`. Where can I find these C functions as well? (The upshot of this and the previous points is that I'd like citext to be as compatible with TEXT as possible, and I just need to figure out how to fill in the gaps in that compatibility.) As above, they're probably not as seperate functions but a special hack inthe casting code. * Regular expression and LIKE comparisons using the the operators properly work case-insensitively, but functions like replace() and regexp_replace() do not. Should they? and if so, how can I make them do so? Regexes have case-insensetive modifiers, don't they? In which case I don't think it'd be becessary. * As for my C programming, well, what's broken? I'm especially concerned that I pfree variables appropriately, but I'm not at all clear on what needs to be freed. Martijn mentioned before that btree comparison functions free memory, but I'm such a C n00b that I don't know what that actually means for my implementation. I'd actually appreciate a bit of pedantry here. :-) When creating an index, your comparison functions are going ot be called O(N log N) times. If they leak into a context that isn't regularly freed you may have a problem. I'd suggest loking at how the text comparisons do it. PG_FREE_IF_COPY() is probably a good idea because the incoming tuples may be detoasted. * Am I in fact getting an appropriate nul-terminated string in my cilower() function using this code? char * str = DatumGetCString( DirectFunctionCall1( textout, PointerGetDatum( arg ) ) ); Yes. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] Latest on CITEXT 2.0
On Jun 26, 2008, at 03:28, Martijn van Oosterhout wrote: Hmm, casts to/from text are somewhat magic in postgres. They are implemented by calling the usual type input/output function. I have no idea how to extend that to other types. Oh. Okay. Perhaps I won't worry about it just now, then. As above, they're probably not as seperate functions but a special hack inthe casting code. Okay. Regexes have case-insensetive modifiers, don't they? In which case I don't think it'd be becessary. They do, but replace(), split_part(), strpos(), and translate() do not. When creating an index, your comparison functions are going ot be called O(N log N) times. If they leak into a context that isn't regularly freed you may have a problem. I'd suggest loking at how the text comparisons do it. PG_FREE_IF_COPY() is probably a good idea because the incoming tuples may be detoasted. Okay. I'll have a look at varlena.c, then. * Am I in fact getting an appropriate nul-terminated string in my cilower() function using this code? char * str = DatumGetCString( DirectFunctionCall1( textout, PointerGetDatum( arg ) ) ); Yes. Great, I thought so (since it made the failures go away). Many thanks. David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
Thanks a million for your answers, Martijn. I just have some more stupid questions, if you could bear with me. On Jun 26, 2008, at 03:28, Martijn van Oosterhout wrote: When creating an index, your comparison functions are going ot be called O(N log N) times. If they leak into a context that isn't regularly freed you may have a problem. I'd suggest loking at how the text comparisons do it. PG_FREE_IF_COPY() is probably a good idea because the incoming tuples may be detoasted. Okay, I see that text_cmp in varlena doesn't use PG_FREE_IF_COPY(), and neither do text_smaller nor text_larger (which just dispatch to text_cmp anyway). The operator functions *do* use PG_FREE_IF_COPY(). So I'm guessing it's these functions you're talking about. However, my implementation just looks like this: Datum citext_ne (PG_FUNCTION_ARGS) { // Fast path for different-length inputs. Okay for canonical equivalence? if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1))) PG_RETURN_BOOL( 1 ); PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 ); } I don't *thinkI any variables are copied there. citextcmp() is just this: int citextcmp (PG_FUNCTION_ARGS) { // XXX These are all just references to existing structures, right? text * left = PG_GETARG_TEXT_P(0); text * right = PG_GETARG_TEXT_P(1); return varstr_cmp( cilower( left ), VARSIZE_ANY_EXHDR(left), cilower( right ), VARSIZE_ANY_EXHDR(right) ); } Again, no copying. cilower() does copy: intindex, len; char * result; index = 0; len= VARSIZE(arg) - VARHDRSZ; result = (char *) palloc( strlen( str ) + 1 ); for (index = 0; index = len; index++) { result[index] = tolower((unsigned char) str[index] ); } // XXX I don't need to pfree result if I'm returning it, right? return result; But the copied value is returned. Hrm…it should probably be pfreed somewhere, yes? So I'm wondering if I should change citextcmp to pfree values? Something like this: text * left = PG_GETARG_TEXT_P(0); text * right = PG_GETARG_TEXT_P(1); char * lcstr = cilower( left ); char * rcstr = cilower( right ); int result = varstr_cmp( cilower( left ), VARSIZE_ANY_EXHDR(left), cilower( right ), VARSIZE_ANY_EXHDR(right) ); pfree( lcstr ); pfree( rcstr ); return result; This is the only function that calls cilower(). And I *think* it's the only place where values are copied or memory is allocated needing to be freed. Does that sound right to you? On a side note, I've implemented this pretty differently from how the text functions are implemented in varlena.c, just to try to keep things succinct. But I'm wondering now if I shouldn't switch back to the style used by varlena.c, if only to keep the style the same, and thus perhaps to increase the chances that citext would be a welcome contrib addition. Thoughts? Many thanks again. You're a great help to this C n00b. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
David E. Wheeler wrote: The operator functions *do* use PG_FREE_IF_COPY(). So I'm guessing it's these functions you're talking about. However, my implementation just looks like this: Datum citext_ne (PG_FUNCTION_ARGS) { // Fast path for different-length inputs. Okay for canonical equivalence? if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1))) PG_RETURN_BOOL( 1 ); PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 ); } PG_GETARG_TEXT_P can detoast the datum, which creates a copy. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
On Jun 26, 2008, at 09:19, Alvaro Herrera wrote: PG_GETARG_TEXT_P can detoast the datum, which creates a copy. Thanks. I've just completely refactored things to look more like the approach taken by varlena.c, both in terms of when stuff gets freed and in terms of coding style. It's more verbose, but I feel much more comfortable with memory management now that I'm following a known implementation more closely. :-) So now I've changed citextcmp to this: static int citextcmp (text * left, text * right) { char * lcstr, * rcstr; intresult; lcstr = cilower( left ); rcstr = cilower( right ); result = varstr_cmp( cilower( left ), VARSIZE_ANY_EXHDR(left), cilower( right ), VARSIZE_ANY_EXHDR(right) ); pfree( lcstr ); pfree( rcstr ); return result; } And now all of the operator functions are freeing memory using PG_FREE_IF_COPY() like this: Datum citext_cmp(PG_FUNCTION_ARGS) { text * left = PG_GETARG_TEXT_PP(0); text * right = PG_GETARG_TEXT_PP(1); int32 result; result = citextcmp(left, right); PG_FREE_IF_COPY(left, 0); PG_FREE_IF_COPY(right, 1); PG_RETURN_INT32( result ); } The only functions that don't do that are citext_smaller() and citext_larger(): Datum citext_smaller(PG_FUNCTION_ARGS) { text * left = PG_GETARG_TEXT_PP(0); text * right = PG_GETARG_TEXT_PP(1); text * result; result = citextcmp(left, right) 0 ? left : right; PG_RETURN_TEXT_P(result); } This is just how varlena.c does it, but I am wondering if something *should* be freed there. Thanks a bunch! Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
David E. Wheeler [EMAIL PROTECTED] writes: Datum citext_ne (PG_FUNCTION_ARGS) { // Fast path for different-length inputs. Okay for canonical equivalence? if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1))) PG_RETURN_BOOL( 1 ); PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 ); } BTW, I don't think you can use that same-length optimization for citext. There's no reason to think that upper/lowercase pairs will have the same length all the time in multibyte encodings. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
On Jun 26, 2008, at 10:02, Tom Lane wrote: BTW, I don't think you can use that same-length optimization for citext. There's no reason to think that upper/lowercase pairs will have the same length all the time in multibyte encodings. I was wondering about that. I had been thinking of canonically- equivalent stings and combining marks. Doing a quick test it looks like combining marks are not equivalent. For example, this returns false: SELECT 'Ä'::text = 'Ä'::text; At least with en_US.UTF-8. Hrm. It looks like my client makes them both canonical, so I've attached a script demonstrating this issue. Anyway, I was aware of different byte counts for canonical equivalence, but not for differences between upper- and lowercase characters. I'd certainly defer to your knowledge of how these things truly work in PostgreSQL, Tom, and can of course easily remove that optimization. So, are your certain about this? Many thanks, David try.sql Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
David, Thanks. I've just completely refactored things to look more like the approach taken by varlena.c, both in terms of when stuff gets freed and in terms of coding style. It's more verbose, but I feel much more comfortable with memory management now that I'm following a known implementation more closely. :-) Will this be ready for the July CommitFest? -- Josh Berkus PostgreSQL @ Sun San Francisco -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
On Jun 26, 2008, at 12:03, Josh Berkus wrote: Will this be ready for the July CommitFest? When is it due, July 1? If so, yes, it should be. I could use a close review by someone who knows this shit a whole lot better than I do. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
David, When is it due, July 1? If so, yes, it should be. I could use a close review by someone who knows this shit a whole lot better than I do. Well, that's what the commitfest is for. Go ahead and add yourself once you post the new patch. -- Josh Berkus PostgreSQL @ Sun San Francisco -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latest on CITEXT 2.0
David E. Wheeler [EMAIL PROTECTED] writes: So, are your certain about this? See Turkish --- in that locale i and I are not an upper/lower pair, instead they pair with some non-ASCII letters. There are likely other cases but that's the counterexample I remember. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Latest on CITEXT 2.0
Howdy, I just wanted to report the latest on my pet project: implementing a new case-insensitive text type, citext, to be locale-aware and to build and run on PostgreSQL 8.3. I'm not much of a C programmer (this is only the second time I've written *anything* in C), so I also have a few questions about my code, best practices, coverage, etc. You can grab the latest here: https://svn.kineticode.com/citext/trunk/ BTW, the tests in sql/citext.sql use the pgtap.sql file to run TAP regression tests. So you can run them using `make installcheck` or `make test`. The latter requires that pg_prove be installed; you can get it here: https://svn.kineticode.com/pgtap/trunk/ Anyway, I think I've got it pretty close to done. The tests cover a lot of stuff -- nearly everything I could figure out, anyway. But there are a few gaps. As a result, I'd appreciate a little help with these questions, all in the name of making this a solid data type suitable for use on production systems: * There seem to still be some implicit CASTS to text that I'd like to duplicate. For example, select '192.168.1.2'::cidr::text;` works, but `select '192.168.1.2'::cidr::citext;` does not. Where can I find the C functions that do these casts for TEXT so that I can put them to work for citext, too? The internal cast functions used in the old citext distribution don't exist at all on 8.3. * There are casts from text that I'd also like to harness for use by citext, like `cidr(text)`. Where can I find these C functions as well? (The upshot of this and the previous points is that I'd like citext to be as compatible with TEXT as possible, and I just need to figure out how to fill in the gaps in that compatibility.) * Regular expression and LIKE comparisons using the the operators properly work case-insensitively, but functions like replace() and regexp_replace() do not. Should they? and if so, how can I make them do so? * The tests assume that LC_COLLATE is set to en_US.UTF-8. Does that work well for standard PostgreSQL regression tests? How are locale- sensitive tests run in core regression tests? * As for my C programming, well, what's broken? I'm especially concerned that I pfree variables appropriately, but I'm not at all clear on what needs to be freed. Martijn mentioned before that btree comparison functions free memory, but I'm such a C n00b that I don't know what that actually means for my implementation. I'd actually appreciate a bit of pedantry here. :-) * Am I in fact getting an appropriate nul-terminated string in my cilower() function using this code? char * str = DatumGetCString( DirectFunctionCall1( textout, PointerGetDatum( arg ) ) ); Those are all the questions I had about my implementation. I'd like to get this thing done and released soon, so that I can be done with this particular Yak and get back to what I'm *supposed* to be doing with my time. BTW, would there be any interest in this code going into contrib/ in the distribution? I think that, if we can ensure that it works just like LOWER() = LOWER(), but without requiring that code, then it would be a great type to point people to to use instead of that SQL hack (with all the usual caveats about it being locale-sensitive and not canonically case-insensitive in the Unicode sense). If so, I'd be happy to make whatever changes are necessary to make it fit in with the coding and organization standards of the core and to submit it. But please, don't expect a civarchar type from me anytime soon. ;-) Many thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers