Re: [HACKERS] Collation rules and multi-lingual databases
Stephan Szabo [EMAIL PROTECTED] writes: On Fri, 22 Aug 2003, Tom Lane wrote: I'd go so far as to make it a critical section --- that ensures that any ERROR will be turned to FATAL, even if it's in a subroutine you call. I didn't know we could do that, could be handy, although the comments imply that it turns into PANIC which would force a complete restart. Right, my imprecision, it actually goes to PANIC. Then again, it's better than a corrupted database. Exactly. regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Collation rules and multi-lingual databases
Greg Stark [EMAIL PROTECTED] writes: The glibc docs sample code suggests using 2x the original string length for the initial buffer. My testing showed that *always* triggered the exceptional case. A bit of experimentation lead to the 3x+4 which eliminates it except for 0 and 1 byte strings. I'm still tweaking it. But on another OS, or in a more complex collation locale maybe you would still trigger it a lot. On HPUX it seems you always need 4x. Also, *there are bugs* in some platforms' implementations of strxfrm, such that an undersized buffer may get overrun anyway. I had originally tried to optimize the buffer size like this in src/backend/utils/adt/selfuncs.c's use of strxfrm, and eventually was forced to give it up as hopeless. I strongly suggest using the same code now seen there: char *xfrmstr; size_txfrmlen; size_txfrmlen2; /* * Note: originally we guessed at a suitable output buffer size, * and only needed to call strxfrm twice if our guess was too * small. However, it seems that some versions of Solaris have * buggy strxfrm that can write past the specified buffer length * in that scenario. So, do it the dumb way for portability. * * Yet other systems (e.g., glibc) sometimes return a smaller value * from the second call than the first; thus the Assert must be = * not == as you'd expect. Can't any of these people program * their way out of a paper bag? */ xfrmlen = strxfrm(NULL, val, 0); xfrmstr = (char *) palloc(xfrmlen + 1); xfrmlen2 = strxfrm(xfrmstr, val, xfrmlen + 1); Assert(xfrmlen2 = xfrmlen); regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Collation rules and multi-lingual databases
On Fri, 22 Aug 2003, Stephan Szabo wrote: On Fri, 22 Aug 2003, Tom Lane wrote: Stephan Szabo [EMAIL PROTECTED] writes: On 22 Aug 2003, Greg Stark wrote: If it's deemed a reasonable approach and nobody has any fatal flaws then I expect it would be useful to put in the contrib directory? I'm not sure that ERROR if the locale cannot be put back is sufficient (although that case should be rare or non-existant). A bigger risk is that something might elog(ERROR) while you have the wrong locale set, denying you the chance to put back the right one. I think this code is not nearly paranoid enough about how much it does while the wrong locale is set. True, there are calls to palloc, elog, etc inside there, although the elog could be removed. Since most of that work is for an exceptional case, maybe it'd be safer (although slower) to structure the function as setlocale call strxfrm (and that's it) setlocale back if there wasn't enough space make a new buffer setlocale call strxfrm (and that's it) setlocale back Probably putting the sl/strxfrm/sl into its own function. ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Collation rules and multi-lingual databases
Stephan Szabo [EMAIL PROTECTED] writes: Since most of that work is for an exceptional case, maybe it'd be safer (although slower) to structure the function as Yeah I thought of that. But if making it a critical section is cheap then it's probably a better approach. The problem with restoring the locale for the palloc is that if the user is unlucky he might sort a table of thousands of strings that all trigger the exception case. The glibc docs sample code suggests using 2x the original string length for the initial buffer. My testing showed that *always* triggered the exceptional case. A bit of experimentation lead to the 3x+4 which eliminates it except for 0 and 1 byte strings. I'm still tweaking it. But on another OS, or in a more complex collation locale maybe you would still trigger it a lot. Even as it is if you happy to try to sort a large list of single character strings you would trigger it a lot. I have some documentation reading to do apparently before I can fix this up. setlocale call strxfrm (and that's it) setlocale back if there wasn't enough space make a new buffer setlocale call strxfrm (and that's it) setlocale back Probably putting the sl/strxfrm/sl into its own function. -- greg ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Collation rules and multi-lingual databases
On 23 Aug 2003, Greg Stark wrote: Stephan Szabo [EMAIL PROTECTED] writes: Since most of that work is for an exceptional case, maybe it'd be safer (although slower) to structure the function as Yeah I thought of that. But if making it a critical section is cheap then it's probably a better approach. The problem with restoring the locale for the palloc is that if the user is unlucky he might sort a table of thousands of strings that all trigger the exception case. True. I still worry about the critical section since an error will cause the entire database system to restart, but as I'd also forgotten about signals which might cause problems (uncertain without looking) I'm not sure you have a good alternative even not counting the speed issues. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Collation rules and multi-lingual databases
Greg Stark wrote: Yeah I thought of that. But if making it a critical section is cheap then it's probably a better approach. The problem with restoring the locale for the palloc is that if the user is unlucky he might sort a table of thousands of strings that all trigger the exception case. What about something like this? 8 #include setjmp.h #include string.h #include postgres.h #include fmgr.h #include tcop/tcopprot.h #include utils/builtins.h #define GET_STR(textp) \ DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp))) #define GET_BYTEA(str_) \ DatumGetTextP(DirectFunctionCall1(byteain, CStringGetDatum(str_))) #define MAX_BYTEA_LEN 0x3fff /* * pg_strxfrm - Function to convert string similar to the strxfrm C * function using a specified locale. */ extern Datum pg_strxfrm(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(pg_strxfrm); Datum pg_strxfrm(PG_FUNCTION_ARGS) { char *str = GET_STR(PG_GETARG_TEXT_P(0)); size_t str_len = strlen(str); char *localestr = GET_STR(PG_GETARG_TEXT_P(1)); size_t approx_trans_len = 4 + (str_len * 3); char *trans = (char *) palloc(approx_trans_len + 1); size_t actual_trans_len; char *oldlocale; char *newlocale; sigjmp_buf save_restart; if (approx_trans_len MAX_BYTEA_LEN) elog(ERROR, source string too long to transform); oldlocale = setlocale(LC_COLLATE, NULL); if (!oldlocale) elog(ERROR, setlocale failed to return a locale); /* catch elog while locale is set other than the default */ memcpy(save_restart, Warn_restart, sizeof(save_restart)); if (sigsetjmp(Warn_restart, 1) != 0) { memcpy(Warn_restart, save_restart, sizeof(Warn_restart)); newlocale = setlocale(LC_COLLATE, oldlocale); if (!newlocale) elog(PANIC, setlocale failed to reset locale: %s, localestr); siglongjmp(Warn_restart, 1); } newlocale = setlocale(LC_COLLATE, localestr); if (!newlocale) elog(ERROR, setlocale failed to set a locale: %s, localestr); actual_trans_len = strxfrm(trans, str, approx_trans_len + 1); /* if the buffer was not large enough, resize it and try again */ if (actual_trans_len = approx_trans_len) { approx_trans_len = actual_trans_len + 1; if (approx_trans_len MAX_BYTEA_LEN) elog(ERROR, source string too long to transform); trans = (char *) repalloc(trans, approx_trans_len + 1); actual_trans_len = strxfrm(trans, str, approx_trans_len + 1); /* if the buffer still not large enough, punt */ if (actual_trans_len = approx_trans_len) elog(ERROR, strxfrm failed, buffer insufficient); } newlocale = setlocale(LC_COLLATE, oldlocale); if (!newlocale) elog(PANIC, setlocale failed to reset locale: %s, localestr); PG_RETURN_BYTEA_P(GET_BYTEA(trans)); } 8 Joe ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Collation rules and multi-lingual databases
Joe Conway wrote: What about something like this? Oops! Forgot to restrore error handling. See below: Joe 8 #include setjmp.h #include string.h #include postgres.h #include fmgr.h #include tcop/tcopprot.h #include utils/builtins.h #define GET_STR(textp) \ DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp))) #define GET_BYTEA(str_) \ DatumGetTextP(DirectFunctionCall1(byteain, CStringGetDatum(str_))) #define MAX_BYTEA_LEN0x3fff /* * pg_strxfrm - Function to convert string similar to the strxfrm C * function using a specified locale. */ extern Datum pg_strxfrm(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(pg_strxfrm); Datum pg_strxfrm(PG_FUNCTION_ARGS) { char *str = GET_STR(PG_GETARG_TEXT_P(0)); size_t str_len = strlen(str); char *localestr = GET_STR(PG_GETARG_TEXT_P(1)); size_t approx_trans_len = 4 + (str_len * 3); char *trans = (char *) palloc(approx_trans_len + 1); size_t actual_trans_len; char *oldlocale; char *newlocale; sigjmp_buf save_restart; if (approx_trans_len MAX_BYTEA_LEN) elog(ERROR, source string too long to transform); oldlocale = setlocale(LC_COLLATE, NULL); if (!oldlocale) elog(ERROR, setlocale failed to return a locale); /* catch elog while locale is set other than the default */ memcpy(save_restart, Warn_restart, sizeof(save_restart)); if (sigsetjmp(Warn_restart, 1) != 0) { memcpy(Warn_restart, save_restart, sizeof(Warn_restart)); newlocale = setlocale(LC_COLLATE, oldlocale); if (!newlocale) elog(PANIC, setlocale failed to reset locale: %s, localestr); siglongjmp(Warn_restart, 1); } newlocale = setlocale(LC_COLLATE, localestr); if (!newlocale) elog(ERROR, setlocale failed to set a locale: %s, localestr); actual_trans_len = strxfrm(trans, str, approx_trans_len + 1); /* if the buffer was not large enough, resize it and try again */ if (actual_trans_len = approx_trans_len) { approx_trans_len = actual_trans_len + 1; if (approx_trans_len MAX_BYTEA_LEN) elog(ERROR, source string too long to transform); trans = (char *) repalloc(trans, approx_trans_len + 1); actual_trans_len = strxfrm(trans, str, approx_trans_len + 1); /* if the buffer still not large enough, punt */ if (actual_trans_len = approx_trans_len) elog(ERROR, strxfrm failed, buffer insufficient); } newlocale = setlocale(LC_COLLATE, oldlocale); if (!newlocale) elog(PANIC, setlocale failed to reset locale: %s, localestr); /* restore normal error handling */ memcpy(Warn_restart, save_restart, sizeof(Warn_restart)); PG_RETURN_BYTEA_P(GET_BYTEA(trans)); } 8 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Collation rules and multi-lingual databases
Joe Conway [EMAIL PROTECTED] writes: if (sigsetjmp(Warn_restart, 1) != 0) { memcpy(Warn_restart, save_restart, sizeof(Warn_restart)); newlocale = setlocale(LC_COLLATE, oldlocale); if (!newlocale) elog(PANIC, setlocale failed to reset locale: %s, localestr); siglongjmp(Warn_restart, 1); } Well presumably we want FATAL not PANIC. And do we still need HOLD_INTERRUPTS() .. RESUME_INTERRUPTS() ? I was afraid that was getting into bed too much with the error handling. I have an implementation that restores the locale around the palloc and increases the initial guess for future calls to avoid degenerate behaviour. I'm not sure which approach is preferable. -- greg ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Collation rules and multi-lingual databases
Greg Stark wrote: Joe Conway [EMAIL PROTECTED] writes: if (sigsetjmp(Warn_restart, 1) != 0) { memcpy(Warn_restart, save_restart, sizeof(Warn_restart)); newlocale = setlocale(LC_COLLATE, oldlocale); if (!newlocale) elog(PANIC, setlocale failed to reset locale: %s, localestr); siglongjmp(Warn_restart, 1); } Well presumably we want FATAL not PANIC. Yeah, that was a bit overzealous. I really intended FATAL. And do we still need HOLD_INTERRUPTS() .. RESUME_INTERRUPTS() ? I'm not sure, but I think not. I was afraid that was getting into bed too much with the error handling. I have an implementation that restores the locale around the palloc and increases the initial guess for future calls to avoid degenerate behaviour. Well the intention of the sigsetjmp was to avoid the need to flip the locale to-and-fro. Increasing the initial guess might be good, but it will further restrict the length of the input string you can work with. But I'd guess you'll not want to use this with extremely long strings anyway. Joe ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Collation rules and multi-lingual databases
So, I needed a way to sort using collation rules other than the one the database was built with. So I wrote up the following function exposing strxfrm with an extra parameter to specify the LC_COLLATE value to use. This is my first C function so I'm really unsure that I've done the right thing. For the most part I pattern-matched off the string_io code in the contrib directory. In particular I'm unsure about the code postgres-interfacing code in c_varcharxfrm which makes an extra copy of both parameters that are passed in and an extra copy of the result value. Are varchars guaranteed to be nul-terminated? If so I can dispose of two of the copies. And I can probably eliminate the copying of the result by alloting extra space when I allocate it initially. But more generally. Would it make more sense to use text or bytea or something else to store these opaque binary strings? At least with glibc they tend to be unreadable anyways. Other caveats: It's condemned to be permanently non-threadsafe because the whole locale system is a non thread-safe API. Also I fear some systems will leak memory like a sieve when you call setlocale a few thousand times instead of the 1 time at initialization that they foresaw. At least glibc doesn't seem to leak in my brief testing. If it's deemed a reasonable approach and nobody has any fatal flaws then I expect it would be useful to put in the contrib directory? /* * This software is distributed under the GNU General Public License * either version 2, or (at your option) any later version. */ #include postgres.h #include locale.h #include utils/builtins.h static unsigned char * xfrm(unsigned char *data, int size, const unsigned char *locale, int localesize); unsigned char * c_varcharxfrm(unsigned char *s, const unsigned char *locale); static unsigned char * xfrm(unsigned char *data, int size, const unsigned char *locale, int localesize) { size_t length = size*3+4; char *transformed; size_t transformed_length; char *oldlocale, *newlocale; /* First try a buffer perhaps big enough. */ transformed = palloc (length); oldlocale = setlocale(LC_COLLATE, NULL); if (!oldlocale) { elog(ERROR, setlocale(LC_COLLATE,NULL) failed to return a locale); return NULL; } newlocale = setlocale(LC_COLLATE, locale); if (!newlocale) { elog(ERROR, setlocale(LC_COLLATE,%s) failed to return a locale, locale); return NULL; } transformed_length = strxfrm (transformed, data, length); /* If the buffer was not large enough, resize it and try again. */ if (transformed_length = length) { elog(INFO, Calling strxfrm again because result didn't fit (%d%d), transformed_length, length); length = transformed_length + 1; transformed = palloc(length); strxfrm (transformed, data, length); } newlocale = setlocale(LC_COLLATE, oldlocale); Assert(newlocale !strcmp(newlocale,oldlocale)); if (!newlocale || strcmp(newlocale,oldlocale)) { elog(ERROR, Failed to reset locale (trying to reset locale to %s from %s instead got %s), oldlocale, locale, newlocale); } return transformed; } unsigned char * c_varcharxfrm(unsigned char *s, const unsigned char *l) { int lens = 0, lenl = 0, lenr = 0; unsigned char *str, *locale, *retval, *retval2; if (s) { lens = *(int32 *) s - 4; str = palloc(lens+1); memcpy(str, s+4, lens); str[lens]='\0'; } if (l) { lenl = *(int32 *) l - 4; locale = palloc(lenl+1); memcpy(locale, l+4, lenl); locale[lenl]='\0'; } retval = xfrm(str, lens, locale, lenl); lenr = strlen(retval); retval2 = palloc(lenr+5); memcpy(retval2+4, retval, lenr+1); *(int32 *)retval2 = lenr; return retval2; } /* * Local Variables: * tab-width: 4 * c-indent-level: 4 * c-basic-offset: 4 * End: */ SET search_path = public; SET autocommit TO 'on'; CREATE OR REPLACE FUNCTION xfrm(varchar, varchar) RETURNS varchar AS 'strxfrm.so', 'c_varcharxfrm' LANGUAGE 'C' STRICT IMMUTABLE ; -- greg ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Collation rules and multi-lingual databases
Greg Stark writes: This is my first C function so I'm really unsure that I've done the right thing. For the most part I pattern-matched off the string_io code in the contrib directory. That was just about the worst example you could have picked. Please forget everything you have seen and start by reading the documentation. In particular, learn about the version 1 call convention and about the PostgreSQL license. And read some code under src/backend/utils/adt. In particular I'm unsure about the code postgres-interfacing code in c_varcharxfrm which makes an extra copy of both parameters that are passed in and an extra copy of the result value. Are varchars guaranteed to be nul-terminated? They are guaranteed not to be null-terminated. But more generally. Would it make more sense to use text or bytea or something else to store these opaque binary strings? At least with glibc they tend to be unreadable anyways. bytea If it's deemed a reasonable approach and nobody has any fatal flaws then I expect it would be useful to put in the contrib directory? I'd expect it to be too slow to be useful. Have you run performance tests? -- Peter Eisentraut [EMAIL PROTECTED] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Collation rules and multi-lingual databases
On 22 Aug 2003, Greg Stark wrote: So, I needed a way to sort using collation rules other than the one the database was built with. So I wrote up the following function exposing strxfrm with an extra parameter to specify the LC_COLLATE value to use. This is my first C function so I'm really unsure that I've done the right thing. For the most part I pattern-matched off the string_io code in the contrib directory. In particular I'm unsure about the code postgres-interfacing code in c_varcharxfrm which makes an extra copy of both parameters that are passed in and an extra copy of the result value. Are varchars guaranteed to be nul-terminated? If so I can dispose of two of the copies. And I can probably eliminate the copying of the result by alloting extra space when I allocate it initially. But more generally. Would it make more sense to use text or bytea or something else to store these opaque binary strings? At least with glibc they tend to be unreadable anyways. Other caveats: It's condemned to be permanently non-threadsafe because the whole locale system is a non thread-safe API. Also I fear some systems will leak memory like a sieve when you call setlocale a few thousand times instead of the 1 time at initialization that they foresaw. At least glibc doesn't seem to leak in my brief testing. If it's deemed a reasonable approach and nobody has any fatal flaws then I expect it would be useful to put in the contrib directory? I'm not sure that ERROR if the locale cannot be put back is sufficient (although that case should be rare or non-existant). Unless something else in the system resets the locale, after your transaction rolls back, you're in a dangerous state. I'd think FATAL would be better. ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Collation rules and multi-lingual databases
Stephan Szabo [EMAIL PROTECTED] writes: On 22 Aug 2003, Greg Stark wrote: If it's deemed a reasonable approach and nobody has any fatal flaws then I expect it would be useful to put in the contrib directory? I'm not sure that ERROR if the locale cannot be put back is sufficient (although that case should be rare or non-existant). A bigger risk is that something might elog(ERROR) while you have the wrong locale set, denying you the chance to put back the right one. I think this code is not nearly paranoid enough about how much it does while the wrong locale is set. Unless something else in the system resets the locale, after your transaction rolls back, you're in a dangerous state. I'd think FATAL would be better. I'd go so far as to make it a critical section --- that ensures that any ERROR will be turned to FATAL, even if it's in a subroutine you call. regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Collation rules and multi-lingual databases
On Fri, 22 Aug 2003, Tom Lane wrote: Stephan Szabo [EMAIL PROTECTED] writes: On 22 Aug 2003, Greg Stark wrote: If it's deemed a reasonable approach and nobody has any fatal flaws then I expect it would be useful to put in the contrib directory? I'm not sure that ERROR if the locale cannot be put back is sufficient (although that case should be rare or non-existant). A bigger risk is that something might elog(ERROR) while you have the wrong locale set, denying you the chance to put back the right one. I think this code is not nearly paranoid enough about how much it does while the wrong locale is set. True, there are calls to palloc, elog, etc inside there, although the elog could be removed. Unless something else in the system resets the locale, after your transaction rolls back, you're in a dangerous state. I'd think FATAL would be better. I'd go so far as to make it a critical section --- that ensures that any ERROR will be turned to FATAL, even if it's in a subroutine you call. I didn't know we could do that, could be handy, although the comments imply that it turns into PANIC which would force a complete restart. Then again, it's better than a corrupted database. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly