Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-24 Thread Tom Lane
Stephan Szabo [EMAIL PROTECTED] writes:
> On Fri, 22 Aug 2003, Tom Lane wrote:
>> I'd go so far as to make it a critical section --- that ensures that any
>> ERROR will be turned to FATAL, even if it's in a subroutine you call.

> I didn't know we could do that, could be handy, although the comments
> imply that it turns into PANIC which would force a complete restart.

Right, my imprecision, it actually goes to PANIC.

> Then again, it's better than a corrupted database.

Exactly.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-24 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
> The glibc docs sample code suggests using 2x the original string
> length for the initial buffer. My testing showed that *always*
> triggered the exceptional case. A bit of experimentation led to the
> 3x+4 which eliminates it except for 0 and 1 byte strings. I'm still
> tweaking it. But on another OS, or in a more complex collation locale
> maybe you would still trigger it a lot.

On HPUX it seems you always need 4x.  Also, *there are bugs* in some
platforms' implementations of strxfrm, such that an undersized buffer
may get overrun anyway.  I had originally tried to optimize the buffer
size like this in src/backend/utils/adt/selfuncs.c's use of strxfrm,
and eventually was forced to give it up as hopeless.  I strongly suggest
using the same code now seen there:

    char   *xfrmstr;
    size_t  xfrmlen;
    size_t  xfrmlen2;

    /*
     * Note: originally we guessed at a suitable output buffer size,
     * and only needed to call strxfrm twice if our guess was too
     * small. However, it seems that some versions of Solaris have
     * buggy strxfrm that can write past the specified buffer length
     * in that scenario.  So, do it the dumb way for portability.
     *
     * Yet other systems (e.g., glibc) sometimes return a smaller value
     * from the second call than the first; thus the Assert must be <=
     * not == as you'd expect.  Can't any of these people program
     * their way out of a paper bag?
     */
    xfrmlen = strxfrm(NULL, val, 0);
    xfrmstr = (char *) palloc(xfrmlen + 1);
    xfrmlen2 = strxfrm(xfrmstr, val, xfrmlen + 1);
    Assert(xfrmlen2 <= xfrmlen);

regards, tom lane
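
A minimal standalone sketch of the two-call idiom above, assuming plain malloc in place of palloc and the standard assert in place of Assert. The name xfrm_dup is hypothetical and does not come from the PostgreSQL source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd copy of the strxfrm() transformation of val under the
 * current LC_COLLATE setting, or NULL if allocation fails.
 * xfrm_dup is a hypothetical name; malloc stands in for palloc.
 */
static char *
xfrm_dup(const char *val)
{
    size_t needed;
    size_t written;
    char  *buf;

    /* First call: size 0, so nothing is written; only the length comes back. */
    needed = strxfrm(NULL, val, 0);

    buf = malloc(needed + 1);
    if (buf == NULL)
        return NULL;

    /* Second call: the buffer is already known to be large enough. */
    written = strxfrm(buf, val, needed + 1);

    /* Some libcs report a smaller length the second time, hence <= not ==. */
    assert(written <= needed);
    (void) written;             /* avoid an unused warning under NDEBUG */

    return buf;
}
```

Comparing two such results with strcmp should order them the way strcoll orders the original strings. The price is that every string is transformed twice.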



Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-23 Thread Stephan Szabo
On Fri, 22 Aug 2003, Stephan Szabo wrote:

> On Fri, 22 Aug 2003, Tom Lane wrote:
>
> > Stephan Szabo [EMAIL PROTECTED] writes:
> > > On 22 Aug 2003, Greg Stark wrote:
> > >> If it's deemed a reasonable approach and nobody has any fatal flaws then I
> > >> expect it would be useful to put in the contrib directory?
> >
> > > I'm not sure that ERROR if the locale cannot be put back is sufficient
> > > (although that case should be rare or nonexistent).
> >
> > A bigger risk is that something might elog(ERROR) while you have the
> > wrong locale set, denying you the chance to put back the right one.
> > I think this code is not nearly paranoid enough about how much it does
> > while the wrong locale is set.
>
> True, there are calls to palloc, elog, etc inside there, although the elog
> could be removed.

Since most of that work is for an exceptional case, maybe it'd be safer
(although slower) to structure the function as

setlocale
call strxfrm (and that's it)
setlocale back
if there wasn't enough space
 make a new buffer
 setlocale
 call strxfrm (and that's it)
 setlocale back

Probably putting the sl/strxfrm/sl into its own function.
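
The outline above could be sketched in standalone C roughly as follows. This is only an illustration under stated assumptions: malloc and realloc stand in for palloc, abort() stands in for a FATAL/PANIC report, and the names xfrm_once and xfrm_in_locale are hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Set LC_COLLATE to `locale`, call strxfrm() and nothing else, then put the
 * previous locale back. Returns 0 and stores the length strxfrm() reported
 * in *len, or returns -1 if the locale could not be saved or set.
 */
static int
xfrm_once(char *dst, size_t dstsize, const char *src,
          const char *locale, size_t *len)
{
    const char *current;
    char       *saved;

    assert(dst != NULL && src != NULL && locale != NULL && len != NULL);

    current = setlocale(LC_COLLATE, NULL);
    if (current == NULL)
        return -1;

    /* Copy the name first: later setlocale() calls may overwrite it. */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_COLLATE, locale) == NULL)
    {
        free(saved);
        return -1;
    }

    *len = strxfrm(dst, src, dstsize);

    /* Failing to restore is unrecoverable; abort() stands in for FATAL/PANIC. */
    if (setlocale(LC_COLLATE, saved) == NULL)
        abort();
    free(saved);
    return 0;
}

/* Transform src under `locale`; returns a malloc'd key or NULL on failure. */
static char *
xfrm_in_locale(const char *src, const char *locale)
{
    size_t size = strlen(src) * 3 + 4;  /* the 3x+4 guess from this thread */
    size_t len;
    char  *buf = malloc(size);

    if (buf == NULL)
        return NULL;
    if (xfrm_once(buf, size, src, locale, &len) != 0)
    {
        free(buf);
        return NULL;
    }

    if (len >= size)
    {
        /* Exceptional case: grow the buffer while the original locale is set. */
        char *bigger = realloc(buf, len + 1);

        if (bigger == NULL)
        {
            free(buf);
            return NULL;
        }
        buf = bigger;
        size = len + 1;
        if (xfrm_once(buf, size, src, locale, &len) != 0 || len >= size)
        {
            free(buf);
            return NULL;
        }
    }
    return buf;
}
```

All allocation for the transformed string happens outside the window in which the foreign locale is active, at the cost of two extra setlocale calls whenever the first guess is too small.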




Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-23 Thread Greg Stark
Stephan Szabo [EMAIL PROTECTED] writes:

> Since most of that work is for an exceptional case, maybe it'd be safer
> (although slower) to structure the function as

Yeah I thought of that. But if making it a critical section is cheap then it's
probably a better approach. The problem with restoring the locale for the
palloc is that if the user is unlucky he might sort a table of thousands of
strings that all trigger the exception case.

The glibc docs sample code suggests using 2x the original string length for
the initial buffer. My testing showed that *always* triggered the exceptional
case. A bit of experimentation led to the 3x+4 which eliminates it except for
0 and 1 byte strings. I'm still tweaking it. But on another OS, or in a more
complex collation locale maybe you would still trigger it a lot. Even as it is
if you happen to try to sort a large list of single character strings you would
trigger it a lot.

I have some documentation reading to do apparently before I can fix this up.


> setlocale
> call strxfrm (and that's it)
> setlocale back
> if there wasn't enough space
>  make a new buffer
>  setlocale
>  call strxfrm (and that's it)
>  setlocale back
>
> Probably putting the sl/strxfrm/sl into its own function.

-- 
greg




Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-23 Thread Stephan Szabo
On 23 Aug 2003, Greg Stark wrote:

> Stephan Szabo [EMAIL PROTECTED] writes:
>
> > Since most of that work is for an exceptional case, maybe it'd be safer
> > (although slower) to structure the function as
>
> Yeah I thought of that. But if making it a critical section is cheap then it's
> probably a better approach. The problem with restoring the locale for the
> palloc is that if the user is unlucky he might sort a table of thousands of
> strings that all trigger the exception case.

True. I still worry about the critical section since an error will cause
the entire database system to restart, but as I'd also forgotten about
signals which might cause problems (uncertain without looking) I'm not
sure you have a good alternative even not counting the speed issues.




Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-23 Thread Joe Conway
Greg Stark wrote:
> Yeah I thought of that. But if making it a critical section is cheap then it's
> probably a better approach. The problem with restoring the locale for the
> palloc is that if the user is unlucky he might sort a table of thousands of
> strings that all trigger the exception case.

What about something like this?
8<--------------------------
#include <setjmp.h>
#include <string.h>
#include <locale.h>

#include "postgres.h"
#include "fmgr.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"

#define GET_STR(textp) \
  DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp)))
#define GET_BYTEA(str_) \
  DatumGetTextP(DirectFunctionCall1(byteain, CStringGetDatum(str_)))
#define MAX_BYTEA_LEN   0x3fff

/*
 * pg_strxfrm - Function to convert string similar to the strxfrm C
 * function using a specified locale.
 */
extern Datum pg_strxfrm(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(pg_strxfrm);

Datum
pg_strxfrm(PG_FUNCTION_ARGS)
{
  char   *str = GET_STR(PG_GETARG_TEXT_P(0));
  size_t  str_len = strlen(str);
  char   *localestr = GET_STR(PG_GETARG_TEXT_P(1));
  size_t  approx_trans_len = 4 + (str_len * 3);
  char   *trans = (char *) palloc(approx_trans_len + 1);
  size_t  actual_trans_len;
  char   *oldlocale;
  char   *newlocale;
  sigjmp_buf  save_restart;

  if (approx_trans_len > MAX_BYTEA_LEN)
    elog(ERROR, "source string too long to transform");

  oldlocale = setlocale(LC_COLLATE, NULL);
  if (!oldlocale)
    elog(ERROR, "setlocale failed to return a locale");

  /* catch elog while locale is set other than the default */
  memcpy(&save_restart, &Warn_restart, sizeof(save_restart));
  if (sigsetjmp(Warn_restart, 1) != 0)
  {
    memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
    newlocale = setlocale(LC_COLLATE, oldlocale);
    if (!newlocale)
      elog(PANIC, "setlocale failed to reset locale: %s", localestr);
    siglongjmp(Warn_restart, 1);
  }

  newlocale = setlocale(LC_COLLATE, localestr);
  if (!newlocale)
    elog(ERROR, "setlocale failed to set a locale: %s", localestr);

  actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);

  /* if the buffer was not large enough, resize it and try again */
  if (actual_trans_len >= approx_trans_len)
  {
    approx_trans_len = actual_trans_len + 1;
    if (approx_trans_len > MAX_BYTEA_LEN)
      elog(ERROR, "source string too long to transform");

    trans = (char *) repalloc(trans, approx_trans_len + 1);
    actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);

    /* if the buffer still not large enough, punt */
    if (actual_trans_len >= approx_trans_len)
      elog(ERROR, "strxfrm failed, buffer insufficient");
  }

  newlocale = setlocale(LC_COLLATE, oldlocale);
  if (!newlocale)
    elog(PANIC, "setlocale failed to reset locale: %s", localestr);

  PG_RETURN_BYTEA_P(GET_BYTEA(trans));
}
8<--------------------------

Joe



Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-23 Thread Joe Conway
Joe Conway wrote:
> What about something like this?

Oops! Forgot to restore error handling. See below:

Joe

8<--------------------------

#include <setjmp.h>
#include <string.h>
#include <locale.h>

#include "postgres.h"
#include "fmgr.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"

#define GET_STR(textp) \
  DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp)))
#define GET_BYTEA(str_) \
  DatumGetTextP(DirectFunctionCall1(byteain, CStringGetDatum(str_)))
#define MAX_BYTEA_LEN   0x3fff

/*
 * pg_strxfrm - Function to convert string similar to the strxfrm C
 * function using a specified locale.
 */
extern Datum pg_strxfrm(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(pg_strxfrm);

Datum
pg_strxfrm(PG_FUNCTION_ARGS)
{
  char   *str = GET_STR(PG_GETARG_TEXT_P(0));
  size_t  str_len = strlen(str);
  char   *localestr = GET_STR(PG_GETARG_TEXT_P(1));
  size_t  approx_trans_len = 4 + (str_len * 3);
  char   *trans = (char *) palloc(approx_trans_len + 1);
  size_t  actual_trans_len;
  char   *oldlocale;
  char   *newlocale;
  sigjmp_buf  save_restart;

  if (approx_trans_len > MAX_BYTEA_LEN)
    elog(ERROR, "source string too long to transform");

  oldlocale = setlocale(LC_COLLATE, NULL);
  if (!oldlocale)
    elog(ERROR, "setlocale failed to return a locale");

  /* catch elog while locale is set other than the default */
  memcpy(&save_restart, &Warn_restart, sizeof(save_restart));
  if (sigsetjmp(Warn_restart, 1) != 0)
  {
    memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
    newlocale = setlocale(LC_COLLATE, oldlocale);
    if (!newlocale)
      elog(PANIC, "setlocale failed to reset locale: %s", localestr);
    siglongjmp(Warn_restart, 1);
  }

  newlocale = setlocale(LC_COLLATE, localestr);
  if (!newlocale)
    elog(ERROR, "setlocale failed to set a locale: %s", localestr);

  actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);

  /* if the buffer was not large enough, resize it and try again */
  if (actual_trans_len >= approx_trans_len)
  {
    approx_trans_len = actual_trans_len + 1;
    if (approx_trans_len > MAX_BYTEA_LEN)
      elog(ERROR, "source string too long to transform");

    trans = (char *) repalloc(trans, approx_trans_len + 1);
    actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);

    /* if the buffer still not large enough, punt */
    if (actual_trans_len >= approx_trans_len)
      elog(ERROR, "strxfrm failed, buffer insufficient");
  }

  newlocale = setlocale(LC_COLLATE, oldlocale);
  if (!newlocale)
    elog(PANIC, "setlocale failed to reset locale: %s", localestr);

  /* restore normal error handling */
  memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));

  PG_RETURN_BYTEA_P(GET_BYTEA(trans));
}
8<--------------------------
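
The save/catch/restore pattern in the function above can be tried outside the backend with a small standalone sketch. Everything here is a stand-in chosen for illustration: error_target plays the role of Warn_restart, raise_error() plays the role of elog(ERROR), plain setjmp/longjmp replace sigsetjmp/siglongjmp, and abort() replaces a FATAL/PANIC report. Unlike the original, the handler returns -1 instead of re-raising to the outer handler.

```c
#include <assert.h>
#include <locale.h>
#include <setjmp.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the backend's Warn_restart jump target. */
static jmp_buf error_target;

/* Hypothetical stand-in for elog(ERROR): jump to the current handler. */
static void
raise_error(void)
{
    longjmp(error_target, 1);
}

/* Two sample callbacks, one that succeeds and one that fails. */
static void work_ok(void) { }
static void work_fails(void) { raise_error(); }

/*
 * Run work() with LC_COLLATE set to `locale`. Returns 0 on success and -1 if
 * the locale could not be set or work() raised an error. In both cases the
 * original locale and the original jump target are back in place on return.
 */
static int
with_collate_locale(const char *locale, void (*work)(void))
{
    jmp_buf     saved_target;
    const char *current;
    char       *oldlocale;

    assert(locale != NULL && work != NULL);

    current = setlocale(LC_COLLATE, NULL);
    if (current == NULL)
        return -1;

    /* Keep our own copy; later setlocale() calls may overwrite the name. */
    oldlocale = malloc(strlen(current) + 1);
    if (oldlocale == NULL)
        return -1;
    strcpy(oldlocale, current);

    /* Install our own handler, remembering the previous one. */
    memcpy(&saved_target, &error_target, sizeof(saved_target));
    if (setjmp(error_target) != 0)
    {
        /* We get here only via raise_error(). */
        memcpy(&error_target, &saved_target, sizeof(error_target));
        if (setlocale(LC_COLLATE, oldlocale) == NULL)
            abort();
        free(oldlocale);
        return -1;              /* the original re-raises here instead */
    }

    if (setlocale(LC_COLLATE, locale) == NULL)
    {
        memcpy(&error_target, &saved_target, sizeof(error_target));
        free(oldlocale);
        return -1;
    }

    work();

    if (setlocale(LC_COLLATE, oldlocale) == NULL)
        abort();
    memcpy(&error_target, &saved_target, sizeof(error_target));
    free(oldlocale);
    return 0;
}
```

Calling with_collate_locale("C", work_fails) should return -1 and leave LC_COLLATE as it was before the call, which is the property the pattern is meant to guarantee.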





Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-23 Thread Greg Stark

Joe Conway [EMAIL PROTECTED] writes:

> if (sigsetjmp(Warn_restart, 1) != 0)
> {
>   memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
>   newlocale = setlocale(LC_COLLATE, oldlocale);
>   if (!newlocale)
>     elog(PANIC, "setlocale failed to reset locale: %s", localestr);
>   siglongjmp(Warn_restart, 1);
> }

Well presumably we want FATAL not PANIC.

And do we still need HOLD_INTERRUPTS() .. RESUME_INTERRUPTS() ?

I was afraid that was getting into bed too much with the error handling. I
have an implementation that restores the locale around the palloc and
increases the initial guess for future calls to avoid degenerate behaviour.
I'm not sure which approach is preferable.

-- 
greg
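
One possible reading of "increases the initial guess for future calls", as a standalone sketch. It is not the implementation referred to above, which was not posted. The names xfrm_adaptive, guess_factor and guess_extra are hypothetical, malloc/realloc stand in for palloc, and the locale switching is left out. It also inherits the risk mentioned elsewhere in this thread that some strxfrm implementations overrun an undersized buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Initial buffer guess is strlen(src) * guess_factor + guess_extra. */
static size_t guess_factor = 3;
static size_t guess_extra = 4;

/* Transform src under the current locale; returns a malloc'd key or NULL. */
static char *
xfrm_adaptive(const char *src)
{
    size_t srclen;
    size_t size;
    size_t len;
    char  *buf;

    assert(src != NULL);

    srclen = strlen(src);
    size = srclen * guess_factor + guess_extra;
    buf = malloc(size);
    if (buf == NULL)
        return NULL;

    len = strxfrm(buf, src, size);
    if (len >= size)
    {
        char *bigger = realloc(buf, len + 1);

        if (bigger == NULL)
        {
            free(buf);
            return NULL;
        }
        buf = bigger;

        /* Raise the guess so a string like this one fits first time next call. */
        if (srclen == 0)
            guess_extra = len + 1;
        else
            while (srclen * guess_factor + guess_extra <= len)
                guess_factor++;

        size = len + 1;
        len = strxfrm(buf, src, size);
        if (len >= size)
        {
            free(buf);
            return NULL;
        }
    }
    return buf;
}
```

After the first miss, later strings of similar shape should fit on the first call, so a sort over thousands of such strings pays for the retry only once rather than on every row.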




Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-23 Thread Joe Conway
Greg Stark wrote:
> Joe Conway [EMAIL PROTECTED] writes:
>
>> if (sigsetjmp(Warn_restart, 1) != 0)
>> {
>>   memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
>>   newlocale = setlocale(LC_COLLATE, oldlocale);
>>   if (!newlocale)
>>     elog(PANIC, "setlocale failed to reset locale: %s", localestr);
>>   siglongjmp(Warn_restart, 1);
>> }
>
> Well presumably we want FATAL not PANIC.

Yeah, that was a bit overzealous. I really intended FATAL.

> And do we still need HOLD_INTERRUPTS() .. RESUME_INTERRUPTS() ?

I'm not sure, but I think not.

> I was afraid that was getting into bed too much with the error handling. I
> have an implementation that restores the locale around the palloc and
> increases the initial guess for future calls to avoid degenerate behaviour.

Well the intention of the sigsetjmp was to avoid the need to flip the
locale to and fro. Increasing the initial guess might be good, but it
will further restrict the length of the input string you can work with.
But I'd guess you'll not want to use this with extremely long strings
anyway.

Joe



Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-22 Thread Greg Stark

So, I needed a way to sort using collation rules other than the one the
database was built with. So I wrote up the following function exposing strxfrm
with an extra parameter to specify the LC_COLLATE value to use.

This is my first C function so I'm really unsure that I've done the right
thing. For the most part I pattern-matched off the string_io code in the
contrib directory. 

In particular I'm unsure about the postgres-interfacing code in
c_varcharxfrm which makes an extra copy of both parameters that are passed in
and an extra copy of the result value. Are varchars guaranteed to be
nul-terminated? If so I can dispose of two of the copies. And I can probably
eliminate the copying of the result by allotting extra space when I allocate it
initially.

But more generally, would it make more sense to use text or bytea or something
else to store these opaque binary strings? At least with glibc they tend to be
unreadable anyways.

Other caveats: It's condemned to be permanently non-threadsafe because the
whole locale system is a non thread-safe API. Also I fear some systems will
leak memory like a sieve when you call setlocale a few thousand times instead
of the 1 time at initialization that they foresaw. At least glibc doesn't seem
to leak in my brief testing.

If it's deemed a reasonable approach and nobody has any fatal flaws then I
expect it would be useful to put in the contrib directory?


/*
 * This software is distributed under the GNU General Public License
 * either version 2, or (at your option) any later version.
 */

#include "postgres.h"

#include <locale.h>

#include "utils/builtins.h"

static 
unsigned char * xfrm(unsigned char *data, int size, const unsigned char *locale, int localesize);

unsigned char * c_varcharxfrm(unsigned char *s, const unsigned char *locale);


static unsigned char *
xfrm(unsigned char *data, int size, const unsigned char *locale, int localesize)
{
  size_t length = size*3+4;
  char *transformed;
  size_t transformed_length;
  char *oldlocale, *newlocale;

  /* First try a buffer perhaps big enough.  */
  transformed = palloc (length);

  oldlocale = setlocale(LC_COLLATE, NULL);
  if (!oldlocale) {
    elog(ERROR, "setlocale(LC_COLLATE,NULL) failed to return a locale");
    return NULL;
  }

  newlocale = setlocale(LC_COLLATE, locale);
  if (!newlocale) {
    elog(ERROR, "setlocale(LC_COLLATE,%s) failed to return a locale", locale);
    return NULL;
  }

  transformed_length = strxfrm (transformed, data, length);

  /* If the buffer was not large enough, resize it and try again.  */
  if (transformed_length >= length) {
    elog(INFO, "Calling strxfrm again because result didn't fit (%d>%d)", transformed_length, length);
    length = transformed_length + 1;
    transformed = palloc(length);
    strxfrm (transformed, data, length);
  }

  newlocale = setlocale(LC_COLLATE, oldlocale);

  Assert(newlocale && !strcmp(newlocale,oldlocale));
  if (!newlocale || strcmp(newlocale,oldlocale)) {
    elog(ERROR, "Failed to reset locale (trying to reset locale to %s from %s instead got %s)", oldlocale, locale, newlocale);
  }

  return transformed;
}


unsigned char *
c_varcharxfrm(unsigned char *s, const unsigned char *l)
{
  int lens = 0, lenl = 0, lenr = 0;
  unsigned char *str, *locale, *retval, *retval2;

  if (s) {
    lens = *(int32 *) s - 4;
    str = palloc(lens+1);
    memcpy(str, s+4, lens);
    str[lens]='\0';
  }

  if (l) {
    lenl = *(int32 *) l - 4;
    locale = palloc(lenl+1);
    memcpy(locale, l+4, lenl);
    locale[lenl]='\0';
  }

  retval = xfrm(str, lens, locale, lenl);

  lenr = strlen(retval);
  retval2 = palloc(lenr+5);
  memcpy(retval2+4, retval, lenr+1);
  *(int32 *)retval2 = lenr;

  return retval2;
}





/*
 * Local Variables:
 *	tab-width: 4
 *	c-indent-level: 4
 *	c-basic-offset: 4
 * End:
 */
SET search_path = public;

SET autocommit TO 'on';

CREATE OR REPLACE FUNCTION xfrm(varchar, varchar)
RETURNS varchar
AS 'strxfrm.so', 'c_varcharxfrm'
LANGUAGE 'C' STRICT IMMUTABLE ;



-- 
greg



Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-22 Thread Peter Eisentraut
Greg Stark writes:

> This is my first C function so I'm really unsure that I've done the right
> thing. For the most part I pattern-matched off the string_io code in the
> contrib directory.

That was just about the worst example you could have picked.  Please
forget everything you have seen and start by reading the documentation.
In particular, learn about the version 1 call convention and about the
PostgreSQL license.  And read some code under src/backend/utils/adt.

> In particular I'm unsure about the postgres-interfacing code in
> c_varcharxfrm which makes an extra copy of both parameters that are passed in
> and an extra copy of the result value. Are varchars guaranteed to be
> nul-terminated?

They are guaranteed not to be null-terminated.

> But more generally, would it make more sense to use text or bytea or something
> else to store these opaque binary strings? At least with glibc they tend to be
> unreadable anyways.

bytea

> If it's deemed a reasonable approach and nobody has any fatal flaws then I
> expect it would be useful to put in the contrib directory?

I'd expect it to be too slow to be useful.  Have you run performance tests?

-- 
Peter Eisentraut   [EMAIL PROTECTED]
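
Because the stored value carries a length and no terminator, C functions such as strxfrm need a terminated copy first. A standalone sketch of that step follows, using a mock length-prefixed struct rather than the real PostgreSQL varlena type. The names mock_varlena, MOCK_HDRSZ, mock_make and mock_to_cstring are hypothetical; in the backend the corresponding macros are VARSIZE, VARDATA and VARHDRSZ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Size of the mock length header, in bytes. */
#define MOCK_HDRSZ ((int32_t) sizeof(int32_t))

/* Hypothetical stand-in for a variable-length value. */
typedef struct
{
    int32_t total_len;          /* header plus payload, in bytes */
    char    data[];             /* payload, NOT NUL-terminated */
} mock_varlena;

/* Build a mock value from the first n bytes of `bytes`. */
static mock_varlena *
mock_make(const char *bytes, size_t n)
{
    mock_varlena *v = malloc(sizeof(mock_varlena) + n);

    if (v == NULL)
        return NULL;
    v->total_len = MOCK_HDRSZ + (int32_t) n;
    memcpy(v->data, bytes, n);
    return v;
}

/* Copy the payload into a fresh NUL-terminated C string. */
static char *
mock_to_cstring(const mock_varlena *v)
{
    size_t n;
    char  *s;

    assert(v != NULL && v->total_len >= MOCK_HDRSZ);

    n = (size_t) (v->total_len - MOCK_HDRSZ);
    s = malloc(n + 1);
    if (s == NULL)
        return NULL;
    memcpy(s, v->data, n);
    s[n] = '\0';                /* the terminator the stored value lacks */
    return s;
}
```

The copy is therefore not optional for the input arguments. A result returned as bytea, on the other hand, needs its length recorded explicitly rather than recovered with strlen.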




Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-22 Thread Stephan Szabo

On 22 Aug 2003, Greg Stark wrote:


> So, I needed a way to sort using collation rules other than the one the
> database was built with. So I wrote up the following function exposing strxfrm
> with an extra parameter to specify the LC_COLLATE value to use.
>
> This is my first C function so I'm really unsure that I've done the right
> thing. For the most part I pattern-matched off the string_io code in the
> contrib directory.
>
> In particular I'm unsure about the postgres-interfacing code in
> c_varcharxfrm which makes an extra copy of both parameters that are passed in
> and an extra copy of the result value. Are varchars guaranteed to be
> nul-terminated? If so I can dispose of two of the copies. And I can probably
> eliminate the copying of the result by allotting extra space when I allocate it
> initially.
>
> But more generally, would it make more sense to use text or bytea or something
> else to store these opaque binary strings? At least with glibc they tend to be
> unreadable anyways.
>
> Other caveats: It's condemned to be permanently non-threadsafe because the
> whole locale system is a non thread-safe API. Also I fear some systems will
> leak memory like a sieve when you call setlocale a few thousand times instead
> of the 1 time at initialization that they foresaw. At least glibc doesn't seem
> to leak in my brief testing.
>
> If it's deemed a reasonable approach and nobody has any fatal flaws then I
> expect it would be useful to put in the contrib directory?

I'm not sure that ERROR if the locale cannot be put back is sufficient
(although that case should be rare or nonexistent). Unless something else
in the system resets the locale, after your transaction rolls back, you're
in a dangerous state.  I'd think FATAL would be better.




Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-22 Thread Tom Lane
Stephan Szabo [EMAIL PROTECTED] writes:
> On 22 Aug 2003, Greg Stark wrote:
>> If it's deemed a reasonable approach and nobody has any fatal flaws then I
>> expect it would be useful to put in the contrib directory?

> I'm not sure that ERROR if the locale cannot be put back is sufficient
> (although that case should be rare or nonexistent).

A bigger risk is that something might elog(ERROR) while you have the
wrong locale set, denying you the chance to put back the right one.
I think this code is not nearly paranoid enough about how much it does
while the wrong locale is set.

> Unless something else
> in the system resets the locale, after your transaction rolls back, you're
> in a dangerous state.  I'd think FATAL would be better.

I'd go so far as to make it a critical section --- that ensures that any
ERROR will be turned to FATAL, even if it's in a subroutine you call.

regards, tom lane



Re: [HACKERS] Collation rules and multi-lingual databases

2003-08-22 Thread Stephan Szabo
On Fri, 22 Aug 2003, Tom Lane wrote:

> Stephan Szabo [EMAIL PROTECTED] writes:
> > On 22 Aug 2003, Greg Stark wrote:
> >> If it's deemed a reasonable approach and nobody has any fatal flaws then I
> >> expect it would be useful to put in the contrib directory?
>
> > I'm not sure that ERROR if the locale cannot be put back is sufficient
> > (although that case should be rare or nonexistent).
>
> A bigger risk is that something might elog(ERROR) while you have the
> wrong locale set, denying you the chance to put back the right one.
> I think this code is not nearly paranoid enough about how much it does
> while the wrong locale is set.

True, there are calls to palloc, elog, etc inside there, although the elog
could be removed.

> > Unless something else
> > in the system resets the locale, after your transaction rolls back, you're
> > in a dangerous state.  I'd think FATAL would be better.
>
> I'd go so far as to make it a critical section --- that ensures that any
> ERROR will be turned to FATAL, even if it's in a subroutine you call.

I didn't know we could do that, could be handy, although the comments
imply that it turns into PANIC which would force a complete restart.  Then
again, it's better than a corrupted database.

