Re: [HACKERS] More message encoding woes

2009-04-08 Thread Heikki Linnakangas

Peter Eisentraut wrote:

On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote:

Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
pg_get_encoding_from_locale(NULL) == encoding which is closer to
what we actually want. The downside is that
pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
that we don't need to keep this in sync with the rules we have in CREATE
DATABASE that enforce that locale matches encoding.


I would have figured we can skip this whole thing when LC_CTYPE != C, because 
it should be guaranteed that LC_CTYPE matches the database encoding in this 
case, no?


Ok, committed it like that after all.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] More message encoding woes

2009-04-07 Thread Heikki Linnakangas

Peter Eisentraut wrote:
In practice you get either the GNU or the Solaris version of gettext, and at 
least the GNU version can cope with all the encoding names that the currently 
Windows-only code path produces. 


It doesn't. On my laptop running Debian testing:

hlinn...@heikkilaptop:~$ LC_ALL=fi_FI.UTF-8 gettext
gettext: ei riittävästi argumentteja
hlinn...@heikkilaptop:~$ LC_ALL=fi_FI.LATIN1 gettext
gettext: missing arguments
hlinn...@heikkilaptop:~$ LC_ALL=fi_FI.ISO-8859-1 gettext
gettext: ei riitt�v�sti argumentteja

Using the name for the latin1 encoding in the currently Windows-only 
mapping table, "LATIN1", you get no translation because that name is not 
recognized by the system. Using the other name, "ISO-8859-1", it works. 
"LATIN1" is not listed in the output of "locale -m" either.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] More message encoding woes

2009-04-07 Thread Peter Eisentraut
On Tuesday 07 April 2009 11:21:25 Heikki Linnakangas wrote:
 Peter Eisentraut wrote:
  In practice you get either the GNU or the Solaris version of gettext, and
  at least the GNU version can cope with all the encoding names that the
  currently Windows-only code path produces.

 It doesn't. On my laptop running Debian testing:

 hlinn...@heikkilaptop:~$ LC_ALL=fi_FI.UTF-8 gettext
 gettext: ei riittävästi argumentteja
 hlinn...@heikkilaptop:~$ LC_ALL=fi_FI.LATIN1 gettext
 gettext: missing arguments

That is because no locale by the name fi_FI.LATIN1 exists.

 hlinn...@heikkilaptop:~$ LC_ALL=fi_FI.ISO-8859-1 gettext
 gettext: ei riitt�v�sti argumentteja

 Using the name for the latin1 encoding in the currently Windows-only
 mapping table, "LATIN1", you get no translation because that name is not
 recognized by the system. Using the other name, "ISO-8859-1", it works.
 "LATIN1" is not listed in the output of "locale -m" either.

You are looking in the wrong place.  What we need is for iconv to recognize 
the encoding name used by PostgreSQL.  iconv --list is the primary hint for 
that.

The locale names provided by the operating system are arbitrary and unrelated.



Re: [HACKERS] More message encoding woes

2009-04-07 Thread Heikki Linnakangas

Hiroshi Inoue wrote:

Heikki Linnakangas wrote:
I just tried that, and it seems that gettext() does transliteration, 
so any characters that have no counterpart in the database encoding 
will be replaced with something similar, or question marks. Assuming 
that's universal across platforms, and I think it is, using the empty 
string should work.


It also means that you can use lc_messages='ja' with 
server_encoding='latin1', but it will be unreadable because all the 
non-ascii characters are replaced with question marks. For something 
like lc_messages='es_ES' and server_encoding='koi8-r', it will still 
look quite nice.


Attached is a patch I've been testing. Seems to work quite well. It
would be nice if someone could test it on Windows, which seems to be a 
bit special in this regard.


Unfortunately it doesn't seem to work on Windows.

First, any combination of valid lc_messages and a non-existent encoding
passes the test "strcmp(gettext(""), "") != 0".


Now that's strange. Can you check what gettext() returns in that case 
then?



Second, for example, the combination of ja (lc_messages) and ISO-8859-1
passes the test, but the test fails after I changed the last_translator
part of the ja message catalog to contain Japanese kanji characters.


Yeah, the inconsistency is not nice. In practice, though, if you try to 
use an encoding that can't represent kanji characters with Japanese, 
you're better off falling back to English than displaying strings full 
of question marks. The same goes for all other languages as well, IMHO. 
If you're going to fall back to English for some translations (and in 
practice "some" is a pretty high percentage) because the encoding is 
missing a character and transliteration is not working, you might as 
well not bother translating at all.


If we add the dummy translations to all .po files, we could force 
fallback-to-English in situations like that by including some or all of 
the non-ASCII characters used in the language in the dummy translation.


I'm thinking of going ahead with this approach, without the dummy 
translation, after we have resolved the first issue on Windows. We can 
add the dummy translations later if needed, but I don't think anyone 
will care.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] More message encoding woes

2009-04-07 Thread Hiroshi Inoue

Heikki Linnakangas wrote:

Hiroshi Inoue wrote:

Heikki Linnakangas wrote:
I just tried that, and it seems that gettext() does transliteration, 
so any characters that have no counterpart in the database encoding 
will be replaced with something similar, or question marks. Assuming 
that's universal across platforms, and I think it is, using the empty 
string should work.


It also means that you can use lc_messages='ja' with 
server_encoding='latin1', but it will be unreadable because all the 
non-ascii characters are replaced with question marks. For something 
like lc_messages='es_ES' and server_encoding='koi8-r', it will still 
look quite nice.


Attached is a patch I've been testing. Seems to work quite well. It
would be nice if someone could test it on Windows, which seems to be 
a bit special in this regard.


Unfortunately it doesn't seem to work on Windows.

 First, any combination of valid lc_messages and a non-existent encoding
 passes the test "strcmp(gettext(""), "") != 0".


Now that's strange. Can you check what gettext() returns in that case 
then?


The translated but unconverted string. I'm not sure if it's a bug or not;
I can see no description of what should be returned in such a case.


Second, for example, the combination of ja (lc_messages) and ISO-8859-1
passes the test, but the test fails after I changed the last_translator
part of the ja message catalog to contain Japanese kanji characters.


Yeah, the inconsistency is not nice. In practice, though, if you try to 
use an encoding that can't represent kanji characters with Japanese, 
you're better off falling back to English than displaying strings full 
of question marks. The same goes for all other languages as well, IMHO. 
If you're going to fall back to English for some translations (and in 
practice "some" is a pretty high percentage) because the encoding is 
missing a character and transliteration is not working, you might as 
well not bother translating at all.


What is wrong with checking if the codeset is valid using iconv_open()?

regards,
Hiroshi Inoue




Re: [HACKERS] More message encoding woes

2009-04-07 Thread Heikki Linnakangas

Hiroshi Inoue wrote:

What is wrong with checking if the codeset is valid using iconv_open()?


That would probably work as well. We'd have to decide what we'd try to 
convert from with iconv_open(). UTF-8 might be a safe choice. We don't 
currently use iconv_open() anywhere in the backend, though, so I'm 
hesitant to add a dependency for this. GNU gettext() uses iconv, but I'm 
not sure if that's true for all gettext() implementations.


Peter's suggestion seems the best ATM, though.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] More message encoding woes

2009-04-07 Thread Heikki Linnakangas

Peter Eisentraut wrote:

On Tuesday 07 April 2009 11:21:25 Heikki Linnakangas wrote:

Using the name for the latin1 encoding in the currently Windows-only
mapping table, "LATIN1", you get no translation because that name is not
recognized by the system. Using the other name, "ISO-8859-1", it works.
"LATIN1" is not listed in the output of "locale -m" either.


You are looking in the wrong place.  What we need is for iconv to recognize 
the encoding name used by PostgreSQL.  iconv --list is the primary hint for 
that.


The locale names provided by the operating system are arbitrary and unrelated.


Oh, ok. I guess we can do the simple fix you proposed then.

Patch attached. Instead of checking for LC_CTYPE == C, I'm checking 
pg_get_encoding_from_locale(NULL) == encoding which is closer to 
what we actually want. The downside is that 
pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is 
that we don't need to keep this in sync with the rules we have in CREATE 
DATABASE that enforce that locale matches encoding.


This doesn't include the cleanup to make the mapping table easier to 
maintain that Magnus was going to have a look at before I started this 
thread.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/utils/mb/mbutils.c
--- b/src/backend/utils/mb/mbutils.c
***
*** 890,896  cliplen(const char *str, int len, int limit)
  	return l;
  }
  
! #if defined(ENABLE_NLS) && defined(WIN32)
  static const struct codeset_map {
  	int	encoding;
  	const char *codeset;
--- 890,896 
  	return l;
  }
  
! #if defined(ENABLE_NLS)
  static const struct codeset_map {
  	int	encoding;
  	const char *codeset;
***
*** 929,935  static const struct codeset_map {
  	{PG_EUC_TW, "EUC-TW"},
  	{PG_EUC_JIS_2004, "EUC-JP"}
  };
! #endif /* WIN32 */
  
  void
  SetDatabaseEncoding(int encoding)
--- 929,935 
  	{PG_EUC_TW, "EUC-TW"},
  	{PG_EUC_JIS_2004, "EUC-JP"}
  };
! #endif /* ENABLE_NLS */
  
  void
  SetDatabaseEncoding(int encoding)
***
*** 946,960  SetDatabaseEncoding(int encoding)
  }
  
  /*
!  * On Windows, we need to explicitly bind gettext to the correct
!  * encoding, because gettext() tends to get confused.
   */
  void
  pg_bind_textdomain_codeset(const char *domainname, int encoding)
  {
! #if defined(ENABLE_NLS) && defined(WIN32)
  	int i;
  
  	for (i = 0; i < lengthof(codeset_map_array); i++)
  	{
  		if (codeset_map_array[i].encoding == encoding)
--- 946,975 
  }
  
  /*
!  * Bind gettext to the correct encoding.
   */
  void
  pg_bind_textdomain_codeset(const char *domainname, int encoding)
  {
! #if defined(ENABLE_NLS)
  	int i;
  
+ 	/*
+ 	 * gettext() uses the encoding specified by LC_CTYPE by default,
+ 	 * so if that matches the database encoding, we don't need to do
+ 	 * anything. This is not for performance, but because if
+ 	 * bind_textdomain_codeset() doesn't recognize the codeset name we
+ 	 * pass it, it will fall back to English and we don't want that to 
+ 	 * happen unnecessarily.
+ 	 *
+ 	 * On Windows, though, gettext() tends to get confused so we always
+ 	 * bind it.
+ 	 */
+ #ifndef WIN32
+ 	if (pg_get_encoding_from_locale(NULL) == encoding)
+ 		return;
+ #endif
+ 
  	for (i = 0; i < lengthof(codeset_map_array); i++)
  	{
  		if (codeset_map_array[i].encoding == encoding)



Re: [HACKERS] More message encoding woes

2009-04-07 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Hiroshi Inoue wrote:
 What is wrong with checking if the codeset is valid using iconv_open()?

 That would probably work as well. We'd have to decide what we'd try to 
 convert from with iconv_open().

The problem I have with that is that you are now guessing at *two*
platform-specific encoding names not one, plus hoping there is a
conversion between the two.

If we knew the encoding name embedded in the .mo file we wanted to use,
then it would be sensible to try to use that as the source codeset.

 GNU gettext() uses iconv, but I'm 
 not sure if that's true for all gettext() implementations.

Yeah, that's another problem.

regards, tom lane



Re: [HACKERS] More message encoding woes

2009-04-07 Thread Heikki Linnakangas

Peter Eisentraut wrote:

On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote:

Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
pg_get_encoding_from_locale(NULL) == encoding which is closer to
what we actually want. The downside is that
pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
that we don't need to keep this in sync with the rules we have in CREATE
DATABASE that enforce that locale matches encoding.


I would have figured we can skip this whole thing when LC_CTYPE != C, because 
it should be guaranteed that LC_CTYPE matches the database encoding in this 
case, no?


Yes, except if pg_get_encoding_from_locale() couldn't figure out what PG 
encoding LC_CTYPE corresponds to. We let CREATE DATABASE go ahead in 
that case, trusting that the user knows what he's doing. I suppose we 
can extend that trust to this case too, and assume that the encoding of 
LC_CTYPE actually matches the database encoding.


Or if the encoding is UTF-8 and you're running on Windows, although on 
Windows we want to always call bind_textdomain_codeset(). Or if the 
database encoding is SQL_ASCII, although in that case we don't want to 
call bind_textdomain_codeset() either.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] More message encoding woes

2009-04-07 Thread Peter Eisentraut
On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote:
 Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
 pg_get_encoding_from_locale(NULL) == encoding which is closer to
 what we actually want. The downside is that
 pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
 that we don't need to keep this in sync with the rules we have in CREATE
 DATABASE that enforce that locale matches encoding.

I would have figured we can skip this whole thing when LC_CTYPE != C, because 
it should be guaranteed that LC_CTYPE matches the database encoding in this 
case, no?

Other than that, I think this patch is good.




Re: [HACKERS] More message encoding woes

2009-04-07 Thread Hiroshi Inoue
Tom Lane wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Hiroshi Inoue wrote:
 What is wrong with checking if the codeset is valid using iconv_open()?
 
 That would probably work as well. We'd have to decide what we'd try to 
 convert from with iconv_open().
 
 The problem I have with that is that you are now guessing at *two*
 platform-specific encoding names not one, plus hoping there is a
 conversion between the two.

AFAIK iconv_open() supports all combinations of the valid encoding
values. Or we may be able to check it using the same encoding for
both from and to.

regards,
Hiroshi Inoue




Re: [HACKERS] More message encoding woes

2009-04-06 Thread Peter Eisentraut
On Monday 30 March 2009 15:52:37 Heikki Linnakangas wrote:
 In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
 which fixes that, but we only do it on Windows. In earlier versions we
 called it on all platforms, but only for UTF-8. It seems that we should
 call bind_textdomain_codeset on all platforms and all encodings.
 However, there seems to be a reason why we only do it for Windows on CVS
 HEAD: we need a mapping from our encoding ID to the OS codeset name, and
 the OS codeset names vary.

In practice you get either the GNU or the Solaris version of gettext, and at 
least the GNU version can cope with all the encoding names that the currently 
Windows-only code path produces.  So enabling the Windows code path for all 
platforms when ENABLE_NLS is on and LC_CTYPE is C would appear to work in 
sufficiently many cases.



Re: [HACKERS] More message encoding woes

2009-04-02 Thread Hiroshi Inoue

Heikki Linnakangas wrote:

Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:

Tom Lane wrote:

Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context.


Yep, something like that. There seems to be a magic empty string 
translation at the beginning of every po file that returns the 
meta-information about the translation, like translation author and 
date. Assuming that works reliably, I'll use that.


At first that sounded like an ideal answer, but I can see a gotcha:
suppose the translation's author's name contains some characters that
don't convert to the database encoding.  I suppose that would result in
failure, when we'd prefer it not to.  A single-purpose string could be
documented as "whatever you translate this to should be pure ASCII,
never mind if it's sensible".


I just tried that, and it seems that gettext() does transliteration, so 
any characters that have no counterpart in the database encoding will be 
replaced with something similar, or question marks. Assuming that's 
universal across platforms, and I think it is, using the empty string 
should work.


It also means that you can use lc_messages='ja' with 
server_encoding='latin1', but it will be unreadable because all the 
non-ascii characters are replaced with question marks. For something 
like lc_messages='es_ES' and server_encoding='koi8-r', it will still 
look quite nice.


Attached is a patch I've been testing. Seems to work quite well. It 
would be nice if someone could test it on Windows, which seems to be a 
bit special in this regard.


Unfortunately it doesn't seem to work on Windows.

First, any combination of valid lc_messages and a non-existent encoding
passes the test "strcmp(gettext(""), "") != 0".
Second, for example, the combination of ja (lc_messages) and ISO-8859-1
passes the test, but the test fails after I changed the last_translator
part of the ja message catalog to contain Japanese kanji characters.

regards,
Hiroshi Inoue



Re: [HACKERS] More message encoding woes

2009-04-02 Thread Peter Eisentraut
On Monday 30 March 2009 15:52:37 Heikki Linnakangas wrote:
 What is happening is that gettext() returns the message in the encoding
 determined by LC_CTYPE, while we expect it to return it in the database
 encoding. Starting with PG 8.3 we enforce that the encoding specified in
 LC_CTYPE matches the database encoding, but not for the C locale.

 In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
 which fixes that, but we only do it on Windows. In earlier versions we
 called it on all platforms, but only for UTF-8. It seems that we should
 call bind_textdomain_codeset on all platforms and all encodings.
 However, there seems to be a reason why we only do it for Windows on CVS
 HEAD: we need a mapping from our encoding ID to the OS codeset name, and
 the OS codeset names vary.

 How can we make this more robust?

Another approach might be to create a new configuration parameter that 
basically tells what encoding to call bind_textdomain_codeset() with, say 
server_encoding_for_gettext.  If that is not set, you just use server_encoding 
as is and hope that gettext() takes it (which it would in most cases, I 
guess).
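As a sketch, the proposed setting might look like this in postgresql.conf. Note that server_encoding_for_gettext is a hypothetical parameter from this suggestion, not an existing or ever-implemented GUC:

```
lc_messages = 'fi_FI'                        # language for server messages
server_encoding_for_gettext = 'ISO-8859-1'   # codeset name to pass to
                                             # bind_textdomain_codeset()
                                             # (hypothetical, proposed only)
```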



Re: [HACKERS] More message encoding woes

2009-04-02 Thread Hiroshi Inoue

Hiroshi Inoue wrote:

Heikki Linnakangas wrote:

Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:

Tom Lane wrote:

Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context.


Yep, something like that. There seems to be a magic empty string 
translation at the beginning of every po file that returns the 
meta-information about the translation, like translation author and 
date. Assuming that works reliably, I'll use that.


At first that sounded like an ideal answer, but I can see a gotcha:
suppose the translation's author's name contains some characters that
don't convert to the database encoding.  I suppose that would result in
failure, when we'd prefer it not to.  A single-purpose string could be
documented as "whatever you translate this to should be pure ASCII,
never mind if it's sensible".


I just tried that, and it seems that gettext() does transliteration, 
so any characters that have no counterpart in the database encoding 
will be replaced with something similar, or question marks. Assuming 
that's universal across platforms, and I think it is, using the empty 
string should work.


It also means that you can use lc_messages='ja' with 
server_encoding='latin1', but it will be unreadable because all the 
non-ascii characters are replaced with question marks. For something 
like lc_messages='es_ES' and server_encoding='koi8-r', it will still 
look quite nice.


Attached is a patch I've been testing. Seems to work quite well. It 
would be nice if someone could test it on Windows, which seems to be a 
bit special in this regard.


Unfortunately it doesn't seem to work on Windows.


Is it inappropriate to call iconv_open() to check if the codeset is
valid for bind_textdomain_codeset()?

regards,
Hiroshi Inoue




Re: [HACKERS] More message encoding woes

2009-04-01 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:

Tom Lane wrote:

Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context.


Yep, something like that. There seems to be a magic empty string 
translation at the beginning of every po file that returns the 
meta-information about the translation, like translation author and 
date. Assuming that works reliably, I'll use that.


At first that sounded like an ideal answer, but I can see a gotcha:
suppose the translation's author's name contains some characters that
don't convert to the database encoding.  I suppose that would result in
failure, when we'd prefer it not to.  A single-purpose string could be
documented as "whatever you translate this to should be pure ASCII,
never mind if it's sensible".


I just tried that, and it seems that gettext() does transliteration, so 
any characters that have no counterpart in the database encoding will be 
replaced with something similar, or question marks. Assuming that's 
universal across platforms, and I think it is, using the empty string 
should work.


It also means that you can use lc_messages='ja' with 
server_encoding='latin1', but it will be unreadable because all the 
non-ascii characters are replaced with question marks. For something 
like lc_messages='es_ES' and server_encoding='koi8-r', it will still 
look quite nice.


Attached is a patch I've been testing. Seems to work quite well. It 
would be nice if someone could test it on Windows, which seems to be a 
bit special in this regard.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 118a6fe..390a7cf 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -290,6 +290,7 @@ locale_messages_assign(const char *value, bool doit, GucSource source)
 		if (!pg_perm_setlocale(LC_MESSAGES, value))
 			if (source != PGC_S_DEFAULT)
 return NULL;
+		pg_init_gettext_codeset();
 	}
 #ifndef WIN32
 	else
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 03d86ca..47ebe1b 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1242,7 +1242,7 @@ pg_bindtextdomain(const char *domain)
 
 		get_locale_path(my_exec_path, locale_path);
 		bindtextdomain(domain, locale_path);
-		pg_bind_textdomain_codeset(domain, GetDatabaseEncoding());
+		pg_register_textdomain(domain);
 	}
 #endif
 }
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index bf66321..970cb83 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -842,46 +842,6 @@ cliplen(const char *str, int len, int limit)
 	return l;
 }
 
-#if defined(ENABLE_NLS) && defined(WIN32)
-static const struct codeset_map {
-	int	encoding;
-	const char *codeset;
-} codeset_map_array[] = {
-	{PG_UTF8, "UTF-8"},
-	{PG_LATIN1, "LATIN1"},
-	{PG_LATIN2, "LATIN2"},
-	{PG_LATIN3, "LATIN3"},
-	{PG_LATIN4, "LATIN4"},
-	{PG_ISO_8859_5, "ISO-8859-5"},
-	{PG_ISO_8859_6, "ISO_8859-6"},
-	{PG_ISO_8859_7, "ISO-8859-7"},
-	{PG_ISO_8859_8, "ISO-8859-8"},
-	{PG_LATIN5, "LATIN5"},
-	{PG_LATIN6, "LATIN6"},
-	{PG_LATIN7, "LATIN7"},
-	{PG_LATIN8, "LATIN8"},
-	{PG_LATIN9, "LATIN-9"},
-	{PG_LATIN10, "LATIN10"},
-	{PG_KOI8R, "KOI8-R"},
-	{PG_WIN1250, "CP1250"},
-	{PG_WIN1251, "CP1251"},
-	{PG_WIN1252, "CP1252"},
-	{PG_WIN1253, "CP1253"},
-	{PG_WIN1254, "CP1254"},
-	{PG_WIN1255, "CP1255"},
-	{PG_WIN1256, "CP1256"},
-	{PG_WIN1257, "CP1257"},
-	{PG_WIN1258, "CP1258"},
-	{PG_WIN866, "CP866"},
-	{PG_WIN874, "CP874"},
-	{PG_EUC_CN, "EUC-CN"},
-	{PG_EUC_JP, "EUC-JP"},
-	{PG_EUC_KR, "EUC-KR"},
-	{PG_EUC_TW, "EUC-TW"},
-	{PG_EUC_JIS_2004, "EUC-JP"}
-};
-#endif /* WIN32 */
-
 void
 SetDatabaseEncoding(int encoding)
 {
@@ -892,28 +852,132 @@ SetDatabaseEncoding(int encoding)
 	Assert(DatabaseEncoding-encoding == encoding);
 
 #ifdef ENABLE_NLS
-	pg_bind_textdomain_codeset(textdomain(NULL), encoding);
+	pg_init_gettext_codeset();
+	pg_register_textdomain(textdomain(NULL));
 #endif
 }
 
+static char **registered_textdomains = NULL;
+static const char *system_codeset = "invalid";
+
 /*
- * On Windows, we need to explicitly bind gettext to the correct
- * encoding, because gettext() tends to get confused.
+ * Register a gettext textdomain with the backend. We will call
+ * bind_textdomain_codeset() for it to ensure that translated strings
+ * are returned in the right encoding.
  */
 void
-pg_bind_textdomain_codeset(const char *domainname, int encoding)
+pg_register_textdomain(const char *domainname)
 {
-#if defined(ENABLE_NLS) && defined(WIN32)
+#if defined(ENABLE_NLS)
 	int i;
+	MemoryContext old_cxt;
+
+	old_cxt = MemoryContextSwitchTo(TopMemoryContext);
+	if (registered_textdomains == NULL)
+	{
+		registered_textdomains = palloc(sizeof(char *) * 1);
+		registered_textdomains[0] = NULL;
+	}
 
-	for (i = 0; i < lengthof(codeset_map_array); 

Re: [HACKERS] More message encoding woes

2009-04-01 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:

  One problem with this idea is that it may be hard to coerce gettext into
  putting a particular string at the top of the file :-(
 
 I doubt we can, which is why the documentation needs to tell translators
 about it.

I doubt that documenting the issue will be enough (in fact I'm pretty
sure it won't).  Maybe we can just supply the string translated in our
POT files, and add a comment that the translator is not supposed to
touch it.  This doesn't seem all that difficult -- I think it just
requires that we add a msgmerge step to make update-po that uses a
file on which the message has already been translated.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] More message encoding woes

2009-04-01 Thread Hiroshi Inoue

Heikki Linnakangas wrote:

Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:

Tom Lane wrote:

Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context.


Yep, something like that. There seems to be a magic empty string 
translation at the beginning of every po file that returns the 
meta-information about the translation, like translation author and 
date. Assuming that works reliably, I'll use that.


At first that sounded like an ideal answer, but I can see a gotcha:
suppose the translation's author's name contains some characters that
don't convert to the database encoding.  I suppose that would result in
failure, when we'd prefer it not to.  A single-purpose string could be
documented as "whatever you translate this to should be pure ASCII,
never mind if it's sensible".


I just tried that, and it seems that gettext() does transliteration, so 
any characters that have no counterpart in the database encoding will be 
replaced with something similar, or question marks.

 Assuming that's
universal across platforms, and I think it is, using the empty string 
should work.


It also means that you can use lc_messages='ja' with 
server_encoding='latin1', but it will be unreadable because all the 
non-ascii characters are replaced with question marks.


It doesn't occur in the current Windows environment. With the Windows
GNU gettext that we are using, we see the original msgid when
iconv can't convert the msgstr to the target codeset.

set client_encoding to utf_8;
SET
show server_encoding;
 server_encoding
-----------------
 LATIN1
(1 row)

show lc_messages;
lc_messages

 Japanese_Japan.932
 (1 row)

1;
ERROR:  syntax error at or near "1"
LINE 1: 1;
        ^

OTOH when the sever encoding is utf8 then

set client_encoding to utf_8;
SET
show server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)

show lc_messages;
    lc_messages
--------------------
 Japanese_Japan.932
(1 row)

1;
ERROR:  "1"またはその近辺で構文エラー
LINE 1: 1;
        ^

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] More message encoding woes

2009-04-01 Thread Tom Lane
Hiroshi Inoue in...@tpf.co.jp writes:
 Heikki Linnakangas wrote:
 I just tried that, and it seems that gettext() does transliteration, so 
 any characters that have no counterpart in the database encoding will be 
 replaced with something similar, or question marks.

 It doesn't occur in the current Windows environment. As for Windows
 gnu gettext which we are using, we would see the original msgid when
 iconv can't convert the msgstr to the target codeset.

Well, if iconv has no conversion to the codeset at all then there is no
point in selecting that particular codeset setting anyway.  The question
was about whether we can distinguish "no conversion available" from
"conversion available, but the test string has some unconvertible
characters".

regards, tom lane



Re: [HACKERS] More message encoding woes

2009-04-01 Thread Hiroshi Inoue

Tom Lane wrote:

Hiroshi Inoue in...@tpf.co.jp writes:

Heikki Linnakangas wrote:
I just tried that, and it seems that gettext() does transliteration, so 
any characters that have no counterpart in the database encoding will be 
replaced with something similar, or question marks.



It doesn't occur in the current Windows environment. As for Windows
gnu gettext which we are using, we would see the original msgid when
iconv can't convert the msgstr to the target codeset.


Well, if iconv has no conversion to the codeset at all then there is no
point in selecting that particular codeset setting anyway.  The question
was about whether we can distinguish "no conversion available" from
"conversion available, but the test string has some unconvertible
characters".


What I meant is that we would see no '?' when we use Windows GNU gettext.
Whether a conversion is available depends on the individual msgid.
For example, when the Japanese msgstr corresponding to a msgid happens to
contain no characters other than ASCII, Windows GNU gettext will use the
msgstr, not the original msgid.



Re: [HACKERS] More message encoding woes

2009-03-31 Thread Heikki Linnakangas

Heikki Linnakangas wrote:
One idea is to extract the encoding from LC_MESSAGES. Then call 
pg_get_encoding_from_locale() on that and check that it matches 
server_encoding. If it does, great, pass it to 
bind_textdomain_codeset(). If it doesn't, throw an error.


I tried to implement this but it gets complicated. First of all, we can 
only throw an error when lc_messages is set interactively. If it's set 
in postgresql.conf, it might be valid for some databases but not for 
others with different encoding. And that makes per-user lc_messages 
setting quite hard too.


Another complication is what to do if e.g. plpgsql or a 3rd party module 
have called pg_bindtextdomain, when lc_messages=C and we don't yet know 
the system name for the database encoding, and you later set 
lc_messages='fi_FI.iso8859-1', in a latin1 database. In order to 
retroactively set the codeset, we'd have to remember all the calls to 
pg_bindtextdomain. Not impossible, for sure, but more work.


I'm leaning towards the idea of trying out all the spellings of the
database encoding we have in encoding_match_list. That gives the best
user experience, as it "just works", and it doesn't seem that complicated.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] More message encoding woes

2009-03-31 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 I'm leaning towards the idea of trying out all the spellings of the
 database encoding we have in encoding_match_list. That gives the best
 user experience, as it "just works", and it doesn't seem that complicated.

How were you going to check --- use that idea of translating a string
that's known to have a translation?  OK, but you'd better document
somewhere where translators will read it "you must translate this string
first of all".  Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context.  (I can see "syntax error" being
problematic in some translations, since translators will know it is
always just a fragment of a larger message ...)

regards, tom lane



Re: [HACKERS] More message encoding woes

2009-03-31 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
I'm leaning towards the idea of trying out all the spellings of the
database encoding we have in encoding_match_list. That gives the best
user experience, as it "just works", and it doesn't seem that complicated.


How were you going to check --- use that idea of translating a string
that's known to have a translation?  OK, but you'd better document
somewhere where translators will read it "you must translate this string
first of all".  Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context.


Yep, something like that. There seems to be a magic empty string 
translation at the beginning of every po file that returns the 
meta-information about the translation, like translation author and 
date. Assuming that works reliably, I'll use that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] More message encoding woes

2009-03-31 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Tom Lane wrote:
 Maybe use a special string "Translate Me First" that
 doesn't actually need to be end-user-visible, just so no one sweats over
 getting it right in context.

 Yep, something like that. There seems to be a magic empty string 
 translation at the beginning of every po file that returns the 
 meta-information about the translation, like translation author and 
 date. Assuming that works reliably, I'll use that.

At first that sounded like an ideal answer, but I can see a gotcha:
suppose the translation author's name contains some characters that
don't convert to the database encoding.  I suppose that would result in
failure, when we'd prefer it not to.  A single-purpose string could be
documented as "whatever you translate this to should be pure ASCII,
never mind if it's sensible".

regards, tom lane



Re: [HACKERS] More message encoding woes

2009-03-31 Thread Alvaro Herrera
Tom Lane wrote:

 At first that sounded like an ideal answer, but I can see a gotcha:
 suppose the translation author's name contains some characters that
 don't convert to the database encoding.  I suppose that would result in
 failure, when we'd prefer it not to.  A single-purpose string could be
 documented as "whatever you translate this to should be pure ASCII,
 never mind if it's sensible".

One problem with this idea is that it may be hard to coerce gettext into
putting a particular string at the top of the file :-(

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] More message encoding woes

2009-03-31 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Tom Lane wrote:
 At first that sounded like an ideal answer, but I can see a gotcha:
 suppose the translation author's name contains some characters that
 don't convert to the database encoding.  I suppose that would result in
 failure, when we'd prefer it not to.  A single-purpose string could be
 documented as "whatever you translate this to should be pure ASCII,
 never mind if it's sensible".

 One problem with this idea is that it may be hard to coerce gettext into
 putting a particular string at the top of the file :-(

I doubt we can, which is why the documentation needs to tell translators
about it.

regards, tom lane



Re: [HACKERS] More message encoding woes

2009-03-31 Thread Peter Eisentraut
On Monday 30 March 2009 21:04:00 Tom Lane wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
  Tom Lane wrote:
  Could we get away with just unconditionally calling
  bind_textdomain_codeset with *our* canonical spelling of the encoding
  name?  If it works, great, and if it doesn't, you get English.
 
  Yeah, that's better than nothing.

 A quick look at the output of iconv --list on Fedora 10 and OSX 10.5.6
 says that it would not work quite well enough.  The encoding names are
 similar but not identical --- in particular I notice a lot of
 discrepancies about dash versus underscore vs no separator at all.

I seem to recall that the encoding names are normalized by the C library 
somewhere, but I can't find the documentation now.  It might be worth trying 
anyway -- the above might not in fact be a problem.




Re: [HACKERS] More message encoding woes

2009-03-31 Thread Peter Eisentraut
On Monday 30 March 2009 20:06:48 Heikki Linnakangas wrote:
 Tom Lane wrote:
  Where does it get the default codeset from?  Maybe we could constrain
  that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

 LC_CTYPE. In 8.3 and up where we constrain that to match the database
 encoding, we only have a problem with the C locale.

Why don't we apply the same restriction to the C locale then?



Re: [HACKERS] More message encoding woes

2009-03-31 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On Monday 30 March 2009 20:06:48 Heikki Linnakangas wrote:
 LC_CTYPE. In 8.3 and up where we constrain that to match the database
 encoding, we only have a problem with the C locale.

 Why don't we apply the same restriction to the C locale then?

(1) what would you constrain it to?

(2) historically we've allowed C locale to be used with any encoding,
and there are a *lot* of users depending on that (particularly in the
Far East, I gather).

regards, tom lane



[HACKERS] More message encoding woes

2009-03-30 Thread Heikki Linnakangas

latin1db=# SELECT version();
                                version
------------------------------------------------------------------------
 PostgreSQL 8.3.7 on i686-pc-linux-gnu, compiled by GCC gcc (Debian 4.3.3-5) 4.3.3
(1 row)

latin1db=# SELECT name, setting FROM pg_settings where name like 'lc%' 
OR name like '%encoding';

      name       | setting
-----------------+---------
 client_encoding | utf8
 lc_collate  | C
 lc_ctype| C
 lc_messages | es_ES
 lc_monetary | C
 lc_numeric  | C
 lc_time | C
 server_encoding | LATIN1
(8 rows)

latin1db=# SELECT * FROM foo;
ERROR:  no existe la relación «foo»

The accented characters are garbled. When I try the same with a database 
that's in UTF8 in the same cluster, it works:


utf8db=# SELECT name, setting FROM pg_settings where name like 'lc%' OR 
name like '%encoding';

      name       | setting
-----------------+---------
 client_encoding | UTF8
 lc_collate  | C
 lc_ctype| C
 lc_messages | es_ES
 lc_monetary | C
 lc_numeric  | C
 lc_time | C
 server_encoding | UTF8
(8 rows)

utf8db=# SELECT * FROM foo;
ERROR:  no existe la relación «foo»

What is happening is that gettext() returns the message in the encoding 
determined by LC_CTYPE, while we expect it to return it in the database 
encoding. Starting with PG 8.3 we enforce that the encoding specified in 
LC_CTYPE matches the database encoding, but not for the C locale.


In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding() 
which fixes that, but we only do it on Windows. In earlier versions we 
called it on all platforms, but only for UTF-8. It seems that we should 
call bind_textdomain_codeset on all platforms and all encodings. 
However, there seems to be a reason why we only do it for Windows on CVS 
HEAD: we need a mapping from our encoding ID to the OS codeset name, and 
the OS codeset names vary.


How can we make this more robust?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] More message encoding woes

2009-03-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding() 
 which fixes that, but we only do it on Windows. In earlier versions we 
 called it on all platforms, but only for UTF-8. It seems that we should 
 call bind_textdomain_codeset on all platforms and all encodings. 

Yes, this problem has been recognized for some time.

 However, there seems to be a reason why we only do it for Windows on CVS 
 HEAD: we need a mapping from our encoding ID to the OS codeset name, and 
 the OS codeset names vary.

 How can we make this more robust?

One possibility is to assume that the output of nl_langinfo(CODESET)
will be recognized by bind_textdomain_codeset().  Whether that actually
works can only be determined by experiment.

Another idea is to try the values listed in our encoding_match_list[]
until bind_textdomain_codeset succeeds.  The problem here is that the
GNU documentation is *exceedingly* vague about whether
bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
when given a bad encoding name.  (I guess we could look at the source
code.)

regards, tom lane



Re: [HACKERS] More message encoding woes

2009-03-30 Thread Heikki Linnakangas

Tom Lane wrote:

Another idea is to try the values listed in our encoding_match_list[]
until bind_textdomain_codeset succeeds.  The problem here is that the
GNU documentation is *exceedingly* vague about whether
bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
when given a bad encoding name.  (I guess we could look at the source
code.)


Unfortunately it doesn't give any error. The value passed to it is just
stored, and isn't used until gettext(). Quick testing shows that if you
give an invalid encoding name, gettext will simply refrain from
translating anything and revert to English.

We could exploit that to determine if the codeset name we gave
bind_textdomain_codeset was valid: pick a string that is known to be
translated in all translations, like "syntax error", and see if
gettext("syntax error") returns the original string. Something along the
lines of:

const char *teststring = "syntax error";
encoding_match *m;

for (m = encoding_match_list; m->system_enc_name != NULL; m++)
{
  if (m->pg_enc_code != GetDatabaseEncoding())
    continue;
  bind_textdomain_codeset("postgres", m->system_enc_name);
  if (gettext(teststring) != teststring)
    break; /* found: gettext no longer returns the msgid pointer */
}


This feels rather hacky, but if we only do that with the problematic
combination of LC_CTYPE=C and LC_MESSAGES other than C, I think it would
be ok. The current behavior is highly unlikely to give correct results,
so I don't think we can do much worse than that.


Another possibility is to just refrain from translating anything if 
LC_CTYPE=C. If the above loop fails to find anything that works, that's 
what we should fall back to IMHO.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] More message encoding woes

2009-03-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Tom Lane wrote:
 Another idea is to try the values listed in our encoding_match_list[]
 until bind_textdomain_codeset succeeds.  The problem here is that the
 GNU documentation is *exceedingly* vague about whether
 bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
 when given a bad encoding name.  (I guess we could look at the source
 code.)

 Unfortunately it doesn't give any error.

(Man, why are the APIs in this problem space so universally awful?)

Where does it get the default codeset from?  Maybe we could constrain
that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

regards, tom lane



Re: [HACKERS] More message encoding woes

2009-03-30 Thread Heikki Linnakangas

Tom Lane wrote:

Where does it get the default codeset from?  Maybe we could constrain
that to match the database encoding, the way we do for LC_COLLATE/CTYPE?


LC_CTYPE. In 8.3 and up where we constrain that to match the database 
encoding, we only have a problem with the C locale.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] More message encoding woes

2009-03-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Tom Lane wrote:
 Where does it get the default codeset from?  Maybe we could constrain
 that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

 LC_CTYPE. In 8.3 and up where we constrain that to match the database 
 encoding, we only have a problem with the C locale.

... and even if we wanted to fiddle with it, that just moves the problem
over to finding an LC_CTYPE value that matches the database encoding
:-(.

Yup, it's a mess.  We'd have done this long ago if it were easy.

Could we get away with just unconditionally calling
bind_textdomain_codeset with *our* canonical spelling of the encoding
name?  If it works, great, and if it doesn't, you get English.

regards, tom lane



Re: [HACKERS] More message encoding woes

2009-03-30 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:

Tom Lane wrote:

Where does it get the default codeset from?  Maybe we could constrain
that to match the database encoding, the way we do for LC_COLLATE/CTYPE?


LC_CTYPE. In 8.3 and up where we constrain that to match the database 
encoding, we only have a problem with the C locale.


... and even if we wanted to fiddle with it, that just moves the problem
over to finding an LC_CTYPE value that matches the database encoding
:-(.

Yup, it's a mess.  We'd have done this long ago if it were easy.

Could we get away with just unconditionally calling
bind_textdomain_codeset with *our* canonical spelling of the encoding
name?  If it works, great, and if it doesn't, you get English.


Yeah, that's better than nothing.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] More message encoding woes

2009-03-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Tom Lane wrote:
 Could we get away with just unconditionally calling
 bind_textdomain_codeset with *our* canonical spelling of the encoding
 name?  If it works, great, and if it doesn't, you get English.

 Yeah, that's better than nothing.

A quick look at the output of iconv --list on Fedora 10 and OSX 10.5.6
says that it would not work quite well enough.  The encoding names are
similar but not identical --- in particular I notice a lot of
discrepancies about dash versus underscore vs no separator at all.

What we need is an API equivalent to iconv --list, but I'm not seeing
one :-(.  Do we need to go so far as to try to run that program?
Its output format is poorly standardized, among other problems ...

regards, tom lane



Re: [HACKERS] More message encoding woes

2009-03-30 Thread Heikki Linnakangas

Tom Lane wrote:

What we need is an API equivalent to iconv --list, but I'm not seeing
one :-(.


There's also locale -m. Looking at the implementation of that, it just 
lists what's in /usr/share/i18n/charmaps. Not too portable either..



 Do we need to go so far as to try to run that program?
Its output format is poorly standardized, among other problems ...


And doing that at every backend startup is too slow.

I would be happy to just revert to English if the OS doesn't recognize 
the name we use for the encoding. What sucks about that most is that the 
user has no way to specify the right encoding name even if he knows it. 
I don't think we want to introduce a new GUC for that.


One idea is to extract the encoding from LC_MESSAGES. Then call 
pg_get_encoding_from_locale() on that and check that it matches 
server_encoding. If it does, great, pass it to 
bind_textdomain_codeset(). If it doesn't, throw an error.


It stretches the conventional meaning of LC_MESSAGES/LC_CTYPE a bit, since
LC_CTYPE usually specifies the codeset to use, but I think it's quite
intuitive.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] More message encoding woes

2009-03-30 Thread Zdenek Kotala

Tom Lane wrote on Mon, 30 Mar 2009 at 14:04 -0400:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
  Tom Lane wrote:
  Could we get away with just unconditionally calling
  bind_textdomain_codeset with *our* canonical spelling of the encoding
  name?  If it works, great, and if it doesn't, you get English.
 
  Yeah, that's better than nothing.
 
 A quick look at the output of iconv --list on Fedora 10 and OSX 10.5.6
 says that it would not work quite well enough.  The encoding names are
 similar but not identical --- in particular I notice a lot of
 discrepancies about dash versus underscore vs no separator at all.

The same problem exists with collations when you try to restore a database
on a different OS. :(

Zdenek 

