Re: [HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII

2013-09-10 Thread Peter Eisentraut
On 9/9/13 9:54 PM, Noah Misch wrote:
 On Mon, Sep 09, 2013 at 05:49:38PM -0400, Peter Eisentraut wrote:
  On 9/9/13 2:57 PM, Noah Misch wrote:
   Actually, GNU libiconv's iconv() decides that //translit is unimplementable
   for some of the characters in that file, and it fails the conversion.  GNU
   libc's iconv(), on the other hand, emits the question marks.
  
  That can't be right, because the examples I produced earlier (which
  produced question marks) were produced on OS X with GNU libiconv.
 Hmm.  I get the good behavior (decline to transliterate Japanese) with these
 iconv --version strings:

I might have messed up my testing.  You are probably right.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII

2013-09-09 Thread Noah Misch
On Mon, Sep 09, 2013 at 08:29:58AM -0400, Peter Eisentraut wrote:
 On 9/6/13 10:37 AM, Tom Lane wrote:
  BTW: personally, I would say that what you're looking at is a glibc bug.
  I always thought the contract of gettext was to return the ASCII version
  if it fails to produce a translated version.  That might not be what the
  end user really wants to see, but surely returning something like ???
  is completely useless to anybody.
 
 The question marks come from iconv.  Take a look at what this prints:
 
 iconv po/ja.po -f utf-8 -t us-ascii//translit
 
 If you use GNU libiconv, this will print a bunch of question marks.

Actually, GNU libiconv's iconv() decides that //translit is unimplementable
for some of the characters in that file, and it fails the conversion.  GNU
libc's iconv(), on the other hand, emits the question marks.

 I think the use of //translit by gettext is poor judgement, because my
 experiments show that the quality of the results is poor and not useful
 for a user interface.

It depends on the quality of the //translit implementation.  GNU libiconv's
seems pretty good.  It gives up for Japanese or Russian characters, so you get
the English messages.  For Polish, GNU libiconv transliterates like this:

msgstr "nie można usunąć pliku lub katalogu \"%s\": %s\n"
msgstr "nie mozna usuna'c pliku lub katalogu \"%s\": %s\n"

That's fair, considering what it has to work with.  Ideally, (a) GNU libc
should import the smarter transliteration code from GNU libiconv, and (b) GNU
gettext should check for weak //translit implementations and not use
//translit under such circumstances.

 My suggestion in this matter is to disable gettext processing when
 LC_CTYPE is set to C.  We could log a warning when LC_MESSAGES is set to
 something and LC_CTYPE is set to C.  Or just do the warning and keep
 logging.  Something like that.

In an ENCODING=UTF8, LC_CTYPE=C database, no transliteration should need to
happen, and no transliteration does happen for the PG messages.  I think
MauMau's original bind_textdomain_codeset() proposal was on the right track.
We would need to do that for every relevant 3rd-party message domain, though.
Ick.  This suggests to me that gettext really needs an API for overriding the
default codeset pertaining to message domains not subjected to
bind_textdomain_codeset().  In the meantime, adding bind_textdomain_codeset()
calls for known localized dependencies seems like a fine coping mechanism.
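As a rough illustration of that coping mechanism, here is a minimal sketch.  It
assumes a hypothetical helper name, bind_dependency_codesets(), and a
hypothetical domain list in which only "libc" comes from this thread.  It is
not the code from strerror_codeset.patch.

```c
#include <assert.h>
#include <libintl.h>
#include <stddef.h>

/*
 * Hypothetical list of message domains owned by libraries the server links
 * against.  Only "libc" is named in this thread; real code would have to
 * enumerate every localized dependency.
 */
static const char *const dependency_domains[] = { "libc" };

/*
 * Ask gettext to return messages for each listed domain in `codeset`
 * (for example "UTF-8") instead of the codeset implied by LC_CTYPE.
 * Returns the number of domains for which the call succeeded.
 */
size_t
bind_dependency_codesets(const char *codeset)
{
    size_t n = sizeof(dependency_domains) / sizeof(dependency_domains[0]);
    size_t ok = 0;

    assert(codeset != NULL);

    for (size_t i = 0; i < n; i++)
    {
        /* bind_textdomain_codeset() returns NULL on failure (e.g. ENOMEM). */
        if (bind_textdomain_codeset(dependency_domains[i], codeset) != NULL)
            ok++;
    }
    return ok;
}
```

A caller would presumably invoke this once the database encoding is known,
passing the gettext name of that encoding.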

If we can reasonably detect when gettext is supplying useless ? messages,
that's good, too.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII

2013-09-09 Thread Peter Eisentraut
On 9/9/13 2:57 PM, Noah Misch wrote:
 Actually, GNU libiconv's iconv() decides that //translit is unimplementable
 for some of the characters in that file, and it fails the conversion.  GNU
 libc's iconv(), on the other hand, emits the question marks.

That can't be right, because the examples I produced earlier (which
produced question marks) were produced on OS X with GNU libiconv.





Re: [HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII

2013-09-09 Thread Noah Misch
On Mon, Sep 09, 2013 at 05:49:38PM -0400, Peter Eisentraut wrote:
 On 9/9/13 2:57 PM, Noah Misch wrote:
  Actually, GNU libiconv's iconv() decides that //translit is unimplementable
  for some of the characters in that file, and it fails the conversion.  GNU
  libc's iconv(), on the other hand, emits the question marks.
 
 That can't be right, because the examples I produced earlier (which
 produced question marks) were produced on OS X with GNU libiconv.

Hmm.  I get the good behavior (decline to transliterate Japanese) with these
iconv --version strings:

iconv (GNU libiconv 1.11) [/usr/bin/iconv on Mac OS X 10.7]
iconv (GNU libiconv 1.14) [recently-updated fink]
iconv (GNU libiconv 1.14) [recently-updated Cygwin]

I also saw that on OpenBSD and NetBSD, though I'm not in an immediate position
to check the libiconv versions there.  I get the bad behavior (question
marks) on these:

iconv (GNU libc) 2.12 [CentOS 6.4]
iconv (GNU libc) 2.3.4 [CentOS 4.4]
iconv (Ubuntu EGLIBC 2.15-0ubuntu10.4) 2.15 [Ubuntu 12.04]
iconv (GNU libc) 2.5 [Ubuntu 7.04]

That sure looked like GNU libc vs. GNU libiconv, but I guess I'm missing some
other factor.  What is your GNU libiconv version that emits question marks?
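For anyone who wants to compare implementations without going through the
iconv command, a small probe along these lines may help.  This is only a
sketch: translit_to_ascii() and count_char() are hypothetical names, and what
happens for non-ASCII input depends on which iconv() the program is linked
against (a failed conversion, a transliteration, or question marks).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert a UTF-8 string to US-ASCII with //TRANSLIT.  Returns 0 and fills
 * `out` (NUL-terminated) on success, or -1 if the implementation refuses the
 * conversion or `out` is too small.
 */
int
translit_to_ascii(const char *utf8, char *out, size_t outlen)
{
    iconv_t cd;
    char   *in = (char *) utf8;     /* iconv() does not modify the input */
    size_t  inleft;
    char   *outp = out;
    size_t  outleft;
    size_t  rc;

    assert(utf8 != NULL && out != NULL);
    if (outlen == 0)
        return -1;
    inleft = strlen(utf8);
    outleft = outlen - 1;           /* reserve room for the terminator */

    cd = iconv_open("US-ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;

    rc = iconv(cd, &in, &inleft, &outp, &outleft);
    if (rc != (size_t) -1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);    /* flush any state */
    iconv_close(cd);

    if (rc == (size_t) -1)
        return -1;

    *outp = '\0';
    return 0;
}

/* Count occurrences of `c` in `s`, e.g. '?' in a converted message. */
size_t
count_char(const char *s, char c)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (*s == c)
            n++;
    return n;
}
```

Comparing count_char(output, '?') with count_char(input, '?') shows whether
the conversion introduced question marks, and a -1 return corresponds to the
"decline to transliterate" behavior.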

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




[HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII

2013-09-09 Thread Noah Misch
On Tue, Sep 10, 2013 at 05:42:06AM +0900, MauMau wrote:
 From: Tom Lane t...@sss.pgh.pa.us
 Noah Misch n...@leadboat.com writes:
 ... I think
 MauMau's original bind_textdomain_codeset() proposal was on the right track.

 It might well be.  My objection was to the proposal for back-patching it
 when we have little idea of the possible side-effects.

Agreed.

 We are using 9.1/9.2 and 9.2 is probably dominant, so I would be relieved
 with either of the following choices:

 1. Take the approach that doesn't use bind_textdomain_codeset(libc)
 (i.e. the second version of errno_str.patch) for 9.4 and older releases.

 2. Use bind_textdomain_codeset(libc) (i.e. take strerror_codeset.patch)
 for 9.4, and take the non-bind_textdomain_codeset approach for older
 releases.

I like (2), at least at a high level.  The concept of errno_str.patch is safe
enough to back-patch.  One can verify that it only changes behavior when
strerror() returns NULL, an empty string, or something that begins with '?'.
I can't see resenting the change when that has happened.
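To make that condition concrete, here is a minimal sketch of the check.  The
function names and the fallback wording are assumptions for illustration, not
taken from errno_str.patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * True if a strerror() result is unusable in the sense described above:
 * NULL, an empty string, or something that begins with '?'.
 */
int
strerror_result_is_useless(const char *msg)
{
    return msg == NULL || msg[0] == '\0' || msg[0] == '?';
}

/*
 * Return strerror(errnum) when it looks usable; otherwise format a
 * locale-independent fallback into `buf`.  The fallback text here is an
 * assumption, not necessarily what the patch prints.
 */
const char *
strerror_or_fallback(int errnum, char *buf, size_t buflen)
{
    const char *msg;

    assert(buf != NULL && buflen > 0);

    msg = strerror(errnum);
    if (!strerror_result_is_useless(msg))
        return msg;

    snprintf(buf, buflen, "operating system error %d", errnum);
    return buf;
}
```

Behavior is unchanged whenever strerror() already returns a readable message,
which is the property that makes this approach a low-risk back-patch.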

Note that you can work around the problem today by linking PostgreSQL with a
better iconv() implementation.

Question-mark-damaged messages are not limited to strerror().  A combination
like lc_messages=ja_JP, encoding=LATIN1, lc_ctype=en_US will produce question
marks for PG and libc messages even with the bind_textdomain_codeset(libc)
change.  Is it worth doing anything about that?  That one looks self-inflicted
in comparison to the lc_messages=ja_JP, encoding=UTF8, lc_ctype=C case.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com

