Re: [HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
On 9/9/13 9:54 PM, Noah Misch wrote:
> On Mon, Sep 09, 2013 at 05:49:38PM -0400, Peter Eisentraut wrote:
>> On 9/9/13 2:57 PM, Noah Misch wrote:
>>> Actually, GNU libiconv's iconv() decides that //translit is unimplementable for some of the characters in that file, and it fails the conversion. GNU libc's iconv(), on the other hand, emits the question marks.
>>
>> That can't be right, because the examples I produced earlier (which produced question marks) were produced on OS X with GNU libiconv.
>
> Hmm. I get the good behavior (decline to transliterate Japanese) with these iconv --version strings:

I might have messed up my testing. You are probably right.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
On Mon, Sep 09, 2013 at 08:29:58AM -0400, Peter Eisentraut wrote:
> On 9/6/13 10:37 AM, Tom Lane wrote:
>> BTW: personally, I would say that what you're looking at is a glibc bug. I always thought the contract of gettext was to return the ASCII version if it fails to produce a translated version. That might not be what the end user really wants to see, but surely returning something like ??? is completely useless to anybody.
>
> The question marks come from iconv. Take a look at what this prints:
>
>     iconv po/ja.po -f utf-8 -t us-ascii//translit
>
> If you use GNU libiconv, this will print a bunch of question marks.

Actually, GNU libiconv's iconv() decides that //translit is unimplementable for some of the characters in that file, and it fails the conversion. GNU libc's iconv(), on the other hand, emits the question marks.

> I think the use of //translit by gettext is poor judgement, because my experiments show that the quality of the results is poor and not useful for a user interface.

It depends on the quality of the //translit implementation. GNU libiconv's seems pretty good. It gives up for Japanese or Russian characters, so you get the English messages. For Polish, GNU libiconv transliterates like this:

    msgstr "nie można usunąć pliku lub katalogu \"%s\": %s\n"
    msgstr "nie mozna usuna'c pliku lub katalogu \"%s\": %s\n"

That's fair, considering what it has to work with. Ideally, (a) GNU libc should import the smarter transliteration code from GNU libiconv, and (b) GNU gettext should check for weak //translit implementations and not use //translit under such circumstances.

> My suggestion in this matter is to disable gettext processing when LC_CTYPE is set to C. We could log a warning when LC_MESSAGES is set to something and LC_CTYPE is set to C. Or just do the warning and keep logging. Something like that.

In an ENCODING=UTF8, LC_CTYPE=C database, no transliteration should need to happen, and no transliteration does happen for the PG messages.

I think MauMau's original bind_textdomain_codeset() proposal was on the right track. We would need to do that for every relevant 3rd-party message domain, though. Ick. This suggests to me that gettext really needs an API for overriding the default codeset pertaining to message domains not subjected to bind_textdomain_codeset(). In the meantime, adding bind_textdomain_codeset() calls for known localized dependencies seems like a fine coping mechanism. If we can reasonably detect when gettext is supplying useless ? messages, that's good, too.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
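For readers following along, the "coping mechanism" of calling bind_textdomain_codeset() for known localized dependencies might look roughly like the sketch below. This is an illustration only, not code from any of the patches in this thread. The helper name bind_known_domains() and the contents of the domain list are assumptions; "libc" is the only domain named in the discussion, and the codeset string would in practice be derived from the database encoding.

```c
#include <assert.h>
#include <libintl.h>
#include <stddef.h>

/*
 * Hypothetical list of third-party message domains. Only "libc" is
 * mentioned in the thread; anything else would have to be added by hand.
 */
static const char *const known_domains[] = { "libc" };

/*
 * Ask gettext to deliver messages from each known domain in the given
 * codeset (for example "UTF-8"). Returns how many domains were bound.
 */
int bind_known_domains(const char *codeset)
{
    assert(codeset != NULL);

    int bound = 0;
    size_t ndomains = sizeof(known_domains) / sizeof(known_domains[0]);

    for (size_t i = 0; i < ndomains; i++)
    {
        /* bind_textdomain_codeset() returns NULL when the binding fails. */
        if (bind_textdomain_codeset(known_domains[i], codeset) != NULL)
            bound++;
    }
    return bound;
}
```

The hand-maintained list is exactly the weakness described above: any localized library that is not listed still gets the default codeset.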
Re: [HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
On 9/9/13 2:57 PM, Noah Misch wrote:
> Actually, GNU libiconv's iconv() decides that //translit is unimplementable for some of the characters in that file, and it fails the conversion. GNU libc's iconv(), on the other hand, emits the question marks.

That can't be right, because the examples I produced earlier (which produced question marks) were produced on OS X with GNU libiconv.
Re: [HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
On Mon, Sep 09, 2013 at 05:49:38PM -0400, Peter Eisentraut wrote:
> On 9/9/13 2:57 PM, Noah Misch wrote:
>> Actually, GNU libiconv's iconv() decides that //translit is unimplementable for some of the characters in that file, and it fails the conversion. GNU libc's iconv(), on the other hand, emits the question marks.
>
> That can't be right, because the examples I produced earlier (which produced question marks) were produced on OS X with GNU libiconv.

Hmm. I get the good behavior (decline to transliterate Japanese) with these iconv --version strings:

    iconv (GNU libiconv 1.11) [/usr/bin/iconv on Mac OS X 10.7]
    iconv (GNU libiconv 1.14) [recently-updated fink]
    iconv (GNU libiconv 1.14) [recently-updated Cygwin]

I also saw that on OpenBSD and NetBSD, though I'm not in an immediate position to check the libiconv versions there. I get the bad behavior (question marks) on these:

    iconv (GNU libc) 2.12 [CentOS 6.4]
    iconv (GNU libc) 2.3.4 [CentOS 4.4]
    iconv (Ubuntu EGLIBC 2.15-0ubuntu10.4) 2.15 [Ubuntu 12.04]
    iconv (GNU libc) 2.5 [Ubuntu 7.04]

That sure looked like GNU libc vs. GNU libiconv, but I guess I'm missing some other factor. What is your GNU libiconv version that emits question marks?

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
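To check which of the two behaviors a given iconv() implementation shows without going through the command-line tool, a small probe along the following lines could be used. It is a sketch under assumptions: the names probe_result and probe_translit() are made up for illustration, and it treats an EILSEQ failure as "declined to transliterate" and any newly introduced '?' as the question-mark behavior.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

typedef enum {
    PROBE_CLEAN,           /* converted without introducing any '?' */
    PROBE_QUESTION_MARKS,  /* converted, but '?' was substituted */
    PROBE_REFUSED,         /* iconv() failed with EILSEQ */
    PROBE_ERROR            /* converter unavailable or other failure */
} probe_result;

/* Count occurrences of c in the first len bytes of s. */
static size_t count_char(const char *s, size_t len, char c)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] == c)
            n++;
    return n;
}

/* Convert a UTF-8 string to US-ASCII//TRANSLIT and classify the outcome. */
probe_result probe_translit(const char *utf8, char *out, size_t outsize)
{
    assert(utf8 != NULL && out != NULL && outsize > 0);

    iconv_t cd = iconv_open("US-ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1)
        return PROBE_ERROR;

    char *in = (char *) utf8;       /* iconv() does not modify the input bytes */
    size_t inleft = strlen(utf8);
    char *dst = out;
    size_t outleft = outsize - 1;   /* keep room for the terminating NUL */

    size_t rc = iconv(cd, &in, &inleft, &dst, &outleft);
    int saved_errno = errno;
    iconv_close(cd);
    *dst = '\0';

    if (rc == (size_t) -1)
        return saved_errno == EILSEQ ? PROBE_REFUSED : PROBE_ERROR;

    /* More '?' in the output than in the input means substitution happened. */
    size_t outlen = (size_t) (dst - out);
    if (count_char(out, outlen, '?') > count_char(utf8, strlen(utf8), '?'))
        return PROBE_QUESTION_MARKS;
    return PROBE_CLEAN;
}
```

Calling probe_translit() with a Japanese string such as "\xE6\x97\xA5\xE6\x9C\xAC" would be expected to yield PROBE_REFUSED on the implementations listed under "good behavior" and PROBE_QUESTION_MARKS on those listed under "bad behavior", if the observations above hold.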
[HACKERS] Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
On Tue, Sep 10, 2013 at 05:42:06AM +0900, MauMau wrote:
> From: Tom Lane t...@sss.pgh.pa.us
>> Noah Misch n...@leadboat.com writes:
>>> ... I think MauMau's original bind_textdomain_codeset() proposal was on the right track.
>>
>> It might well be. My objection was to the proposal for back-patching it when we have little idea of the possible side-effects.
>
> Agreed. We are using 9.1/9.2 and 9.2 is probably dominant, so I would be relieved with either of the following choices:
>
> 1. Take the approach that doesn't use bind_textdomain_codeset(libc) (i.e. the second version of errno_str.patch) for 9.4 and older releases.
>
> 2. Use bind_textdomain_codeset(libc) (i.e. take strerror_codeset.patch) for 9.4, and take the non-bind_textdomain_codeset approach for older releases.

I like (2), at least at a high level. The concept of errno_str.patch is safe enough to back-patch. One can verify that it only changes behavior when strerror() returns NULL, an empty string, or something that begins with '?'. I can't see resenting the change when that has happened. Note that you can work around the problem today by linking PostgreSQL with a better iconv() implementation.

Question-mark-damaged messages are not limited to strerror(). A combination like lc_messages=ja_JP, encoding=LATIN1, lc_ctype=en_US will produce question marks for PG and libc messages even with the bind_textdomain_codeset(libc) change. Is it worth doing anything about that? That one looks self-inflicted in comparison to the lc_messages=ja_JP, encoding=UTF8, lc_ctype=C case.

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
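The condition described above (change behavior only when strerror() returns NULL, an empty string, or something beginning with '?') can be sketched as follows. This is not the text of errno_str.patch, which is not quoted in the thread. The function names and the fallback wording are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * True when a strerror() result is unusable by the three criteria named
 * in the message: NULL, empty, or starting with a question mark.
 */
int strerror_is_unusable(const char *msg)
{
    return msg == NULL || msg[0] == '\0' || msg[0] == '?';
}

/*
 * Return strerror(errnum) when it is usable. Otherwise write a fallback
 * into buf and return that. The fallback text here is hypothetical.
 */
const char *usable_strerror(int errnum, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    const char *msg = strerror(errnum);
    if (!strerror_is_unusable(msg))
        return msg;

    snprintf(buf, buflen, "operating system error %d", errnum);
    return buf;
}
```

Because the normal strerror() result is passed through untouched whenever it looks sane, the only callers that see a difference are those that would otherwise have received nothing useful.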