[HACKERS] Re: [bug fix] multibyte messages are displayed incorrectly on the client

2014-06-23 Thread Heikki Linnakangas

On 04/05/2014 07:56 AM, Tom Lane wrote:

 MauMau maumau...@gmail.com writes:

  Then, as a happy medium, how about disabling message localization only if
  the client encoding differs from the server one?  That is, compare the
  client_encoding value in the startup packet with the result of
  GetPlatformEncoding().  If they don't match, call
  disable_message_localization().

 AFAICT this is not what was agreed to in this thread.  It puts far too
 much credence in the server-side default for client_encoding, which up to
 now has never been thought to be very interesting; indeed I doubt most
 people bother to set it at all.  The reason that this issue is even on
 the table is that that default is too likely to be wrong, no?

 Also, whatever possessed you to use pg_get_encoding_from_locale to
 identify the server's encoding?  That's expensive and seems fairly
 unlikely to yield the right answer.  I don't remember offhand where we
 keep the postmaster's idea of what encoding messages should be in, but I'm
 fairly sure it's stored explicitly somewhere.  Or if it isn't, we can for
 sure do better than recalculating it during every connection attempt.

 Having said all that, though, I'm unconvinced that this cure isn't worse
 than the disease.  Somebody claimed upthread that no very interesting
 messages would be delocalized by a change like this, but that's complete
 nonsense: in particular, *every* message associated with client
 authentication will be sent in English if we go down this path.  Given
 the nearly complete lack of complaints in the many years that this code
 has worked like this, I'm betting that most people will find a change
 like this to be a net reduction in friendliness.

 Given the changes here to extract client_encoding from the startup packet
 ASAP, I wonder whether the right thing isn't just to set the client
 encoding immediately when we do that.  Most application libraries pass
 client encoding in the startup packet anyway (libpq certainly does).


Based on Tom's comments above, I'm marking this as returned with 
feedback in the commitfest. I agree that setting client_encoding as 
early as possible seems like the right thing to do.
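
To make "extract client_encoding from the startup packet" concrete, here is a minimal Python sketch of pulling that parameter out of a protocol 3.0 startup packet: a 4-byte length that counts itself, a 4-byte protocol version, then NUL-terminated name/value pairs closed by a lone NUL. It is an illustration only, not the backend's C code. The names build_startup_packet and parse_startup_params are hypothetical, and the sketch assumes ASCII-only names and values.

```python
import struct

PROTOCOL_3_0 = 196608  # (3 << 16) | 0, the protocol 3.0 version number


def build_startup_packet(params: dict) -> bytes:
    """Build a startup packet from name/value pairs (helper for the demo)."""
    body = struct.pack("!I", PROTOCOL_3_0)
    for name, value in params.items():
        body += name.encode("ascii") + b"\x00" + value.encode("ascii") + b"\x00"
    body += b"\x00"  # terminator after the last pair
    return struct.pack("!I", len(body) + 4) + body  # length includes itself


def parse_startup_params(packet: bytes) -> dict:
    """Return the name/value pairs, rejecting anything malformed."""
    if len(packet) < 9:
        raise ValueError("startup packet too short")
    length, version = struct.unpack("!II", packet[:8])
    if length != len(packet):
        raise ValueError("length field does not match packet size")
    if version != PROTOCOL_3_0:
        raise ValueError("unsupported protocol version")
    if packet[-1:] != b"\x00":
        raise ValueError("missing final terminator")
    fields = packet[8:-1].split(b"\x00")
    if fields.pop() != b"":
        raise ValueError("last value is not NUL-terminated")
    if len(fields) % 2 != 0:
        raise ValueError("parameter name without a value")
    names = [f.decode("ascii") for f in fields[0::2]]
    values = [f.decode("ascii") for f in fields[1::2]]
    return dict(zip(names, values))


packet = build_startup_packet({"user": "alice", "client_encoding": "SJIS"})
print(parse_startup_params(packet).get("client_encoding"))
```

Every check raises ValueError instead of guessing, because this parsing would run before the client has authenticated.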


Earlier in this thread, MauMau pointed out that we can't do encoding 
conversions until we have connected to the database because you need to 
read pg_conversion for that. That's because we support creating custom 
conversions with CREATE CONVERSION. Frankly, I don't think anyone cares 
about that feature. If we just dropped the CREATE/DROP CONVERSION 
feature altogether and hard-coded the conversions we have, there would 
be close to zero complaints. Even if you want to extend something around 
encodings and conversions, the CREATE CONVERSION interface is clunky. 
Firstly, conversions are per-database, and even schema-qualified, which 
just seems like an extra complication. You'll most likely want to modify 
the conversion across the whole system. Secondly, rather than define a 
new conversion between encodings, you'll likely want to define a whole 
new encoding with conversions to/from existing encodings, but you can't 
do that anyway without hacking the source code.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [bug fix] multibyte messages are displayed incorrectly on the client

2014-06-23 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 Earlier in this thread, MauMau pointed out that we can't do encoding 
 conversions until we have connected to the database because you need to 
 read pg_conversion for that. That's because we support creating custom 
 conversions with CREATE CONVERSION. Frankly, I don't think anyone cares 
 about that feature. If we just dropped the CREATE/DROP CONVERSION 
 feature altogether and hard-coded the conversions we have, there would 
 be close to zero complaints. Even if you want to extend something around 
 encodings and conversions, the CREATE CONVERSION interface is clunky. 
 Firstly, conversions are per-database, and even schema-qualified, which 
 just seems like an extra complication. You'll most likely want to modify 
 the conversion across the whole system. Secondly, rather than define a 
 new conversion between encodings, you'll likely want to define a whole 
 new encoding with conversions to/from existing encodings, but you can't 
 do that anyway without hacking the source code.

There's certainly something to be said for that position.  If there were
any prospect of extensions defining new encodings someday, I'd argue for
keeping CREATE CONVERSION.  But the performance headaches would be
substantial, and there aren't new encodings coming down the pike often
enough to justify the work involved, so I don't see us ever doing CREATE
ENCODING; and that means that CREATE CONVERSION is of little value.

I'd kind of like to see this go just because having catalog accesses
involved in encoding conversion setup is messy and fragile.
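
A rough Python model of the trade-off, with hypothetical names throughout (CatalogConversions, builtin_convert): a catalog-backed lookup cannot answer before the session has a database connection, while a hard-coded conversion can run at any time. Python's built-in codecs stand in for compiled-in conversion functions; nothing below is PostgreSQL code.

```python
class CatalogConversions:
    """Stand-in for conversions stored in a per-database catalog."""

    def __init__(self) -> None:
        self.connected = False  # becomes True once the session has a database
        self.rows = {}          # (source, destination) -> conversion callable

    def lookup(self, source: str, destination: str):
        if not self.connected:
            # Models the startup-time problem: no catalog access yet.
            raise RuntimeError("catalog not readable before database connection")
        return self.rows[(source, destination)]


def builtin_convert(raw: bytes, source: str, destination: str) -> bytes:
    """Hard-coded conversion: usable at any point, no catalog involved."""
    return raw.decode(source).encode(destination)


catalog = CatalogConversions()
try:
    catalog.lookup("utf-8", "shift_jis")
except RuntimeError as exc:
    print("catalog lookup failed:", exc)

# The hard-coded path works even though no database is connected.
converted = builtin_convert("日本語".encode("utf-8"), "utf-8", "shift_jis")
print(converted.decode("shift_jis"))
```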

regards, tom lane




[HACKERS] Re: [bug fix] multibyte messages are displayed incorrectly on the client

2014-01-10 Thread Noah Misch
On Tue, Jan 07, 2014 at 10:56:28PM +0900, MauMau wrote:
 From: Bruce Momjian br...@momjian.us
 On Sun, Jan  5, 2014 at 04:40:17PM +0900, MauMau wrote:
 Then, as a happy medium, how about disabling message localization
 only if the client encoding differs from the server one?  That is,
 compare the client_encoding value in the startup packet with the
 result of GetPlatformEncoding().  If they don't match, call
 disable_message_localization().

I like this proposal.  Thanks.

 I think the problem is we don't know the client and server encodings
 at that time.
 
 I suppose we know (or at least believe) those encodings during
 backend startup:
 
 * client encoding - the client_encoding parameter passed in the
 startup packet, or if that's not present, client_encoding GUC value.
 
 * server encoding - the encoding of strings gettext() returns.  That
 is what GetPlatformEncoding() returns.

Agreed.  You would need to poke into the relevant part of the startup packet
much earlier than we do today, but that's tractable.  Note that
GetPlatformEncoding() is gone; use GetMessageEncoding().
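
The decision being discussed can be sketched in a few lines of Python. This is only a model of the logic, not backend code: should_localize and canonical_encoding are hypothetical names, Python's codec aliases stand in for PostgreSQL's encoding names (the two sets are not identical), and the message_encoding argument plays the role of whatever GetMessageEncoding() would report.

```python
import codecs
from typing import Mapping, Optional


def canonical_encoding(name: str) -> Optional[str]:
    """Normalize an encoding name, or return None if it is not recognized."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def should_localize(startup_params: Mapping[str, str],
                    guc_client_encoding: str,
                    message_encoding: str) -> bool:
    """Localize only when the client and message encodings are the same."""
    # Prefer the startup packet; fall back to the server-side default.
    requested = startup_params.get("client_encoding", guc_client_encoding)
    client = canonical_encoding(requested)
    server = canonical_encoding(message_encoding)
    # An unrecognized encoding counts as a mismatch, so the message
    # would go out untranslated.
    return client is not None and client == server


print(should_localize({"client_encoding": "SJIS"}, "UTF8", "UTF8"))
print(should_localize({"client_encoding": "utf-8"}, "SQL_ASCII", "UTF8"))
```

Treating an unrecognized name as a mismatch is a choice made for this sketch; it corresponds to sending untranslated messages whenever the client's expectation cannot be determined.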

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] Re: [bug fix] multibyte messages are displayed incorrectly on the client

2014-01-10 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 On Sun, Jan  5, 2014 at 04:40:17PM +0900, MauMau wrote:
 Then, as a happy medium, how about disabling message localization
 only if the client encoding differs from the server one?  That is,
 compare the client_encoding value in the startup packet with the
 result of GetPlatformEncoding().  If they don't match, call
 disable_message_localization().

 I like this proposal.  Thanks.
 ...
 Agreed.  You would need to poke into the relevant part of the startup packet
 much earlier than we do today, but that's tractable.

There's still the problem of what to do before we have a complete startup
packet, or if the packet is defective enough to not contain a recognizable
client encoding.

Perhaps more to the point, what it sounds like this is doing is creating
a third behavioral state, in between what prevails when we're first
reading the packet and what prevails after we've finally adopted the
requested client encoding.  I'm less than convinced that's a good thing.

I'm also rather unexcited by the idea of introducing redundant and/or
ad-hoc code to parse the startup packet.  That sounds like a recipe for
bugs, some of which might even rise to security issues, considering it
would happen before client authentication.

I think if we're going to do anything like this at all, it'd be best
just to disable localization from postmaster fork up till we've gotten
a client encoding out of the packet in the normal course of events.

regards, tom lane




Re: [HACKERS] Re: [bug fix] multibyte messages are displayed incorrectly on the client

2014-01-10 Thread Noah Misch
On Fri, Jan 10, 2014 at 08:03:00PM -0500, Tom Lane wrote:
 Noah Misch n...@leadboat.com writes:
  On Sun, Jan  5, 2014 at 04:40:17PM +0900, MauMau wrote:
  Then, as a happy medium, how about disabling message localization
  only if the client encoding differs from the server one?  That is,
  compare the client_encoding value in the startup packet with the
  result of GetPlatformEncoding().  If they don't match, call
  disable_message_localization().
 
  I like this proposal.  Thanks.
  ...
  Agreed.  You would need to poke into the relevant part of the startup packet
  much earlier than we do today, but that's tractable.
 
 There's still the problem of what to do before we have a complete startup
 packet, or if the packet is defective enough to not contain a recognizable
 client encoding.

MauMau proposed using untranslated messages until we're past that point.  I
like that answer fine, because routine mistakes from ordinary users will not
elicit the errors in question.  The most interesting message in that group
might be 'invalid value for parameter client_encoding', and I think the
presence of the term client_encoding will be a sufficient clue regardless of
how we translate and encode the surrounding words.

 Perhaps more to the point, what it sounds like this is doing is creating
 a third behavioral state, in between what prevails when we're first
 reading the packet and what prevails after we've finally adopted the
 requested client encoding.  I'm less than convinced that's a good thing.
 
 I'm also rather unexcited by the idea of introducing redundant and/or
 ad-hoc code to parse the startup packet.  That sounds like a recipe for
 bugs, some of which might even rise to security issues, considering it
 would happen before client authentication.

Valid worries.

 I think if we're going to do anything like this at all, it'd be best
 just to disable localization from postmaster fork up till we've gotten
 a client encoding out of the packet in the normal course of events.

That was MauMau's original proposal.  I opined upthread that it would be
better to change nothing than to do that.
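
For readers keeping track of the states, here is a small Python model of the two policies as this thread describes them. The Phase names and both function names are invented for this sketch. Each function returns a pair (localized, converted): whether the message text is translated, and whether it is converted to the client encoding before being sent.

```python
from enum import Enum, auto


class Phase(Enum):
    READING_PACKET = auto()       # no usable client_encoding yet
    ENCODING_KNOWN = auto()       # read from the packet, conversion not possible
    SESSION_ESTABLISHED = auto()  # catalogs readable, conversion works


def happy_medium_policy(phase: Phase, encodings_match: bool):
    """The three-state behavior: localize early only if encodings match."""
    if phase is Phase.READING_PACKET:
        return (False, False)
    if phase is Phase.ENCODING_KNOWN:
        return (encodings_match, False)
    return (True, True)


def original_proposal_policy(phase: Phase):
    """English until the database session is established."""
    if phase is Phase.SESSION_ESTABLISHED:
        return (True, True)
    return (False, False)


for phase in Phase:
    print(phase.name,
          happy_medium_policy(phase, encodings_match=True),
          original_proposal_policy(phase))
```

The middle phase is the "third behavioral state" objected to above: it is the only place where the two policies differ.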

nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




[HACKERS] Re: [bug fix] multibyte messages are displayed incorrectly on the client

2014-01-06 Thread Bruce Momjian
On Sun, Jan  5, 2014 at 04:40:17PM +0900, MauMau wrote:
 From: Noah Misch n...@leadboat.com
 I agree that English consistently beats mojibake.  I question whether that
 makes up for the loss of translation when encodings do happen to match,
 particularly for non-technical errors like a mistyped password.  The
 everything-UTF8 scenario appears often, perhaps explaining infrequent
 complaints about the status quo.  If 90% of translated message users have
 client_encoding != server_encoding, then +1 for your patch's strategy.  If the
 figure is only 60%, I'd vote for holding out for a more-extensive fix that
 allows us to encoding-convert localized authentication failure messages.
 
 I agree with you.  It would be more friendly to users if more
 messages are localized.
 
 Then, as a happy medium, how about disabling message localization
 only if the client encoding differs from the server one?  That is,
 compare the client_encoding value in the startup packet with the
 result of GetPlatformEncoding().  If they don't match, call
 disable_message_localization().

I think the problem is we don't know the client and server encodings
at that time.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




[HACKERS] Re: [bug fix] multibyte messages are displayed incorrectly on the client

2013-12-29 Thread Noah Misch
On Sun, Dec 22, 2013 at 07:51:55PM +0900, MauMau wrote:
 From: Noah Misch n...@leadboat.com
 Better to attack that directly.  Arrange to apply any client_encoding named in
 the startup packet earlier, before authentication.  This relates to the TODO
 item "Let the client indicate character encoding of database names, user
 names, and passwords."  (I expect such an endeavor to be tricky.)
 
 Unfortunately, character set conversion is not possible until the
 database session is established, since it requires system catalog
 access.  Please see the comment in src/backend/utils/mb/mbutils.c:
 
 * During backend startup we can't set client encoding because we (a)
 * can't look up the conversion functions, and (b) may not know the database
 * encoding yet either.  So SetClientEncoding() just accepts anything and
 * remembers it for InitializeClientEncoding() to apply later.

Yes, changing that is the tricky part.
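
The accept-now, apply-later behavior that the quoted comment describes can be modeled as below. This is a Python sketch of the pattern only, not a transcription of mbutils.c: the class and its attributes are hypothetical, and Python's codec lookup stands in for looking up conversion functions.

```python
import codecs
from typing import Optional


class ClientEncodingState:
    """Remember a requested encoding during startup and apply it later."""

    def __init__(self) -> None:
        self.backend_ready = False
        self.pending: Optional[str] = None
        self.active: Optional[str] = None

    def set_client_encoding(self, name: str) -> None:
        if not self.backend_ready:
            # During startup: accept anything and just remember it.
            self.pending = name
            return
        self._apply(name)

    def initialize_client_encoding(self) -> None:
        # Called once lookups are possible; applies the remembered value.
        self.backend_ready = True
        if self.pending is not None:
            name, self.pending = self.pending, None
            self._apply(name)

    def _apply(self, name: str) -> None:
        # Raises LookupError here, not earlier, for an unknown name.
        self.active = codecs.lookup(name).name


state = ClientEncodingState()
state.set_client_encoding("SJIS")
print(state.active)  # still unset: nothing has been applied yet
state.initialize_client_encoding()
print(state.active)
```

Any message sent between the two calls goes out without conversion, which is the window this thread is concerned with.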

 I guess that's why Tom-san suggested the same solution as my patch
 (as a compromise) in the below thread, which is also a TODO item:
 
 Re: encoding of PostgreSQL messages
 http://www.postgresql.org/message-id/19896.1234107...@sss.pgh.pa.us

That's fair for the necessarily-earliest messages, like 'invalid value for
parameter client_encoding' and messages pertaining to the physical structure
of the startup packet.  The client's encoding expectation is unknowable.  An
error that mentions client_encoding will hopefully put users on the right
track regardless of how we translate and encode the surrounding words.  The
other affected messages are quite technical, making a casual user unlikely to
fix or even see them.  Not so for authentication messages, so I'm wary of
forcing use of ASCII that late in the handshake.

Note that choosing to use ASCII need not imply wholly declining to translate.
If the build uses GNU libiconv, gettext can emit ASCII approximations for
translations that conform to a Latin-derived alphabet, falling back to no
translation where the alphabet differs too much.  pg_perm_setlocale(LC_CTYPE,
"C") requests such behavior.  (The inferior iconv //TRANSLIT implementation of
GNU libc will convert non-ASCII characters to question marks, though.)
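
To illustrate the idea of an ASCII approximation, here is a Python sketch that uses Unicode NFKD decomposition as a stand-in for transliteration. It does not call gettext or iconv and does not reproduce their tables; ascii_approximation is a hypothetical helper, and it is cruder than a real transliterator because it only handles characters that decompose into an ASCII base plus combining marks.

```python
import unicodedata
from typing import Optional


def ascii_approximation(text: str) -> Optional[str]:
    """Return an ASCII rendering of text, or None if none can be derived."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (accents) left behind by the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    try:
        stripped.encode("ascii")
    except UnicodeEncodeError:
        return None  # alphabet differs too much; caller falls back to English
    return stripped


def message_for_ascii_client(translated: str, untranslated: str) -> str:
    """Prefer an approximated translation, else the untranslated text."""
    approximation = ascii_approximation(translated)
    return approximation if approximation is not None else untranslated


print(message_for_ascii_client("autenticación fallida", "authentication failed"))
print(message_for_ascii_client("認証に失敗しました", "authentication failed"))

# The question-mark behavior, modeled with Python's "replace" error handler:
print("認証".encode("ascii", errors="replace"))
```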

 From: Alvaro Herrera alvhe...@2ndquadrant.com
 The problem is that if there's an encoding mismatch, the message might
 be impossible to figure out.  If the message is in English, at least it
 can be searched for on the web, or something -- the user might even find
 a page in which the English error string appears, with a native language
 explanation.
 
 I feel like this, too.  Being readable in English is better than
 being unrecognizable.

I agree that English consistently beats mojibake.  I question whether that
makes up for the loss of translation when encodings do happen to match,
particularly for non-technical errors like a mistyped password.  The
everything-UTF8 scenario appears often, perhaps explaining infrequent
complaints about the status quo.  If 90% of translated message users have
client_encoding != server_encoding, then +1 for your patch's strategy.  If the
figure is only 60%, I'd vote for holding out for a more-extensive fix that
allows us to encoding-convert localized authentication failure messages.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] Re: [bug fix] multibyte messages are displayed incorrectly on the client

2013-12-20 Thread Alvaro Herrera
Noah Misch wrote:
 On Tue, Dec 17, 2013 at 01:42:08PM -0500, Bruce Momjian wrote:
  On Fri, Dec 13, 2013 at 10:41:17PM +0900, MauMau wrote:

   [Fix]
   Disable message localization during session startup.  In other
   words, messages are output in English until the database session is
   established.
  
  I think the question is whether the server encoding or English are
  likely to be better for the average client.  My bet is that the server
  encoding is more likely correct.
  
  However, you are right that English/ASCII at least will always be
  viewable, while there are many server/client combinations that will
  produce unreadable characters.
  
  I would be interested to hear other people's experience with this.
 
 I don't have a sufficient sense of multilingualism among our users to know
 whether English/ASCII messages would be more useful, on average, than
 localized messages in the server encoding.  Forcing English/ASCII does worsen
 behavior in the frequent situation where client encoding will match server
 encoding.  I lean toward retaining the status quo of delivering localized
 messages in the server encoding.

The problem is that if there's an encoding mismatch, the message might
be impossible to figure out.  If the message is in English, at least it
can be searched for on the web, or something -- the user might even find
a page in which the English error string appears, with a native language
explanation.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Re: [bug fix] multibyte messages are displayed incorrectly on the client

2013-12-19 Thread Noah Misch
On Tue, Dec 17, 2013 at 01:42:08PM -0500, Bruce Momjian wrote:
 On Fri, Dec 13, 2013 at 10:41:17PM +0900, MauMau wrote:
  [Cause]
  While the session is being established, the server cannot use the
  client encoding for message conversion yet, because it cannot access
  system catalogs to retrieve conversion functions.  So, the server
  sends messages to the client without conversion.  In the above
  example, the server sends Japanese UTF-8 messages to psql, which
  expects those messages in SJIS.
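
The mismatch described under [Cause] is easy to reproduce outside the server. The Python sketch below encodes a message in one encoding and decodes it in another, the way a client would if the bytes arrived unconverted. The function name and the sample Japanese text are made up for the illustration; they are not actual PostgreSQL message strings.

```python
def as_seen_by_client(message: str, server_encoding: str,
                      client_encoding: str) -> str:
    """Decode the server's bytes the way an unsuspecting client would."""
    raw = message.encode(server_encoding)  # sent without conversion
    return raw.decode(client_encoding, errors="replace")


localized = "パスワード認証に失敗しました"
english = "password authentication failed"

# UTF-8 bytes read as Shift_JIS: unreadable.
print(as_seen_by_client(localized, "utf-8", "shift_jis"))
# Matching encodings: intact.
print(as_seen_by_client(localized, "utf-8", "utf-8"))
# Plain ASCII letters survive the mismatch.
print(as_seen_by_client(english, "utf-8", "shift_jis"))
```

The third call shows why untranslated messages are the fallback under discussion: ASCII letters are encoded identically in both encodings, so they stay readable.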

Better to attack that directly.  Arrange to apply any client_encoding named in
the startup packet earlier, before authentication.  This relates to the TODO
item "Let the client indicate character encoding of database names, user
names, and passwords."  (I expect such an endeavor to be tricky.)

  [Fix]
  Disable message localization during session startup.  In other
  words, messages are output in English until the database session is
  established.
 
 I think the question is whether the server encoding or English are
 likely to be better for the average client.  My bet is that the server
 encoding is more likely correct.
 
 However, you are right that English/ASCII at least will always be
 viewable, while there are many server/client combinations that will
 produce unreadable characters.
 
 I would be interested to hear other people's experience with this.

I don't have a sufficient sense of multilingualism among our users to know
whether English/ASCII messages would be more useful, on average, than
localized messages in the server encoding.  Forcing English/ASCII does worsen
behavior in the frequent situation where client encoding will match server
encoding.  I lean toward retaining the status quo of delivering localized
messages in the server encoding.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com

