Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-06-26 Thread Noah Misch
On Mon, Jun 24, 2013 at 04:00:00PM +0400, Alexander Law wrote:
 Thanks for your work, your patch is definitely better. I agree that this  
 approach much more generic.

Committed.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-06-24 Thread Alexander Law

Hello Noah,

Thanks for your work, your patch is definitely better. I agree that this 
approach much more generic.

23.06.2013 20:53, Noah Misch wrote:

The attached revision fixes all above points.  Would you look it over?  The
area was painfully light on comments, so I added some.  I renamed
pgwin32_toUTF16(), which ceases to be a good name now that it converts from
message encoding, not database encoding.  GetPlatformEncoding() became unused,
so I removed it.  (If we have cause reintroduce the exact same concept later,
GetTTYEncoding() would name it more accurately.)
Yes, the patch works for me. I have just a little question about 
pgwin32_message_to_UTF16. Do we need to convert SQL_ASCII through UTF8 
or should SQL_ASCII be mapped to 20127 (US-ASCII (7-bit))?

What should we do for the back branches, if anything?  Fixes for garbled
display on consoles and event logs are fair to back-patch, but users might be
accustomed to the present situation for SQL_ASCII databases.  Given the low
incidence of complaints and the workaround of using logging_collector, I am
inclined to put the whole thing in master only.
I thought that the change could be a first step to the PosgreSQL log 
encoding normalization. Today the log may contain messages with 
different encodings (we had a long discussion a year ago: 
http://www.postgresql.org/message-id/5007c399.6000...@gmail.com)
Now the new function GetMessageEncoding allows to convert all the 
messages consistently. If the future log encoding fix will be considered 
as important enough to be backported, then this patch should be 
backported too.


Thanks again!

Best regards,
Alexander



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-06-24 Thread Noah Misch
On Mon, Jun 24, 2013 at 04:00:00PM +0400, Alexander Law wrote:
 23.06.2013 20:53, Noah Misch wrote:
 The attached revision fixes all above points.  Would you look it over?  The
 area was painfully light on comments, so I added some.  I renamed
 pgwin32_toUTF16(), which ceases to be a good name now that it converts from
 message encoding, not database encoding.  GetPlatformEncoding() became 
 unused,
 so I removed it.  (If we have cause reintroduce the exact same concept later,
 GetTTYEncoding() would name it more accurately.)

 Yes, the patch works for me. I have just a little question about  
 pgwin32_message_to_UTF16. Do we need to convert SQL_ASCII through UTF8  
 or should SQL_ASCII be mapped to 20127 (US-ASCII (7-bit))?

Good question.  SQL_ASCII is an awkward database encoding; it actually
indicates ignorance about the significance of byte sequences in text.  As a
result, nothing we do here is likely to be delightful.  I caused an error
containing some LATIN2 bytes (SELECT * FROM śmiać się) in an SQL_ASCII /
LC_CTYPE=C database, and the result was decent, considering: the event log
shows a question-mark-in-rhombus symbol for each of the non-ASCII bytes.
Using CP20127 turned it into 6miaf sij.  I don't have a great idea for
improving on the existing hack.

 What should we do for the back branches, if anything?  Fixes for garbled
 display on consoles and event logs are fair to back-patch, but users might be
 accustomed to the present situation for SQL_ASCII databases.  Given the low
 incidence of complaints and the workaround of using logging_collector, I am
 inclined to put the whole thing in master only.

 I thought that the change could be a first step to the PosgreSQL log  
 encoding normalization. Today the log may contain messages with  
 different encodings (we had a long discussion a year ago:  
 http://www.postgresql.org/message-id/5007c399.6000...@gmail.com)
 Now the new function GetMessageEncoding allows to convert all the  
 messages consistently. If the future log encoding fix will be considered  
 as important enough to be backported, then this patch should be  
 backported too.

I doubt that would prove to be back-patch material.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-06-23 Thread Noah Misch
On Wed, Feb 20, 2013 at 04:00:00PM +0400, Alexander Law wrote:
 15.02.2013 02:59, Noah Misch wrote:
 With your proposed change, the problem will resurface in an actual 
 SQL_ASCII
 database.  At the problem's root is write_console()'s assumption that 
 messages
 are in the database encoding.  pg_bind_textdomain_codeset() tries to make 
 that
 so, but it only works for encodings with a pg_enc2gettext_tbl entry.  That
 excludes SQL_ASCII, MULE_INTERNAL, and others.  write_console() needs to
 behave differently in such cases.
 Thank you for the notice. So it seems that DatabaseEncoding variable
 alone can't present a database encoding (for communication with a
 client) and current process messages encoding (for logging messages) at
 once. There should be another variable, something like
 CurrentProcessEncoding, that will be set to OS encoding at start and can
 be changed to encoding of a connected database (if
 bind_textdomain_codeset succeeded).
 I'd call it MessageEncoding unless it corresponds with similar rigor to a
 broader concept.
 Please look at the next version of the patch.

I studied this patch.  I like the APIs it adds, but the implementation details
required attention.

Your patch assumes the message encoding will be a backend encoding, but this
need not be so; locale Japanese (Japan) uses CP932 aka SJIS.  I've also
added the non-backend encodings to pg_enc2gettext_tbl; I'd welcome sanity
checks from folks who know those encodings well.

write_console() still garbled the payload in all cases where it used write()
to a console instead of WriteConsoleW().  You can observe this by configuring
Windows for Russian, running initdb with default locale settings, and running
select 1/0 in template1.  See comments for the technical details; I fixed
this by removing the logic to preemptively skip WriteConsoleW() for
encoding-related reasons.  (Over in write_eventlog(), I am suspicious of the
benefits we get from avoiding ReportEventW() where possible.  I have not
removed that, though.)

 --- a/src/backend/main/main.c
 +++ b/src/backend/main/main.c
 @@ -100,6 +100,8 @@ main(int argc, char *argv[])
  
   set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN(postgres));
  
 + SetMessageEncoding(GetPlatformEncoding());

SetMessageEncoding() and GetPlatformEncoding() can both elog()/ereport(), but
this is too early in the life of a PostgreSQL process for that.

The goal of this line of code is to capture gettext's default encoding, which
is platform-specific.  On non-Windows platforms, it's the encoding implied by
LC_CTYPE.  On Windows, it's the Windows ANSI code page.  GetPlatformEncoding()
always gives the encoding implied by LC_CTYPE, which is therefore not correct
on Windows.  You can observe broken output by setting your locale (in control
panel Region and Language - Formats - Format) to something unrelated to
your Windows ANSI code page (Region and Language - Administrative - Change
system locale...).  Let's make PostgreSQL independent of gettext's
Windows-specific default by *always* calling bind_textdomain_codeset() on
Windows.  In the postmaster and other non-database-attached processes, as well
as in SQL_ASCII databases, we would pass the encoding implied by LC_CTYPE.
Besides removing a discrepancy in PostgreSQL behavior between Windows and
non-Windows platforms, this has the great practical advantage of reducing
PostgreSQL's dependence on the deprecated Windows ANSI code page, which can
only be changed on a system-wide basis, by an administrator, after a reboot.

For message creation purposes, the encoding implied by LC_CTYPE on Windows is
somewhat arbitrary.  Microsoft has deprecated them all in favor of UTF16, and
some locales (e.g. Hindi (India)) have no legacy code page.  I can see
adding a postmaster_encoding GUC to be used in place of consulting LC_CTYPE.
I haven't done that now; I just mention it for future reference.

On a !ENABLE_NLS build, messages are ASCII with database-encoding strings
sometimes interpolated.  Therefore, such builds should keep the message
encoding equal to the database encoding.

The MessageEncoding was not maintained accurately on non-Windows systems.  For
example, connecting to an lc_ctype=LATIN1, encoding=LATIN1 database does not
require a bind_textdomain_codeset() call on such systems, so we did not update
the message encoding.  This was academic for the time being, because
MessageEncoding is only consulted on Windows.


The attached revision fixes all above points.  Would you look it over?  The
area was painfully light on comments, so I added some.  I renamed
pgwin32_toUTF16(), which ceases to be a good name now that it converts from
message encoding, not database encoding.  GetPlatformEncoding() became unused,
so I removed it.  (If we have cause reintroduce the exact same concept later,
GetTTYEncoding() would name it more accurately.)


What should we do for the back branches, if anything?  Fixes for garbled
display on consoles and event logs are fair to 

Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-02-20 Thread Alexander Law

Hello,

15.02.2013 02:59, Noah Misch wrote:

With your proposed change, the problem will resurface in an actual SQL_ASCII
database.  At the problem's root is write_console()'s assumption that messages
are in the database encoding.  pg_bind_textdomain_codeset() tries to make that
so, but it only works for encodings with a pg_enc2gettext_tbl entry.  That
excludes SQL_ASCII, MULE_INTERNAL, and others.  write_console() needs to
behave differently in such cases.

Thank you for the notice. So it seems that DatabaseEncoding variable
alone can't present a database encoding (for communication with a
client) and current process messages encoding (for logging messages) at
once. There should be another variable, something like
CurrentProcessEncoding, that will be set to OS encoding at start and can
be changed to encoding of a connected database (if
bind_textdomain_codeset succeeded).

I'd call it MessageEncoding unless it corresponds with similar rigor to a
broader concept.

Please look at the next version of the patch.

Thanks,
Alexander
From 5bce21326d48761c6f86be8797432a69b2533dcd Mon Sep 17 00:00:00 2001
From: Alexander Lakhin exclus...@gmail.com
Date: Wed, 20 Feb 2013 15:34:05 +0400
Subject: Fix postmaster messages encoding

---
 src/backend/main/main.c|2 ++
 src/backend/utils/error/elog.c |4 ++--
 src/backend/utils/mb/mbutils.c |   24 ++--
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 1173bda..ed4067e 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -100,6 +100,8 @@ main(int argc, char *argv[])
 
 	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN(postgres));
 
+	SetMessageEncoding(GetPlatformEncoding());
+
 #ifdef WIN32
 
 	/*
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 3a211bf..40f20f3 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -1868,7 +1868,7 @@ write_eventlog(int level, const char *line, int len)
 	 * Also verify that we are not on our way into error recursion trouble due
 	 * to error messages thrown deep inside pgwin32_toUTF16().
 	 */
-	if (GetDatabaseEncoding() != GetPlatformEncoding() 
+	if (GetMessageEncoding() != GetPlatformEncoding() 
 		!in_error_recursion_trouble())
 	{
 		utf16 = pgwin32_toUTF16(line, len, NULL);
@@ -1915,7 +1915,7 @@ write_console(const char *line, int len)
 	 * through to writing unconverted if we have not yet set up
 	 * CurrentMemoryContext.
 	 */
-	if (GetDatabaseEncoding() != GetPlatformEncoding() 
+	if (GetMessageEncoding() != GetPlatformEncoding() 
 		!in_error_recursion_trouble() 
 		!redirection_done 
 		CurrentMemoryContext != NULL)
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 287ff80..8b51b78 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -57,6 +57,7 @@ static FmgrInfo *ToClientConvProc = NULL;
  */
 static pg_enc2name *ClientEncoding = pg_enc2name_tbl[PG_SQL_ASCII];
 static pg_enc2name *DatabaseEncoding = pg_enc2name_tbl[PG_SQL_ASCII];
+static pg_enc2name *MessageEncoding = pg_enc2name_tbl[PG_SQL_ASCII];
 static pg_enc2name *PlatformEncoding = NULL;
 
 /*
@@ -881,6 +882,16 @@ SetDatabaseEncoding(int encoding)
 	Assert(DatabaseEncoding-encoding == encoding);
 }
 
+void
+SetMessageEncoding(int encoding)
+{
+	if (!PG_VALID_BE_ENCODING(encoding))
+		elog(ERROR, invalid message encoding: %d, encoding);
+
+	MessageEncoding = pg_enc2name_tbl[encoding];
+	Assert(MessageEncoding-encoding == encoding);
+}
+
 /*
  * Bind gettext to the codeset equivalent with the database encoding.
  */
@@ -915,6 +926,8 @@ pg_bind_textdomain_codeset(const char *domainname)
 			if (bind_textdomain_codeset(domainname,
 		pg_enc2gettext_tbl[i].name) == NULL)
 elog(LOG, bind_textdomain_codeset failed);
+			else
+SetMessageEncoding(encoding);
 			break;
 		}
 	}
@@ -964,6 +977,13 @@ GetPlatformEncoding(void)
 	return PlatformEncoding-encoding;
 }
 
+int
+GetMessageEncoding(void)
+{
+	Assert(MessageEncoding);
+	return MessageEncoding-encoding;
+}
+
 #ifdef WIN32
 
 /*
@@ -977,7 +997,7 @@ pgwin32_toUTF16(const char *str, int len, int *utf16len)
 	int			dstlen;
 	UINT		codepage;
 
-	codepage = pg_enc2name_tbl[GetDatabaseEncoding()].codepage;
+	codepage = pg_enc2name_tbl[GetMessageEncoding()].codepage;
 
 	/*
 	 * Use MultiByteToWideChar directly if there is a corresponding codepage,
@@ -994,7 +1014,7 @@ pgwin32_toUTF16(const char *str, int len, int *utf16len)
 		char	   *utf8;
 
 		utf8 = (char *) pg_do_encoding_conversion((unsigned char *) str,
-		len, GetDatabaseEncoding(), PG_UTF8);
+		len, GetMessageEncoding(), PG_UTF8);
 		if (utf8 != str)
 			len = strlen(utf8);
 
-- 
1.7.10.4


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-02-14 Thread Peter Eisentraut
On 2/11/13 10:22 PM, Greg Stark wrote:
 On Sun, Feb 10, 2013 at 11:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 If we knew that postgresql.conf was stored in, say, UTF8, then it would
 probably be possible to perform encoding conversion to get string
 variables into the database encoding.  Perhaps we should allow some
 magic syntax to tell us the encoding of a config file?

 file_encoding = 'utf8'  # must precede any non-ASCII in the file
 
 If we're going to do that we might as well use the Emacs standard
 -*-coding: latin-1;-*-

Yes, or more generally perhaps what Python does:
http://docs.python.org/2.7/reference/lexical_analysis.html#encoding-declarations

(In Python 2, the default is ASCII, in Python 3, the default is UTF8.)

 But that said I'm not sure saying the whole file is in an encoding is
 the right approach. Paths are actually binary strings. any encoding is
 purely for display purposes anyways. Log line formats could be treated
 similarly if we choose.

Well, at some point someone is going to open a configuration file in an
editor, and the editor will usually only be table to deal with one
encoding per file.  So we need to make that work, even if technically,
different considerations apply to different settings.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-02-14 Thread Craig Ringer
On 02/15/2013 12:45 AM, Peter Eisentraut wrote:
 On 2/11/13 10:22 PM, Greg Stark wrote:
 On Sun, Feb 10, 2013 at 11:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 If we knew that postgresql.conf was stored in, say, UTF8, then it would
 probably be possible to perform encoding conversion to get string
 variables into the database encoding.  Perhaps we should allow some
 magic syntax to tell us the encoding of a config file?

 file_encoding = 'utf8'  # must precede any non-ASCII in the file
 If we're going to do that we might as well use the Emacs standard
 -*-coding: latin-1;-*-
 Yes, or more generally perhaps what Python does:
 http://docs.python.org/2.7/reference/lexical_analysis.html#encoding-declarations

 (In Python 2, the default is ASCII, in Python 3, the default is UTF8.)
Not that Python also respects a BOM in a UTF-8 file, treating the BOM as
flagging the file as being UTF-8.

In addition, if the first bytes of the file are the UTF-8 byte-order
mark ('\xef\xbb\xbf'), the declared file encoding is UTF-8.

IMO we should do the same. If there's no explicit encoding declaration,
treat it as UTF-8 if there's a BOM and as the platform's local character
encoding otherwise.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-02-14 Thread Noah Misch
On Thu, Feb 14, 2013 at 11:11:13AM +0400, Alexander Law wrote:
 Hello,
 Alexander Law exclus...@gmail.com writes:
 Please look at the following l10n bug:
 http://www.postgresql.org/message-id/502a26f1.6010...@gmail.com
 and the proposed patch.
 With your proposed change, the problem will resurface in an actual SQL_ASCII
 database.  At the problem's root is write_console()'s assumption that 
 messages
 are in the database encoding.  pg_bind_textdomain_codeset() tries to make 
 that
 so, but it only works for encodings with a pg_enc2gettext_tbl entry.  That
 excludes SQL_ASCII, MULE_INTERNAL, and others.  write_console() needs to
 behave differently in such cases.
 Thank you for the notice. So it seems that DatabaseEncoding variable  
 alone can't present a database encoding (for communication with a  
 client) and current process messages encoding (for logging messages) at  
 once. There should be another variable, something like  
 CurrentProcessEncoding, that will be set to OS encoding at start and can  
 be changed to encoding of a connected database (if  
 bind_textdomain_codeset succeeded).

I'd call it MessageEncoding unless it corresponds with similar rigor to a
broader concept.

 On Tue, Feb 12, 2013 at 03:22:17AM +, Greg Stark wrote:
 But that said I'm not sure saying the whole file is in an encoding is
 the right approach. Paths are actually binary strings. any encoding is
 purely for display purposes anyways.
 For Unix, yes.  On Windows, they're ultimately UTF16 strings; some system 
 APIs
 accept paths in the Windows ANSI code page and convert to UTF16 internally.
 Nonetheless, good point.
 Yes, and if postresql.conf not going to be UTF16 encoded, it seems  
 natural to use ANSI code page on Windows to write such paths in it.
 So the paths should be written in OS encoding, which is accepted by OS  
 functions, such as fopen. (This is what we have now.)

To the contrary, we would do better to use _wfopen() after converting from the
encoding at hand to UTF16.  We should have the goal of removing our dependence
on the Windows ANSI code page, not tightening our bonds thereto.  As long as
PostgreSQL uses fopen() on Windows, it will remain possible to create a file
that PostgreSQL cannot open.  Making the full transition is probably a big
job, and we don't need to get there in one patch.  Try, however, to avoid
patch designs that increase the distance left to cover.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-02-13 Thread Noah Misch
On Tue, Feb 12, 2013 at 03:22:17AM +, Greg Stark wrote:
 But that said I'm not sure saying the whole file is in an encoding is
 the right approach. Paths are actually binary strings. any encoding is
 purely for display purposes anyways.

For Unix, yes.  On Windows, they're ultimately UTF16 strings; some system APIs
accept paths in the Windows ANSI code page and convert to UTF16 internally.
Nonetheless, good point.

 What parts of postgresql.conf are actually encoded strings that need
 to be (and can be) manipulated as encoded strings?

Mainly the ones that refer to arbitrary database objects.  At least these:

default_tablespace
default_text_search_config
search_path
temp_tablespaces


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-02-13 Thread Alexander Law

Hello,

Alexander Law exclus...@gmail.com writes:

Please look at the following l10n bug:
http://www.postgresql.org/message-id/502a26f1.6010...@gmail.com
and the proposed patch.

With your proposed change, the problem will resurface in an actual SQL_ASCII
database.  At the problem's root is write_console()'s assumption that messages
are in the database encoding.  pg_bind_textdomain_codeset() tries to make that
so, but it only works for encodings with a pg_enc2gettext_tbl entry.  That
excludes SQL_ASCII, MULE_INTERNAL, and others.  write_console() needs to
behave differently in such cases.
Thank you for the notice. So it seems that DatabaseEncoding variable 
alone can't present a database encoding (for communication with a 
client) and current process messages encoding (for logging messages) at 
once. There should be another variable, something like 
CurrentProcessEncoding, that will be set to OS encoding at start and can 
be changed to encoding of a connected database (if 
bind_textdomain_codeset succeeded).



On Tue, Feb 12, 2013 at 03:22:17AM +, Greg Stark wrote:

But that said I'm not sure saying the whole file is in an encoding is
the right approach. Paths are actually binary strings. any encoding is
purely for display purposes anyways.

For Unix, yes.  On Windows, they're ultimately UTF16 strings; some system APIs
accept paths in the Windows ANSI code page and convert to UTF16 internally.
Nonetheless, good point.
Yes, and if postresql.conf not going to be UTF16 encoded, it seems 
natural to use ANSI code page on Windows to write such paths in it.
So the paths should be written in OS encoding, which is accepted by OS 
functions, such as fopen. (This is what we have now.)
And it seems too complicated to have different encodings in one file. Or 
maybe path parameters should be separated from the others, for which OS 
encoding is undesirable.

If we knew that postgresql.conf was stored in, say, UTF8, then it would
probably be possible to perform encoding conversion to get string
variables into the database encoding.  Perhaps we should allow some
magic syntax to tell us the encoding of a config file?

 file_encoding = 'utf8'  # must precede any non-ASCII in the file

If we're going to do that we might as well use the Emacs standard
-*-coding: latin-1;-*-


Explicit encoding specification such as these (or even ?xml 
version=1.0 encoding=utf-8?) can be useful but what encoding to 
assume without it? For XML (without BOM) it's UTF-8, for emacs it 
depends on it's language environment.
If postgresql.conf doesn't have to be portable (as XML), then IMO OS 
encoding is the right choice for it.



Best regards,
Alexander


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-02-11 Thread Noah Misch
On Sun, Feb 10, 2013 at 06:47:30PM -0500, Tom Lane wrote:
 Noah Misch n...@leadboat.com writes:
  Following some actual testing, I see that we treat postgresql.conf values as
  byte sequences; any reinterpretation as encoded text happens later.  Hence,
  contrary to my earlier suspicion, your patch does not make that situation
  worse.  The present situation is bad; among other things, current_setting() 
  is
  a vector for injecting invalid text data.  But unconditionally validating
  postgresql.conf values in the platform encoding would not be an improvement.
  Suppose you have a UTF-8 platform encoding and KOI8R databases.  You may 
  wish
  to put KOI8R strings in a GUC, say search_path.  That's possible today; if 
  we
  required that postgresql.conf conform to the platform encoding and no other,
  it would become impossible.  This area warrants improvement, but doing so 
  will
  entail careful design.
 
 The key problem, ISTM, is that it's not at all clear what encoding to
 expect the incoming data to be in.  I'm concerned about trying to fix
 that by assuming it's in some platform encoding --- for one thing,
 while that might be a well-defined concept on Windows, I don't believe
 it is anywhere else.

GetPlatformEncoding() imposes a sufficiently-portable definition.  I just
don't think that definition leads to a value that can be presumed desirable
and adequate for postgresql.conf.

 If we knew that postgresql.conf was stored in, say, UTF8, then it would
 probably be possible to perform encoding conversion to get string
 variables into the database encoding.  Perhaps we should allow some
 magic syntax to tell us the encoding of a config file?
 
   file_encoding = 'utf8'  # must precede any non-ASCII in the file
 
 There would still be a lot of practical problems to solve, like what to
 do if we fail to convert some string into the database encoding.  But at
 least the problems would be somewhat well-defined.

Agreed.  That's a promising direction.

 While we're thinking about this, it'd be nice to fix our handling (or
 rather lack of handling) of encoding considerations for database names,
 user names, and passwords.  I could imagine adding some sort of encoding
 marker to connection request packets, which could fix the don't-know-
 the-encoding problem as far as incoming data is concerned.

That deserves a TODO entry under Wire Protocol Changes to avoid losing it.

 But how
 shall we deal with storing the strings in shared catalogs, which have to
 be readable from multiple databases possibly of different encodings?

I suppose we would pick an encoding sufficient for all values we intend to
support (UTF8?  MULE_INTERNAL?), then store the data in that encoding using
either bytea or a new type, say omnitext.

Thanks,
nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-02-11 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 On Sun, Feb 10, 2013 at 06:47:30PM -0500, Tom Lane wrote:
 The key problem, ISTM, is that it's not at all clear what encoding to
 expect the incoming data to be in.  I'm concerned about trying to fix
 that by assuming it's in some platform encoding --- for one thing,
 while that might be a well-defined concept on Windows, I don't believe
 it is anywhere else.

 GetPlatformEncoding() imposes a sufficiently-portable definition.  I just
 don't think that definition leads to a value that can be presumed desirable
 and adequate for postgresql.conf.

Nah, GetPlatformEncoding is utterly useless for this purpose --- restart
the server with some other environment, and bad things will happen.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-02-11 Thread Greg Stark
On Sun, Feb 10, 2013 at 11:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 If we knew that postgresql.conf was stored in, say, UTF8, then it would
 probably be possible to perform encoding conversion to get string
 variables into the database encoding.  Perhaps we should allow some
 magic syntax to tell us the encoding of a config file?

 file_encoding = 'utf8'  # must precede any non-ASCII in the file

If we're going to do that we might as well use the Emacs standard
-*-coding: latin-1;-*-

But that said I'm not sure saying the whole file is in an encoding is
the right approach. Paths are actually binary strings. any encoding is
purely for display purposes anyways. Log line formats could be treated
similarly if we choose.

Hostnames would need to be in a particular encoding if we're to
generate punycode I think but I think that particular encoding would
have to be UTF8. Converting from some other encoding would be error
prone and unnecessarily complex.

What parts of postgresql.conf are actually encoded strings that need
to be (and can be) manipulated as encoded strings?

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-02-10 Thread Noah Misch
On Wed, Jan 30, 2013 at 10:00:01AM +0400, Alexander Law wrote:
 30.01.2013 05:51, Noah Misch wrote:
 On Tue, Jan 29, 2013 at 09:54:04AM -0500, Tom Lane wrote:
 Alexander Law exclus...@gmail.com writes:
 Please look at the following l10n bug:
 http://www.postgresql.org/message-id/502a26f1.6010...@gmail.com
 and the proposed patch.

 Even then, I wouldn't be surprised to find problematic consequences beyond
 error display.  What if all the databases are EUC_JP, the platform encoding 
 is
 KOI8, and some postgresql.conf settings contain EUC_JP characters?  Does the
 postmaster not rely on its use of SQL_ASCII to allow those values?

 I would look at fixing this by making the error output machinery smarter in
 this area before changing the postmaster's notion of server_encoding.

With your proposed change, the problem will resurface in an actual SQL_ASCII
database.  At the problem's root is write_console()'s assumption that messages
are in the database encoding.  pg_bind_textdomain_codeset() tries to make that
so, but it only works for encodings with a pg_enc2gettext_tbl entry.  That
excludes SQL_ASCII, MULE_INTERNAL, and others.  write_console() needs to
behave differently in such cases.

 Maybe I still miss something but I thought that  
 postinit.c/CheckMyDatabase will switch encoding of a messages by  
 pg_bind_textdomain_codeset to EUC_JP so there will be no issues with it.  
 But until then KOI8 should be used.
 Regarding postgresql.conf, as it has no explicit encoding specification,  
 it should be interpreted as having the platform encoding. So in your  
 example it should contain KOI8, not EUC_JP characters.

Following some actual testing, I see that we treat postgresql.conf values as
byte sequences; any reinterpretation as encoded text happens later.  Hence,
contrary to my earlier suspicion, your patch does not make that situation
worse.  The present situation is bad; among other things, current_setting() is
a vector for injecting invalid text data.  But unconditionally validating
postgresql.conf values in the platform encoding would not be an improvement.
Suppose you have a UTF-8 platform encoding and KOI8R databases.  You may wish
to put KOI8R strings in a GUC, say search_path.  That's possible today; if we
required that postgresql.conf conform to the platform encoding and no other,
it would become impossible.  This area warrants improvement, but doing so will
entail careful design.

Thanks,
nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-02-10 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 Following some actual testing, I see that we treat postgresql.conf values as
 byte sequences; any reinterpretation as encoded text happens later.  Hence,
 contrary to my earlier suspicion, your patch does not make that situation
 worse.  The present situation is bad; among other things, current_setting() is
 a vector for injecting invalid text data.  But unconditionally validating
 postgresql.conf values in the platform encoding would not be an improvement.
 Suppose you have a UTF-8 platform encoding and KOI8R databases.  You may wish
 to put KOI8R strings in a GUC, say search_path.  That's possible today; if we
 required that postgresql.conf conform to the platform encoding and no other,
 it would become impossible.  This area warrants improvement, but doing so will
 entail careful design.

The key problem, ISTM, is that it's not at all clear what encoding to
expect the incoming data to be in.  I'm concerned about trying to fix
that by assuming it's in some platform encoding --- for one thing,
while that might be a well-defined concept on Windows, I don't believe
it is anywhere else.

If we knew that postgresql.conf was stored in, say, UTF8, then it would
probably be possible to perform encoding conversion to get string
variables into the database encoding.  Perhaps we should allow some
magic syntax to tell us the encoding of a config file?

file_encoding = 'utf8'  # must precede any non-ASCII in the file

There would still be a lot of practical problems to solve, like what to
do if we fail to convert some string into the database encoding.  But at
least the problems would be somewhat well-defined.

While we're thinking about this, it'd be nice to fix our handling (or
rather lack of handling) of encoding considerations for database names,
user names, and passwords.  I could imagine adding some sort of encoding
marker to connection request packets, which could fix the don't-know-
the-encoding problem as far as incoming data is concerned.  But how
shall we deal with storing the strings in shared catalogs, which have to
be readable from multiple databases possibly of different encodings?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-01-29 Thread Tom Lane
Alexander Law exclus...@gmail.com writes:
 Please look at the following l10n bug:
 http://www.postgresql.org/message-id/502a26f1.6010...@gmail.com
 and the proposed patch.

That patch looks entirely unsafe to me.  Neither of those functions
should be expected to be able to run when none of our standard
infrastructure (palloc, elog) is up yet.

Possibly it would be safe to do this somewhere around where we do
GUC initialization.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-01-29 Thread Noah Misch
On Tue, Jan 29, 2013 at 09:54:04AM -0500, Tom Lane wrote:
 Alexander Law exclus...@gmail.com writes:
  Please look at the following l10n bug:
  http://www.postgresql.org/message-id/502a26f1.6010...@gmail.com
  and the proposed patch.
 
 That patch looks entirely unsafe to me.  Neither of those functions
 should be expected to be able to run when none of our standard
 infrastructure (palloc, elog) is up yet.
 
 Possibly it would be safe to do this somewhere around where we do
 GUC initialization.

Even then, I wouldn't be surprised to find problematic consequences beyond
error display.  What if all the databases are EUC_JP, the platform encoding is
KOI8, and some postgresql.conf settings contain EUC_JP characters?  Does the
postmaster not rely on its use of SQL_ASCII to allow those values?

I would look at fixing this by making the error output machinery smarter in
this area before changing the postmaster's notion of server_encoding.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-01-29 Thread Alexander Law

30.01.2013 05:51, Noah Misch wrote:

On Tue, Jan 29, 2013 at 09:54:04AM -0500, Tom Lane wrote:

Alexander Law exclus...@gmail.com writes:

Please look at the following l10n bug:
http://www.postgresql.org/message-id/502a26f1.6010...@gmail.com
and the proposed patch.

That patch looks entirely unsafe to me.  Neither of those functions
should be expected to be able to run when none of our standard
infrastructure (palloc, elog) is up yet.

Possibly it would be safe to do this somewhere around where we do
GUC initialization.


Looking at elog.c:write_console, and boostrap.c:AuxiliaryProcessMain, 
mcxt.c:MemoryContextInit I would place this call 
(SetDatabaseEncoding(GetPlatformEncoding())) at MemoryContextInit.
(The branch of conversion pgwin32_toUTF16 is not executed until 
CurrentMemoryContext is not null)


But I see some calls to ereport before MemoryContextInit. Is it ok or 
MemoryContext initialization should be done before?

For example, main.c:main - pgwin32_signal_initialize - ereport

And there is another issue with elog.c:write_stderr
if (pgwin32_is_service) then the process writes message to the windows 
eventlog (write_eventlog), trying to convert in to UTF16. But it doesn't 
check MemoryContext before the call to pgwin32_toUTF16 (as write_console 
does) and we can get a crash in the following way:
main.c:check_root - if (pgwin32_is_admin()) write_stderr - if 
(pgwin32_is_service()) write_eventlog - if (if (GetDatabaseEncoding() 
!= GetPlatformEncoding() ) pgwin32_toUTF16 - crash


So placing SetDatabaseEncoding(GetPlatformEncoding()) before the 
check_root can be a solution for the issue.



Even then, I wouldn't be surprised to find problematic consequences beyond
error display.  What if all the databases are EUC_JP, the platform encoding is
KOI8, and some postgresql.conf settings contain EUC_JP characters?  Does the
postmaster not rely on its use of SQL_ASCII to allow those values?

I would look at fixing this by making the error output machinery smarter in
this area before changing the postmaster's notion of server_encoding.
Maybe I still miss something but I thought that 
postinit.c/CheckMyDatabase will switch encoding of a messages by 
pg_bind_textdomain_codeset to EUC_JP so there will be no issues with it. 
But until then KOI8 should be used.
Regarding postgresql.conf, as it has no explicit encoding specification, 
it should be interpreted as having the platform encoding. So in your 
example it should contain KOI8, not EUC_JP characters.


Thanks,
Alexander


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #7493: Postmaster messages unreadable in a Windows console

2013-01-28 Thread Alexander Law

Hello,
Thanks for fixing bug #6510!
Please look at the following l10n bug:
http://www.postgresql.org/message-id/502a26f1.6010...@gmail.com
and the proposed patch.

Best regards,
Alexander

From 1e2d5f712744d4731b665724703c0da4971ea41e Mon Sep 17 00:00:00 2001
From: Alexander Lakhin exclus...@gmail.com
Date: Mon, 28 Jan 2013 08:19:34 +0400
Subject: Fix postmaster messages encoding

---
 src/backend/main/main.c |6 ++
 1 file changed, 6 insertions(+)
 mode change 100644 = 100755 src/backend/main/main.c

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 1173bda..b79a483
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -89,6 +89,12 @@ main(int argc, char *argv[])
 	pgwin32_install_crashdump_handler();
 #endif
 
+/*
+* Use the platform encoding until the process connects to a database
+* and sets the appropriate encoding.
+*/
+	SetDatabaseEncoding(GetPlatformEncoding());
+
 	/*
 	 * Set up locale information from environment.	Note that LC_CTYPE and
 	 * LC_COLLATE will be overridden later from pg_control if we are in an
-- 
1.7.10.4




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers