Re: [HACKERS] Encoding issues in console and eventlog on win32

2009-10-16 Thread Bruce Momjian
Robert Haas wrote:
  Sure, the logfile will be filled with mixed encoding strings,
  that could happen in logfile and syslog on non-win32 platforms.
  I think UTF8 is better than UTF16 for logfile encoding because
  there are some text editors that do not support wide characters.
  At any rate, the logfile encoding feature will come from another patch,
  that might add log_encoding variable and work on any platforms.
 
 Magnus has promised me on a stack of instant messages that he will
 review this soon, but as he hasn't gotten to it yet, I am moving it to
 the next CommitFest.

I am with Magnus today and will make sure it gets done.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Encoding issues in console and eventlog on win32

2009-10-16 Thread Magnus Hagander
2009/10/13 Itagaki Takahiro itagaki.takah...@oss.ntt.co.jp:

 Magnus Hagander mag...@hagander.net wrote:

 One other question - you note that WriteConsoleW() could fail if
 stderr is redirected. Are you saying that it will always fail when
 stderr is redirected, or only sometimes? If ony sometimes, do you know
 under which conditions it happens?

 It will always fail if redirected. We can test the conditions using:
    pg_ctl start  result.log
 So, the comment should be:
    /* WriteConsoleW always fails if stderr is redirected. */

Ok, fair enough. We already have a variable for that though - it's
called redirection_done. I think it does what's necessary - I have
used that one in my version of the patch. Please verify that it works
in your environment.


 I cleaned up the patch per comments. I hope this will be the final one ;-).

  * Use in_error_recursion_trouble() instead of own implementation.
  * Use def_enc2name() macro to avoid adding the codepage field
    on non-win32 platforms.

Per previous email, I had done this in my version of the patch, so it
looks slightly different than yours, but it has the same
functionality.


  * Fix a bug of calculation of result length.

Where exactly is this one? I can't find it compared to my code, but
that could just be out-of-timezone-brain speaking :-)

  * Fix a memory leak on error handling path in pgwin32_toUTF16().

Missed that one, thanks!


 If it's always, I assume this just means that the logfile will be in
 the database encoding and not in UTF16? Is this what we want, or would
 we like the logfile to also be in UTF16? If we can convert it to
 UTF16, that would fix the case when you have different databases in
 different encodings, wouldn't it? (Even if your editor, unlike the
 console subsystem, can view the individual encoding you need, I bet it
 can't deal with multiple encodings in the same file)

 Sure, the logfile will be filled with mixed encoding strings,
 that could happen in logfile and syslog on non-win32 platforms.
 I think UTF8 is better than UTF16 for logfile encoding because
 there are some text editors that do not support wide characters.

Don't most text editors on Windows do UTF16? In particular, I'd expect
more of them to do UTF16 than UTF8, but I could be wrong?

 At any rate, the logfile encoding feature will come from another patch,
 that might add log_encoding variable and work on any platforms.

Ok, good. Particularly the other platform is the winning argument.

So, what I believe is the latest version of the patch applied. Please
point out if I made a mistake in my changes against yours.

Sorry about the delay :(


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Encoding issues in console and eventlog on win32

2009-10-15 Thread Robert Haas
On Mon, Oct 12, 2009 at 9:13 PM, Itagaki Takahiro
itagaki.takah...@oss.ntt.co.jp wrote:

 Magnus Hagander mag...@hagander.net wrote:

 One other question - you note that WriteConsoleW() could fail if
 stderr is redirected. Are you saying that it will always fail when
 stderr is redirected, or only sometimes? If ony sometimes, do you know
 under which conditions it happens?

 It will always fail if redirected. We can test the conditions using:
    pg_ctl start  result.log
 So, the comment should be:
    /* WriteConsoleW always fails if stderr is redirected. */

 I cleaned up the patch per comments. I hope this will be the final one ;-).

  * Use in_error_recursion_trouble() instead of own implementation.
  * Use def_enc2name() macro to avoid adding the codepage field
    on non-win32 platforms.
  * Fix a bug of calculation of result length.
  * Fix a memory leak on error handling path in pgwin32_toUTF16().


 If it's always, I assume this just means that the logfile will be in
 the database encoding and not in UTF16? Is this what we want, or would
 we like the logfile to also be in UTF16? If we can convert it to
 UTF16, that would fix the case when you have different databases in
 different encodings, wouldn't it? (Even if your editor, unlike the
 console subsystem, can view the individual encoding you need, I bet it
 can't deal with multiple encodings in the same file)

 Sure, the logfile will be filled with mixed encoding strings,
 that could happen in logfile and syslog on non-win32 platforms.
 I think UTF8 is better than UTF16 for logfile encoding because
 there are some text editors that do not support wide characters.
 At any rate, the logfile encoding feature will come from another patch,
 that might add log_encoding variable and work on any platforms.

Magnus has promised me on a stack of instant messages that he will
review this soon, but as he hasn't gotten to it yet, I am moving it to
the next CommitFest.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Encoding issues in console and eventlog on win32

2009-10-12 Thread Itagaki Takahiro

Magnus Hagander mag...@hagander.net wrote:

 One other question - you note that WriteConsoleW() could fail if
 stderr is redirected. Are you saying that it will always fail when
 stderr is redirected, or only sometimes? If ony sometimes, do you know
 under which conditions it happens?

It will always fail if redirected. We can test the conditions using:
pg_ctl start  result.log
So, the comment should be:
/* WriteConsoleW always fails if stderr is redirected. */

I cleaned up the patch per comments. I hope this will be the final one ;-).

  * Use in_error_recursion_trouble() instead of own implementation.
  * Use def_enc2name() macro to avoid adding the codepage field
on non-win32 platforms.
  * Fix a bug of calculation of result length.
  * Fix a memory leak on error handling path in pgwin32_toUTF16().


 If it's always, I assume this just means that the logfile will be in
 the database encoding and not in UTF16? Is this what we want, or would
 we like the logfile to also be in UTF16? If we can convert it to
 UTF16, that would fix the case when you have different databases in
 different encodings, wouldn't it? (Even if your editor, unlike the
 console subsystem, can view the individual encoding you need, I bet it
 can't deal with multiple encodings in the same file)

Sure, the logfile will be filled with mixed encoding strings,
that could happen in logfile and syslog on non-win32 platforms.
I think UTF8 is better than UTF16 for logfile encoding because
there are some text editors that do not support wide characters.
At any rate, the logfile encoding feature will come from another patch,
that might add log_encoding variable and work on any platforms.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



eventlog_20091013.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Encoding issues in console and eventlog on win32

2009-10-10 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 2009/10/7 Itagaki Takahiro itagaki.takah...@oss.ntt.co.jp:
 Also I added the following error checks before calling pgwin32_toUTF16()
    (errordata_stack_depth  ERRORDATA_STACK_SIZE - 1)
 to avoid recursive errors, but I'm not sure it is really meaningful.
 Please remove or rewrite this part if it is not a right way.

 I'm not entirely sure either, but it looks like it could protect us
 from getting into a tight loop on an error here.. Tom (or someone else
 who knows that for sure :P),comments?

I haven't read the patch, but I'd suggest making any behavior changes
dependent on in_error_recursion_trouble(), rather than getting in bed
with internal implementation variables.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Encoding issues in console and eventlog on win32

2009-10-09 Thread Magnus Hagander
2009/10/7 Itagaki Takahiro itagaki.takah...@oss.ntt.co.jp:

 Magnus Hagander mag...@hagander.net wrote:
 Per your own comments earlier, and in the code, what will happen if
 pg_do_encoding_conversion() calls ereport()? Didn't you say we need a
 non-throwing version of it?

 We are hard to use encoding conversion functions in logging routines
 because they could throw errors if there are some unconvertable characters.
 Non-throwing version will convert such characters into '?' or escaped form
 (something like \888 or \xFF). If there where such infrastructure, we can
 support log_encoding settings and convert messages in platform-dependent
 encoding before writing to syslog or console.

Right, which we don't have at this point. That would be very useful on
unix, i believe.


 pgwin32_toUTF16() needs error checking on the API calls, and needs to
 do something reasonable if it fails.

 Now it returns NULL and caller writes messages in the original encoding.

Seems reasonable. If encoding fails, I think that's the best we can do.


 Also I added the following error checks before calling pgwin32_toUTF16()
    (errordata_stack_depth  ERRORDATA_STACK_SIZE - 1)
 to avoid recursive errors, but I'm not sure it is really meaningful.
 Please remove or rewrite this part if it is not a right way.

I'm not entirely sure either, but it looks like it could protect us
from getting into a tight loop on an error here.. Tom (or someone else
who knows that for sure :P),comments?


 The encoding_to_codepage array needs to go in encnames.c, where other
 such tables are. Perhaps it can even be integrated in pg_enc2name_tbl
 as a separate field?

 I added pg_enc2name.codepage. Note that this field is needed only
 on Windows, but now exported for all platforms. If you don't like
 the useless field, the following macro could be a help.
 #ifdef WIN32
 #define def_enc2name(name, codepage)    { #name, PG_##name, codepage }
 #else
 #define def_enc2name(name, codepage)    { #name, PG_##name }
 #endif
 pg_enc2name pg_enc2name_tbl[] =
 {
    def_enc2name(SQL_ASCII),
    def_enc2name(EUC_JP),
    ...

Yeah, I think that makes sense. It's not much data, but it's
completely unnecessary :-) I can make that change at commit.

One other question - you note that WriteConsoleW() could fail if
stderr is redirected. Are you saying that it will always fail when
stderr is redirected, or only sometimes? If ony sometimes, do you know
under which conditions it happens?

If it's always, I assume this just means that the logfile will be in
the database encoding and not in UTF16? Is this what we want, or would
we like the logfile to also be in UTF16? If we can convert it to
UTF16, that would fix the case when you have different databases in
different encodings, wouldn't it? (Even if your editor, unlike the
console subsystem, can view the individual encoding you need, I bet it
can't deal with multiple encodings in the same file)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Encoding issues in console and eventlog on win32

2009-10-06 Thread Magnus Hagander
2009/9/15 Itagaki Takahiro itagaki.takah...@oss.ntt.co.jp:

 Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

 Can't we use MultiByteToWideChar() to convert directly to the required
 encoding, avoiding the double conversion?

 Here is an updated version of the patch.
 I use direct conversion in pgwin32_toUTF16() if a corresponding codepage
 is available. If not available, I still use double conversion.

 Now pgwin32_toUTF16() is exported from mbutil.c. I used the function
 in following parts, although the main target of the patch is eventlog.

  * WriteConsoleW() - write unredirected stderr log.
  * ReportEventW()  - write evenlog.
  * CreateFileW()   - open non-ascii filename (ex. COPY TO/FROM 'mb-path').

 This approach is only available for Windows because any other platform
 don't support locale-independent and wide-character-based system calls.
 Other platforms require a different approach, but even then we'd still
 better have win32-specific routines because UTF16 is the native encoding
 in Windows.

I did a quick check of this, and here are the things I would like to
have changed:

First of all, the change to port/open.c seems to be unrelated to the
rest, and should be a separate patch, correct? I'm sure there's a
usecase for it, but it's not actually  included in the patches
description, so I assume this was a mistake?

Per your own comments earlier, and in the code, what will happen if
pg_do_encoding_conversion() calls ereport()? Didn't you say we need a
non-throwing version of it?

pgwin32_toUTF16() needs error checking on the API calls, and needs to
do something reasonable if it fails. For example, it can fail because
of out of memory error. I suggest just returning the error code in
some way in that case, and have the callers fall back to logging in
the incorrect encoding - in a lot of cases that will produce an at
least partially readable message. A second message should also be
logged saying that the conversion failed - this needs to be done
directly with the eventlog API functions and not ereport, so we don't
end up in infinite recursion.

The encoding_to_codepage array needs to go in encnames.c, where other
such tables are. Perhaps it can even be integrated in pg_enc2name_tbl
as a separate field?

I don't have the time to clean this up right now, so if you have,
please do so and resubmit. If not, I can clean it up later and apply.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Encoding issues in console and eventlog on win32

2009-10-06 Thread Itagaki Takahiro

Magnus Hagander mag...@hagander.net wrote:

 First of all, the change to port/open.c seems to be unrelated to the
 rest, and should be a separate patch, correct? I'm sure there's a
 usecase for it, but it's not actually  included in the patches
 description, so I assume this was a mistake?

It was just a demo for pgwin32_toUTF16(). I'll remove this part from
the patch, but I think we also need to fix the encoding mismatch issue
in path strings. I'll re-submit for the next commitfest.

 Per your own comments earlier, and in the code, what will happen if
 pg_do_encoding_conversion() calls ereport()? Didn't you say we need a
 non-throwing version of it?

We are hard to use encoding conversion functions in logging routines
because they could throw errors if there are some unconvertable characters.
Non-throwing version will convert such characters into '?' or escaped form
(something like \888 or \xFF). If there where such infrastructure, we can
support log_encoding settings and convert messages in platform-dependent
encoding before writing to syslog or console.

 pgwin32_toUTF16() needs error checking on the API calls, and needs to
 do something reasonable if it fails.

Now it returns NULL and caller writes messages in the original encoding.
Also I added the following error checks before calling pgwin32_toUTF16()
(errordata_stack_depth  ERRORDATA_STACK_SIZE - 1)
to avoid recursive errors, but I'm not sure it is really meaningful.
Please remove or rewrite this part if it is not a right way.

 The encoding_to_codepage array needs to go in encnames.c, where other
 such tables are. Perhaps it can even be integrated in pg_enc2name_tbl
 as a separate field?

I added pg_enc2name.codepage. Note that this field is needed only
on Windows, but now exported for all platforms. If you don't like
the useless field, the following macro could be a help.

#ifdef WIN32
#define def_enc2name(name, codepage){ #name, PG_##name, codepage }
#else
#define def_enc2name(name, codepage){ #name, PG_##name }
#endif
pg_enc2name pg_enc2name_tbl[] =
{
def_enc2name(SQL_ASCII),
def_enc2name(EUC_JP),
...

 I don't have the time to clean this up right now, so if you have,
 please do so and resubmit. If not, I can clean it up later and apply.

Patch attached.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



eventlog_20091007.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Encoding issues in console and eventlog on win32

2009-09-20 Thread Josh Williams
On Tue, 2009-09-15 at 12:49 +0900, Itagaki Takahiro wrote:
 Here is an updated version of the patch.

This is a review of the Eventlog encoding on Windows patch:
http://archives.postgresql.org/message-id/20090915123243.9c59.52131...@oss.ntt.co.jp

Purpose  Format

This patch is designed to coerce log messages to a specific encoding.
It's currently only targeted at the win32 port, where the logs are
written in UTF-16.

The patch applies cleanly.  It doesn't include any documentation updates
or additional regression tests.  A comment in the documentation that
logs on Windows will go through an encoding conversion if appropriate
might be nice, though.

Initial Run
===
To (hopefully) properly test I initdb'd a couple directories under
different locales.  I then ran a few statements designed to generate
event log messages showing characters in a different encoding:
SELECT E'\xF0'::int;

The unpatched backend generated event log message showing only the byte
value interpreted as the same character each time in the system default
encoding.

With the patch in place the event log message showed the character
correctly for each of the different encodings.

I haven't tried any performance testing against it.

Concurrent Development Issues
=
On a hunch, tried applying the syslogger infrastructure changes at the
same time.  They conflict on elog.c.  Not sure if we're supposed to
check for that, but thought I'd point it out. :)

Editorial
=
The problem seems to stem from PG and Windows each having a few
encodings the other won't understand, or at least don't immediately
support.  So log messages back to the system from its perspective
contain incorrect or broken characters.  I'm not sure this is as much of
a problem on other platforms, though, where the database encoding
typically doesn't have any trouble matching the system's; would it be
worth pursuing beyond the win32 port?

I'm not too familiar with alternate character sets...  I would assume if
there's a code page supported on win32 it'll naturally support
conversion to UTF-16 on the platform, but is there any time this could
fail?  What about the few encodings that it doesn't directly support,
which need a conversion to UTF-8 first?

Maybe someone with more familiarity with encoding conversion issues
could comment on that?  Otherwise I think this is ready to be bumped up
for committer review.

- Josh Williams



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Encoding issues in console and eventlog on win32

2009-09-14 Thread Heikki Linnakangas
Itagaki Takahiro wrote:
 We can choose different encodings from platform-dependent one
 for database, but postgres writes serverlogs in the database encoding.
 As the result, serverlogs are filled with broken characters.
 
 The problem could occur on all platforms, however, there is a solution
 for win32. Since Windows supports wide characters to write logs, we can
 convert log texts = UTF-8 = UTF-16 and pass them to WriteConsoleW()
 and ReportEventW().
 
 Especially in Japan, encoding troubles on Windows are unavoidable
 because postgres doesn't support Shift-JIS for database encoding,
 that is the native encoding for Windows Japanese edition.
 
 If we also want to support the same functionality on non-win32 platform,
 we might need non-throwable version of pg_do_encoding_conversion():
 
 log_message_to_write = pg_do_encoding_conversion_nothrow(
 log_message_in_database_encoding,
 GetDatabaseEncoding() /* as src_encoding */,
 GetPlatformEncoding() /* as dst_encoding */)
 
 and pass the result to stderr and syslog. But it requires major rewrites
 of conversion functions, so I'd like to submit a solution only for win32
 for now. Also, the issue is not so serious on non-win32 platforms because
 we can choose UTF-8 or EUC_* on those platforms.

Something like that seems reasonable for the Windows event log; that is
clearly supposed to be written using a specific encoding. With the log
files, we're more free to do what we want, and IMHO we shouldn't put a
Windows-specific hack there because as you say we have the same problem
on all platforms.

There's no guarantee that conversion to UTF-8 won't fail, so this isn't
totally risk-free on Windows either. Theoretically, MultiByteToWideChar
could fail too (the patch neglects to check for that), although I
suppose it can't really happen for UTF-8 - UTF-16 conversion.

Can't we use MultiByteToWideChar() to convert directly to the required
encoding, avoiding the double conversion?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Encoding issues in console and eventlog on win32

2009-09-14 Thread Itagaki Takahiro

Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

 Can't we use MultiByteToWideChar() to convert directly to the required
 encoding, avoiding the double conversion?

Here is an updated version of the patch.
I use direct conversion in pgwin32_toUTF16() if a corresponding codepage
is available. If not available, I still use double conversion.

Now pgwin32_toUTF16() is exported from mbutil.c. I used the function
in following parts, although the main target of the patch is eventlog.

  * WriteConsoleW() - write unredirected stderr log.
  * ReportEventW()  - write evenlog.
  * CreateFileW()   - open non-ascii filename (ex. COPY TO/FROM 'mb-path').

This approach is only available for Windows because any other platform
don't support locale-independent and wide-character-based system calls.
Other platforms require a different approach, but even then we'd still
better have win32-specific routines because UTF16 is the native encoding
in Windows.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



eventlog-20090915.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers