Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-10 Thread Heikki Linnakangas

On 04.06.2013 09:39, Martin Schäfer wrote:

Can't really blame Windows on that. On Windows, we don't require that the
encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in the
server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE implies
WIN1252 encoding. We allow that and it generally works on Windows
because in varstr_cmp, we use MultiByteToWideChar() followed by
wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE.
But for isupper(), it matters.


Does this mean that the UTF-8 messing up would disappear if the database were 
using a different locale for LC_CTYPE? If so, which locale should I use?
This would be useful for a temporary workaround.


Maybe, not sure. The logical thing to do would be to set LC_CTYPE to 
"English_United Kingdom.65001", which tell Windows to expect UTF-8 
charset. However, old discussions on this subject suggest that Windows 
won't accept that:


http://www.postgresql.org/message-id/20071015090954.gd4...@svr2.hagander.net

It's still worth a try, I think. Things might've changed since then. If 
that doesn't work, you could also try some other random codepages as a 
workaround. If you're lucky, one of them might work better, even though 
it would still be the wrong codepage for UTF-8.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-10 Thread Martin Schäfer
Thanks Andrew. I will test the next release.

Martin

> -Original Message-
> From: Andrew Dunstan [mailto:and...@dunslane.net]
> Sent: 08 June 2013 16:43
> To: Tom Lane
> Cc: Heikki Linnakangas; k...@rice.edu; Martin Schäfer; pgsql-
> hack...@postgresql.org
> Subject: Re: [HACKERS] UTF-8 encoding problem w/ libpq
> 
> 
> On 06/03/2013 02:41 PM, Andrew Dunstan wrote:
> >
> > On 06/03/2013 02:28 PM, Tom Lane wrote:
> >> . I wonder though if we couldn't just fix this code to not do
> >> anything to high-bit-set bytes in multibyte encodings.
> >
> >
> > That's exactly what I suggested back in November.
> 
> 
> This thread seems to have gone cold, so I have applied the fix I originally
> suggested along these lines to all live branches.
> 
> At least that means we won't produce junk, but we still need to work out
> how to downcase multi-byte characters.
> 
> If anyone thinks there are other places in the code that need similar
> treatment, they are welcome to find them. I have not yet found one.
> 
> 
> cheers
> 
> andrew
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-08 Thread Andrew Dunstan


On 06/03/2013 02:41 PM, Andrew Dunstan wrote:


On 06/03/2013 02:28 PM, Tom Lane wrote:
. I wonder though if we couldn't just fix this code to not do 
anything to high-bit-set bytes in multibyte encodings.



That's exactly what I suggested back in November.



This thread seems to have gone cold, so I have applied the fix I 
originally suggested along these lines to all live branches.


At least that means we won't produce junk, but we still need to work out 
how to downcase multi-byte characters.


If anyone thinks there are other places in the code that need similar 
treatment, they are welcome to find them. I have not yet found one.



cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-03 Thread Martin Schäfer
> Can't really blame Windows on that. On Windows, we don't require that the
> encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in the
> server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE implies
> WIN1252 encoding. We allow that and it generally works on Windows
> because in varstr_cmp, we use MultiByteToWideChar() followed by
> wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE.
> But for isupper(), it matters.

Does this mean that the UTF-8 messing up would disappear if the database were 
using a different locale for LC_CTYPE? If so, which locale should I use?
This would be useful for a temporary workaround.

> > We talked about this before and went off into the weeds about whether
> > it was sensible to try to use towlower() and whether that wouldn't
> > create undesirably platform-sensitive results.  I wonder though if we
> > couldn't just fix this code to not do anything to high-bit-set bytes
> > in multibyte encodings.
> 
> Yeah, we should do that. It makes no sense to call isupper or tolower on
> bytes belonging to multi-byte characters.

Actually, I would expect that 'create table HÄUSER (...)' would create a table 
named 'häuser', and not a table named 'hÄuser', so towlower seems the right 
choice IMHO.

Martin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-03 Thread Heikki Linnakangas

On 03.06.2013 21:28, Tom Lane wrote:

Heikki Linnakangas  writes:

He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the
backend is supposed to leave bytes with the high-bit set alone, ie. in
UTF-8 encoding, it's supposed to leave ä and ß alone.


Well, actually, downcase_truncate_identifier() is doing this:

unsigned char ch = (unsigned char) ident[i];

if (ch>= 'A'&&  ch<= 'Z')
ch += 'a' - 'A';
else if (IS_HIGHBIT_SET(ch)&&  isupper(ch))
ch = tolower(ch);

There's basically no way that that second case can give pleasant results
in a multibyte encoding, other than by not doing anything.


Hmph, I see.


I suspect
that Windows' libc has fewer defenses than other implementations and
performs some transformation that we don't get elsewhere.  This may also
explain the gripe yesterday in -general about funny results in OS X.


Can't really blame Windows on that. On Windows, we don't require that 
the encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in 
the server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE 
implies WIN1252 encoding. We allow that and it generally works on 
Windows because in varstr_cmp, we use MultiByteToWideChar() followed by 
wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE. 
But for isupper(), it matters.



We talked about this before and went off into the weeds about whether
it was sensible to try to use towlower() and whether that wouldn't
create undesirably platform-sensitive results.  I wonder though if we
couldn't just fix this code to not do anything to high-bit-set bytes
in multibyte encodings.


Yeah, we should do that. It makes no sense to call isupper or tolower on 
bytes belonging to multi-byte characters.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-03 Thread Andrew Dunstan


On 06/03/2013 02:28 PM, Tom Lane wrote:
. I wonder though if we couldn't just fix this code to not do anything 
to high-bit-set bytes in multibyte encodings.



That's exactly what I suggested back in November.

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-03 Thread Tom Lane
Heikki Linnakangas  writes:
> He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the 
> backend is supposed to leave bytes with the high-bit set alone, ie. in 
> UTF-8 encoding, it's supposed to leave ä and ß alone.

Well, actually, downcase_truncate_identifier() is doing this:

unsigned char ch = (unsigned char) ident[i];

if (ch >= 'A' && ch <= 'Z')
ch += 'a' - 'A';
else if (IS_HIGHBIT_SET(ch) && isupper(ch))
ch = tolower(ch);

There's basically no way that that second case can give pleasant results
in a multibyte encoding, other than by not doing anything.  I suspect
that Windows' libc has fewer defenses than other implementations and
performs some transformation that we don't get elsewhere.  This may also
explain the gripe yesterday in -general about funny results in OS X.

We talked about this before and went off into the weeds about whether
it was sensible to try to use towlower() and whether that wouldn't
create undesirably platform-sensitive results.  I wonder though if we
couldn't just fix this code to not do anything to high-bit-set bytes
in multibyte encodings.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-03 Thread Andrew Dunstan


On 06/03/2013 12:22 PM, Heikki Linnakangas wrote:

On 03.06.2013 18:27, k...@rice.edu wrote:

On Mon, Jun 03, 2013 at 04:09:29PM +0100, Martin Schäfer wrote:


If I change the strCreate query and add double quotes around the 
column

name, then the problem disappears. But the original name is already in
lowercase, so I think it should also work without quoting the 
column name.
Am I missing some setup in either the database or in the use of 
libpq?


I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit

The database uses:
ENCODING = 'UTF8'
LC_COLLATE = 'English_United Kingdom.1252'
LC_CTYPE = 'English_United Kingdom.1252'

Thanks for any help,

Martin



Hi Martin,

If you do not want the lowercase behavior, you must put double-quotes
around the column name per the documentation:

http://www.postgresql.org/docs/9.2/interactive/sql-syntax-
lexical.html#SQL-SYNTAX-IDENTIFIERS

section 4.1.1.

Regards,
Ken


The original name 'id_äß' is already in lowercase. The backend 
should leave it unchanged IMO.


Only in utf-8 which needs to be double-quoted for a column name as 
you have

seen, otherwise the value will be lowercased per byte.


He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the 
backend is supposed to leave bytes with the high-bit set alone, ie. in 
UTF-8 encoding, it's supposed to leave ä and ß alone.


I suspect that the conversion to UTF-8, before the string is sent to 
the server, is not being done correctly. I'm not sure what's wrong 
there, but I'd suggest printing the actual byte sequence sent to the 
server, to check if it's in fact valid UTF-8. ie. replace the PQexec() 
line with something like:


const char *s = ToUtf8(strCreate.c_str()).c_str();
int i;
for (i=0; s[i]; i++)
  printf("%02x", (unsigned char) s[i]);
printf("\n");
pResult = PQexec(pConn, s);

That should contain the UTF-8 byte sequence for äß, "c3a4c39f"





Umm, no, the backend code doesn't do it right. Some time ago I suggested 
a fix for this - see 
. 
Tom thought there might be other places that need fixing, and I haven't 
had time to look for them. But maybe we should just fix this one for now 
at least.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-03 Thread Heikki Linnakangas

On 03.06.2013 18:27, k...@rice.edu wrote:

On Mon, Jun 03, 2013 at 04:09:29PM +0100, Martin Schäfer wrote:



If I change the strCreate query and add double quotes around the column

name, then the problem disappears. But the original name is already in
lowercase, so I think it should also work without quoting the column name.

Am I missing some setup in either the database or in the use of libpq?

I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit

The database uses:
ENCODING = 'UTF8'
LC_COLLATE = 'English_United Kingdom.1252'
LC_CTYPE = 'English_United Kingdom.1252'

Thanks for any help,

Martin



Hi Martin,

If you do not want the lowercase behavior, you must put double-quotes
around the column name per the documentation:

http://www.postgresql.org/docs/9.2/interactive/sql-syntax-
lexical.html#SQL-SYNTAX-IDENTIFIERS

section 4.1.1.

Regards,
Ken


The original name 'id_äß' is already in lowercase. The backend should leave it 
unchanged IMO.


Only in utf-8 which needs to be double-quoted for a column name as you have
seen, otherwise the value will be lowercased per byte.


He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the 
backend is supposed to leave bytes with the high-bit set alone, ie. in 
UTF-8 encoding, it's supposed to leave ä and ß alone.


I suspect that the conversion to UTF-8, before the string is sent to the 
server, is not being done correctly. I'm not sure what's wrong there, 
but I'd suggest printing the actual byte sequence sent to the server, to 
check if it's in fact valid UTF-8. ie. replace the PQexec() line with 
something like:


const char *s = ToUtf8(strCreate.c_str()).c_str();
int i;
for (i=0; s[i]; i++)
  printf("%02x", (unsigned char) s[i]);
printf("\n");
pResult = PQexec(pConn, s);

That should contain the UTF-8 byte sequence for äß, "c3a4c39f"

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-03 Thread k...@rice.edu
On Mon, Jun 03, 2013 at 04:09:29PM +0100, Martin Schäfer wrote:
> 
> > > If I change the strCreate query and add double quotes around the column
> > name, then the problem disappears. But the original name is already in
> > lowercase, so I think it should also work without quoting the column name.
> > > Am I missing some setup in either the database or in the use of libpq?
> > >
> > > I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
> > >
> > > The database uses:
> > > ENCODING = 'UTF8'
> > > LC_COLLATE = 'English_United Kingdom.1252'
> > > LC_CTYPE = 'English_United Kingdom.1252'
> > >
> > > Thanks for any help,
> > >
> > > Martin
> > >
> > 
> > Hi Martin,
> > 
> > If you do not want the lowercase behavior, you must put double-quotes
> > around the column name per the documentation:
> > 
> > http://www.postgresql.org/docs/9.2/interactive/sql-syntax-
> > lexical.html#SQL-SYNTAX-IDENTIFIERS
> > 
> > section 4.1.1.
> > 
> > Regards,
> > Ken
> 
> The original name 'id_äß' is already in lowercase. The backend should leave 
> it unchanged IMO.
> 
> Regards,
> Martin
> 

Only in utf-8 which needs to be double-quoted for a column name as you have
seen, otherwise the value will be lowercased per byte.

Regards,
Ken


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-03 Thread Martin Schäfer


> -Original Message-
> From: k...@rice.edu [mailto:k...@rice.edu]
> Sent: 03 June 2013 16:48
> To: Martin Schäfer
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] UTF-8 encoding problem w/ libpq
> 
> On Mon, Jun 03, 2013 at 03:40:14PM +0100, Martin Schäfer wrote:
> > I try to create database columns with umlauts, using the UTF8 client
> encoding. However, the server seems to mess up the column names. In
> particular, it seems to perform a lowercase operation on each byte of the
> UTF-8 multi-byte sequence.
> >
> > Here is my code:
> >
> > const wchar_t *strName = L"id_äß";
> > wstring strCreate = wstring(L"create table test_umlaut(") +
> > strName + L" integer primary key)";
> >
> > PGconn *pConn = PQsetdbLogin("", "", NULL, NULL, "dev503", "postgres",
> "**");
> > if (!pConn) FAIL;
> > if (PQsetClientEncoding(pConn, "UTF-8")) FAIL;
> >
> > PGresult *pResult = PQexec(pConn, "drop table test_umlaut");
> > if (pResult) PQclear(pResult);
> >
> > pResult = PQexec(pConn, ToUtf8(strCreate.c_str()).c_str());
> > if (pResult) PQclear(pResult);
> >
> > pResult = PQexec(pConn, "select * from test_umlaut");
> > if (!pResult) FAIL;
> > if (PQresultStatus(pResult)!=PGRES_TUPLES_OK) FAIL;
> > if (PQnfields(pResult)!=1) FAIL;
> > const char *fName = PQfname(pResult,0);
> >
> > ShowW("Name: ", strName);
> > ShowA("in UTF8:  ", ToUtf8(strName).c_str());
> > ShowA("from DB:  ", fName);
> > ShowW("in UTF16: ", ToWide(fName).c_str());
> >
> > PQclear(pResult);
> > PQreset(pConn);
> >
> > (ShowA/W call OutputDebugStringA/W, and ToUtf8/ToWide use
> > WideCharToMultiByte/MultiByteToWideChar with CP_UTF8.)
> >
> > And this is the output generated:
> >
> > Name: id_äß
> > in UTF8:  id_äß
> > from DB:  id_ã¤ãÿ
> > in UTF16: id_???
> >
> > It seems like the backend thinks the name is in ANSI encoding, not in UTF-8.
> > If I change the strCreate query and add double quotes around the column
> name, then the problem disappears. But the original name is already in
> lowercase, so I think it should also work without quoting the column name.
> > Am I missing some setup in either the database or in the use of libpq?
> >
> > I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
> >
> > The database uses:
> > ENCODING = 'UTF8'
> > LC_COLLATE = 'English_United Kingdom.1252'
> > LC_CTYPE = 'English_United Kingdom.1252'
> >
> > Thanks for any help,
> >
> > Martin
> >
> 
> Hi Martin,
> 
> If you do not want the lowercase behavior, you must put double-quotes
> around the column name per the documentation:
> 
> http://www.postgresql.org/docs/9.2/interactive/sql-syntax-
> lexical.html#SQL-SYNTAX-IDENTIFIERS
> 
> section 4.1.1.
> 
> Regards,
> Ken

The original name 'id_äß' is already in lowercase. The backend should leave it 
unchanged IMO.

Regards,
Martin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF-8 encoding problem w/ libpq

2013-06-03 Thread k...@rice.edu
On Mon, Jun 03, 2013 at 03:40:14PM +0100, Martin Schäfer wrote:
> I try to create database columns with umlauts, using the UTF8 client 
> encoding. However, the server seems to mess up the column names. In 
> particular, it seems to perform a lowercase operation on each byte of the 
> UTF-8 multi-byte sequence.
> 
> Here is my code:
> 
> const wchar_t *strName = L"id_äß";
> wstring strCreate = wstring(L"create table test_umlaut(") + strName + L" 
> integer primary key)";
> 
> PGconn *pConn = PQsetdbLogin("", "", NULL, NULL, "dev503", "postgres", 
> "**");
> if (!pConn) FAIL;
> if (PQsetClientEncoding(pConn, "UTF-8")) FAIL;
> 
> PGresult *pResult = PQexec(pConn, "drop table test_umlaut");
> if (pResult) PQclear(pResult);
> 
> pResult = PQexec(pConn, ToUtf8(strCreate.c_str()).c_str());
> if (pResult) PQclear(pResult);
> 
> pResult = PQexec(pConn, "select * from test_umlaut");
> if (!pResult) FAIL;
> if (PQresultStatus(pResult)!=PGRES_TUPLES_OK) FAIL;
> if (PQnfields(pResult)!=1) FAIL;
> const char *fName = PQfname(pResult,0);
> 
> ShowW("Name: ", strName);
> ShowA("in UTF8:  ", ToUtf8(strName).c_str());
> ShowA("from DB:  ", fName);
> ShowW("in UTF16: ", ToWide(fName).c_str());
> 
> PQclear(pResult);
> PQreset(pConn);
> 
> (ShowA/W call OutputDebugStringA/W, and ToUtf8/ToWide use 
> WideCharToMultiByte/MultiByteToWideChar with CP_UTF8.)
> 
> And this is the output generated:
> 
> Name: id_äß
> in UTF8:  id_äß
> from DB:  id_ã¤ãÿ
> in UTF16: id_???
> 
> It seems like the backend thinks the name is in ANSI encoding, not in UTF-8.
> If I change the strCreate query and add double quotes around the column name, 
> then the problem disappears. But the original name is already in lowercase, 
> so I think it should also work without quoting the column name.
> Am I missing some setup in either the database or in the use of libpq?
> 
> I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
> 
> The database uses:
> ENCODING = 'UTF8'
> LC_COLLATE = 'English_United Kingdom.1252'
> LC_CTYPE = 'English_United Kingdom.1252'
> 
> Thanks for any help,
> 
> Martin
> 

Hi Martin,

If you do not want the lowercase behavior, you must put double-quotes around the
column name per the documentation:

http://www.postgresql.org/docs/9.2/interactive/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS

section 4.1.1.

Regards,
Ken


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] UTF-8 encoding problem w/ libpq

2013-06-03 Thread Martin Schäfer
I try to create database columns with umlauts, using the UTF8 client encoding. 
However, the server seems to mess up the column names. In particular, it seems 
to perform a lowercase operation on each byte of the UTF-8 multi-byte sequence.

Here is my code:

const wchar_t *strName = L"id_äß";
wstring strCreate = wstring(L"create table test_umlaut(") + strName + L" 
integer primary key)";

PGconn *pConn = PQsetdbLogin("", "", NULL, NULL, "dev503", "postgres", 
"**");
if (!pConn) FAIL;
if (PQsetClientEncoding(pConn, "UTF-8")) FAIL;

PGresult *pResult = PQexec(pConn, "drop table test_umlaut");
if (pResult) PQclear(pResult);

pResult = PQexec(pConn, ToUtf8(strCreate.c_str()).c_str());
if (pResult) PQclear(pResult);

pResult = PQexec(pConn, "select * from test_umlaut");
if (!pResult) FAIL;
if (PQresultStatus(pResult)!=PGRES_TUPLES_OK) FAIL;
if (PQnfields(pResult)!=1) FAIL;
const char *fName = PQfname(pResult,0);

ShowW("Name: ", strName);
ShowA("in UTF8:  ", ToUtf8(strName).c_str());
ShowA("from DB:  ", fName);
ShowW("in UTF16: ", ToWide(fName).c_str());

PQclear(pResult);
PQreset(pConn);

(ShowA/W call OutputDebugStringA/W, and ToUtf8/ToWide use 
WideCharToMultiByte/MultiByteToWideChar with CP_UTF8.)

And this is the output generated:

Name: id_äß
in UTF8:  id_äß
from DB:  id_ã¤ãÿ
in UTF16: id_???

It seems like the backend thinks the name is in ANSI encoding, not in UTF-8.
If I change the strCreate query and add double quotes around the column name, 
then the problem disappears. But the original name is already in lowercase, so 
I think it should also work without quoting the column name.
Am I missing some setup in either the database or in the use of libpq?

I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit

The database uses:
ENCODING = 'UTF8'
LC_COLLATE = 'English_United Kingdom.1252'
LC_CTYPE = 'English_United Kingdom.1252'

Thanks for any help,

Martin