Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-02 Thread Tom Lane
Andrew Dunstan <and...@dunslane.net> writes:
 On 06/01/2014 05:35 PM, Tom Lane wrote:
 I did a little bit of experimentation and determined that none of the
 LATIN1 characters are significantly more portable than what we've got:
 for instance a-acute fails to convert into 16 of the 33 supported
 server-side encodings (versus 17 failures for U+0080).  However,
 non-breaking space is significantly better: it converts into all our
 supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
 It seems likely that we won't do better than that except with a basic
 ASCII character.

 Yeah, I just looked at the copyright symbol, with similar results.

I'd been hopeful about that one too, but nope :-(

 Let's just stick to ASCII.

The more I think about it, the more I think that using a plain-ASCII
character would defeat most of the purpose of the test.  Non-breaking
space seems like the best bet here, not least because it has several
different representations among the encodings we support.
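
(As a quick illustration of those representations -- a sketch only,
assuming a UTF8 database; the byte values in the comments are what the
encodings define, not verified psql output:

    SELECT convert_to(E'\u00A0', 'UTF8');     -- \xc2a0, two bytes
    SELECT convert_to(E'\u00A0', 'LATIN1');   -- \xa0, one byte
    SELECT convert_to(E'\u00A0', 'WIN1250');  -- \xa0, one byte
    SELECT convert_to(E'\u00A0', 'EUC_JP');   -- fails: no equivalent

The EUC_xx cases fail with an untranslatable-character error, which is
exactly the failure mode discussed above.)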

regards, tom lane




Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-02 Thread Tom Lane
I wrote:
 Andrew Dunstan <and...@dunslane.net> writes:
 Let's just stick to ASCII.

 The more I think about it, the more I think that using a plain-ASCII
 character would defeat most of the purpose of the test.  Non-breaking
 space seems like the best bet here, not least because it has several
 different representations among the encodings we support.

I've confirmed that a patch as attached behaves per expectation
(in particular, it passes with WIN1250 database encoding).

I think that the worry I expressed about UTF8 characters in expected-files
is probably overblown: we have such characters in the collate.linux.utf8
test, and we've
not heard reports of that one breaking.  (Though of course, it's not run
by default :-(.)  It's still annoying that the test would fail in EUC_xx
encodings, but I see no way around that without largely lobotomizing the
test.

So I propose to apply this, and back-patch to 9.1 (not 9.0, because its
version of this test is different anyway --- so Tomas will have to drop
testing cs_CZ.WIN-1250 in 9.0).

regards, tom lane

diff --git a/src/pl/plpython/expected/plpython_unicode.out b/src/pl/plpython/expected/plpython_unicode.out
index 859edbb..c7546dd 100644
*** a/src/pl/plpython/expected/plpython_unicode.out
--- b/src/pl/plpython/expected/plpython_unicode.out
***************
*** 1,22 ****
  --
  -- Unicode handling
  --
  SET client_encoding TO UTF8;
  CREATE TABLE unicode_test (
  	testvalue  text NOT NULL
  );
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\x80"
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\x80"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  CREATE TRIGGER unicode_test_bi BEFORE INSERT ON unicode_test
FOR EACH ROW EXECUTE PROCEDURE unicode_trigger();
  CREATE FUNCTION unicode_plan1() RETURNS text AS E'
  plan = plpy.prepare("SELECT $1 AS testvalue", ["text"])
! rv = plpy.execute(plan, [u"\\x80"], 1)
  return rv[0]["testvalue"]
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_plan2() RETURNS text AS E'
--- 1,27 ----
  --
  -- Unicode handling
  --
+ -- Note: this test case is known to fail if the database encoding is
+ -- EUC_CN, EUC_JP, EUC_KR, or EUC_TW, for lack of any equivalent to
+ -- U+00A0 (no-break space) in those encodings.  However, testing with
+ -- plain ASCII data would be rather useless, so we must live with that.
+ --
  SET client_encoding TO UTF8;
  CREATE TABLE unicode_test (
  	testvalue  text NOT NULL
  );
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\xA0"
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\xA0"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  CREATE TRIGGER unicode_test_bi BEFORE INSERT ON unicode_test
FOR EACH ROW EXECUTE PROCEDURE unicode_trigger();
  CREATE FUNCTION unicode_plan1() RETURNS text AS E'
  plan = plpy.prepare("SELECT $1 AS testvalue", ["text"])
! rv = plpy.execute(plan, [u"\\xA0"], 1)
  return rv[0]["testvalue"]
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_plan2() RETURNS text AS E'
*************** return rv[0]["testvalue"]
*** 27,46 ****
  SELECT unicode_return();
   unicode_return 
  ----------------
!  \u0080
  (1 row)
  
  INSERT INTO unicode_test (testvalue) VALUES ('test');
  SELECT * FROM unicode_test;
   testvalue 
  -----------
!  \u0080
  (1 row)
  
  SELECT unicode_plan1();
   unicode_plan1 
  ---------------
!  \u0080
  (1 row)
  
  SELECT unicode_plan2();
--- 32,51 ----
  SELECT unicode_return();
   unicode_return 
  ----------------
!   
  (1 row)
  
  INSERT INTO unicode_test (testvalue) VALUES ('test');
  SELECT * FROM unicode_test;
   testvalue 
  -----------
!   
  (1 row)
  
  SELECT unicode_plan1();
   unicode_plan1 
  ---------------
!   
  (1 row)
  
  SELECT unicode_plan2();
diff --git a/src/pl/plpython/sql/plpython_unicode.sql b/src/pl/plpython/sql/plpython_unicode.sql
index bdd40c4..a11e5ee 100644
*** a/src/pl/plpython/sql/plpython_unicode.sql
--- b/src/pl/plpython/sql/plpython_unicode.sql
***
*** 1,6 ****
--- 1,11 ----
  --
  -- Unicode handling
  --
+ -- Note: this test case is known to fail if the database encoding is
+ -- EUC_CN, EUC_JP, EUC_KR, or EUC_TW, for lack of any equivalent to
+ -- U+00A0 (no-break space) in those encodings.  However, testing with
+ -- plain ASCII data would be rather useless, so we must live with that.
+ --
  
  SET client_encoding TO UTF8;
  
*************** CREATE TABLE unicode_test (
*** 9,19 ****
  );
  
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\x80"
  ' LANGUAGE plpythonu;
  
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\x80"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  
--- 14,24 ----
  );
  
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\xA0"
  ' LANGUAGE plpythonu;
  
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\xA0"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  
*** 

Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-01 Thread Tom Lane
I wrote:
 3. Try to select some more portable non-ASCII character, perhaps U+00A0
 (non-breaking space) or U+00E1 (a-acute).  I think this would probably
 work for most encodings but it might still fail in the Far East.  Another
 objection is that the expected/plpython_unicode.out file would contain
 that character in UTF8 form.  In principle that would work, since the test
 sets client_encoding = utf8 explicitly, but I'm worried about accidental
 corruption of the expected file by text editors, file transfers, etc.
 (The current usage of U+0080 doesn't suffer from this risk because psql
 special-cases printing of multibyte UTF8 control characters, so that we
 get exactly \u0080.)
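
(To see that special-casing -- a sketch, assuming a UTF8 database,
where chr() maps its argument to a Unicode code point:

    SET client_encoding TO UTF8;
    SELECT chr(128);

psql prints something like

      chr
    --------
     \u0080
    (1 row)

so the expected file contains only the ASCII characters backslash, u,
0, 8, 0 -- no raw multibyte sequence to get corrupted.)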

I did a little bit of experimentation and determined that none of the
LATIN1 characters are significantly more portable than what we've got:
for instance a-acute fails to convert into 16 of the 33 supported
server-side encodings (versus 17 failures for U+0080).  However,
non-breaking space is significantly better: it converts into all our
supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
It seems likely that we won't do better than that except with a basic
ASCII character.
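
(Anyone wanting to repeat the experiment can script a rough equivalent
server-side -- just a sketch, assuming a UTF8 database, and not
necessarily the way the numbers above were obtained:

    DO $$
    DECLARE
        enc text;
        failures int := 0;
    BEGIN
        -- every encoding that some built-in conversion can target
        FOR enc IN SELECT DISTINCT pg_encoding_to_char(contoencoding)
                   FROM pg_conversion
        LOOP
            BEGIN
                -- try to convert U+00A0 out of UTF8 into that encoding
                PERFORM convert(convert_to(E'\u00A0', 'UTF8'), 'UTF8', enc);
            EXCEPTION WHEN OTHERS THEN
                failures := failures + 1;
                RAISE NOTICE 'no equivalent in %', enc;
            END;
        END LOOP;
        RAISE NOTICE 'failed in % encodings', failures;
    END $$;

Each NOTICE flags an encoding lacking an equivalent of the character.)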

In principle we could make the test pass even in these encodings
by adding variant expected files, but I doubt it's worth it.  I'd
be inclined to just add a comment to the regression test file indicating
that that's a known failure case, and move on.
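
(For the record, the variant-file mechanism is already there: pg_regress
accepts alternative expected files named with numeric suffixes,

    expected/plpython_unicode.out      -- default output
    expected/plpython_unicode_1.out    -- hypothetical EUC_xx variant

and the test passes if the actual output matches any of them.  It's
just more upkeep than this case seems worth.)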

regards, tom lane




Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-01 Thread Andrew Dunstan


On 06/01/2014 05:35 PM, Tom Lane wrote:

I wrote:

3. Try to select some more portable non-ASCII character, perhaps U+00A0
(non-breaking space) or U+00E1 (a-acute).  I think this would probably
work for most encodings but it might still fail in the Far East.  Another
objection is that the expected/plpython_unicode.out file would contain
that character in UTF8 form.  In principle that would work, since the test
sets client_encoding = utf8 explicitly, but I'm worried about accidental
corruption of the expected file by text editors, file transfers, etc.
(The current usage of U+0080 doesn't suffer from this risk because psql
special-cases printing of multibyte UTF8 control characters, so that we
get exactly \u0080.)

I did a little bit of experimentation and determined that none of the
LATIN1 characters are significantly more portable than what we've got:
for instance a-acute fails to convert into 16 of the 33 supported
server-side encodings (versus 17 failures for U+0080).  However,
non-breaking space is significantly better: it converts into all our
supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
It seems likely that we won't do better than that except with a basic
ASCII character.



Yeah, I just looked at the copyright symbol, with similar results.

Let's just stick to ASCII.

cheers

andrew


