Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-02 Thread Tom Lane
I wrote:
> Andrew Dunstan  writes:
>> Let's just stick to ASCII.

> The more I think about it, the more I think that using a plain-ASCII
> character would defeat most of the purpose of the test.  Non-breaking
> space seems like the best bet here, not least because it has several
> different representations among the encodings we support.

I've confirmed that a patch as attached behaves per expectation
(in particular, it passes with WIN1250 database encoding).
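
For illustration, the difference is visible even from Python's codec
tables (only an approximation of the server's conversion routines, so
treat this as a sketch rather than a statement about the backend):

    # U+00A0 has a cp1250 (WIN1250) equivalent, byte 0xA0, while U+0080
    # has no mapping there -- mirroring why the unpatched test failed
    # with a WIN1250 database.
    for ch in (u"\u0080", u"\u00a0"):
        try:
            print("%r -> %r" % (ch, ch.encode("cp1250")))
        except UnicodeEncodeError:
            print("%r -> no mapping in cp1250" % (ch,))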

I think that the worry I expressed about UTF8 characters in expected-files
is probably overblown: we have such characters in the collate.linux.utf8
test, and we've not heard reports of that one breaking.  (Though of course,
it's not run by default :-(.)  It's still annoying that the test would fail
in EUC_xx encodings, but I see no way around that without largely
lobotomizing the test.

So I propose to apply this, and back-patch to 9.1 (not 9.0, because its
version of this test is different anyway --- so Tomas will have to drop
testing cs_CZ.WIN-1250 in 9.0).

regards, tom lane

diff --git a/src/pl/plpython/expected/plpython_unicode.out b/src/pl/plpython/expected/plpython_unicode.out
index 859edbb..c7546dd 100644
*** a/src/pl/plpython/expected/plpython_unicode.out
--- b/src/pl/plpython/expected/plpython_unicode.out
***************
*** 1,22 ****
  --
  -- Unicode handling
  --
  SET client_encoding TO UTF8;
  CREATE TABLE unicode_test (
  	testvalue  text NOT NULL
  );
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\x80"
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\x80"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  CREATE TRIGGER unicode_test_bi BEFORE INSERT ON unicode_test
    FOR EACH ROW EXECUTE PROCEDURE unicode_trigger();
  CREATE FUNCTION unicode_plan1() RETURNS text AS E'
  plan = plpy.prepare("SELECT $1 AS testvalue", ["text"])
! rv = plpy.execute(plan, [u"\\x80"], 1)
  return rv[0]["testvalue"]
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_plan2() RETURNS text AS E'
--- 1,27 ----
  --
  -- Unicode handling
  --
+ -- Note: this test case is known to fail if the database encoding is
+ -- EUC_CN, EUC_JP, EUC_KR, or EUC_TW, for lack of any equivalent to
+ -- U+00A0 (no-break space) in those encodings.  However, testing with
+ -- plain ASCII data would be rather useless, so we must live with that.
+ --
  SET client_encoding TO UTF8;
  CREATE TABLE unicode_test (
  	testvalue  text NOT NULL
  );
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\xA0"
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\xA0"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  CREATE TRIGGER unicode_test_bi BEFORE INSERT ON unicode_test
    FOR EACH ROW EXECUTE PROCEDURE unicode_trigger();
  CREATE FUNCTION unicode_plan1() RETURNS text AS E'
  plan = plpy.prepare("SELECT $1 AS testvalue", ["text"])
! rv = plpy.execute(plan, [u"\\xA0"], 1)
  return rv[0]["testvalue"]
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_plan2() RETURNS text AS E'
*************** return rv[0]["testvalue"]
*** 27,46 ****
  SELECT unicode_return();
   unicode_return 
  ----------------
!  \u0080
  (1 row)
  
  INSERT INTO unicode_test (testvalue) VALUES ('test');
  SELECT * FROM unicode_test;
   testvalue 
  -----------
!  \u0080
  (1 row)
  
  SELECT unicode_plan1();
   unicode_plan1 
  ---------------
!  \u0080
  (1 row)
  
  SELECT unicode_plan2();
--- 32,51 ----
  SELECT unicode_return();
   unicode_return 
  ----------------
!   
  (1 row)
  
  INSERT INTO unicode_test (testvalue) VALUES ('test');
  SELECT * FROM unicode_test;
   testvalue 
  -----------
!   
  (1 row)
  
  SELECT unicode_plan1();
   unicode_plan1 
  ---------------
!   
  (1 row)
  
  SELECT unicode_plan2();
diff --git a/src/pl/plpython/sql/plpython_unicode.sql b/src/pl/plpython/sql/plpython_unicode.sql
index bdd40c4..a11e5ee 100644
*** a/src/pl/plpython/sql/plpython_unicode.sql
--- b/src/pl/plpython/sql/plpython_unicode.sql
***************
*** 1,6 ****
--- 1,11 ----
  --
  -- Unicode handling
  --
+ -- Note: this test case is known to fail if the database encoding is
+ -- EUC_CN, EUC_JP, EUC_KR, or EUC_TW, for lack of any equivalent to
+ -- U+00A0 (no-break space) in those encodings.  However, testing with
+ -- plain ASCII data would be rather useless, so we must live with that.
+ --
  
  SET client_encoding TO UTF8;
  
*************** CREATE TABLE unicode_test (
*** 9,19 ****
  );
  
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\x80"
  ' LANGUAGE plpythonu;
  
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\x80"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  
--- 14,24 ----
  );
  
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\xA0"
  ' LANGUAGE plpythonu;
  
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\xA0"
  return "MODIFY"
  ' LANGUAGE plpythonu;

Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-02 Thread Tom Lane
Andrew Dunstan  writes:
> On 06/01/2014 05:35 PM, Tom Lane wrote:
>> I did a little bit of experimentation and determined that none of the
>> LATIN1 characters are significantly more portable than what we've got:
>> for instance a-acute fails to convert into 16 of the 33 supported
>> server-side encodings (versus 17 failures for U+0080).  However,
>> non-breaking space is significantly better: it converts into all our
>> supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
>> It seems likely that we won't do better than that except with a basic
>> ASCII character.

> Yeah, I just looked at the copyright symbol, with similar results.

I'd been hopeful about that one too, but nope :-(

> Let's just stick to ASCII.

The more I think about it, the more I think that using a plain-ASCII
character would defeat most of the purpose of the test.  Non-breaking
space seems like the best bet here, not least because it has several
different representations among the encodings we support.
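
As a rough illustration of those different representations (using
Python's codec names as stand-ins for the corresponding server
encodings, so this is only a sketch):

    # U+00A0 is two bytes in UTF-8, a single 0xA0 byte in the LATIN/WIN
    # single-byte encodings, four bytes in GB18030, and missing
    # entirely from the EUC family.
    for enc in ("utf-8", "latin1", "cp1250", "gb18030", "euc_jp"):
        try:
            print("%-8s %r" % (enc, u"\u00a0".encode(enc)))
        except UnicodeEncodeError:
            print("%-8s no equivalent" % enc)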

regards, tom lane




Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-01 Thread Andrew Dunstan


On 06/01/2014 05:35 PM, Tom Lane wrote:

> I wrote:
>> 3. Try to select some "more portable" non-ASCII character, perhaps U+00A0
>> (non breaking space) or U+00E1 (a-acute).  I think this would probably
>> work for most encodings but it might still fail in the Far East.  Another
>> objection is that the expected/plpython_unicode.out file would contain
>> that character in UTF8 form.  In principle that would work, since the test
>> sets client_encoding = utf8 explicitly, but I'm worried about accidental
>> corruption of the expected file by text editors, file transfers, etc.
>> (The current usage of U+0080 doesn't suffer from this risk because psql
>> special-cases printing of multibyte UTF8 control characters, so that we
>> get exactly "\u0080".)
>
> I did a little bit of experimentation and determined that none of the
> LATIN1 characters are significantly more portable than what we've got:
> for instance a-acute fails to convert into 16 of the 33 supported
> server-side encodings (versus 17 failures for U+0080).  However,
> non-breaking space is significantly better: it converts into all our
> supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
> It seems likely that we won't do better than that except with a basic
> ASCII character.



Yeah, I just looked at the copyright symbol, with similar results.

Let's just stick to ASCII.

cheers

andrew





Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-01 Thread Tom Lane
I wrote:
> 3. Try to select some "more portable" non-ASCII character, perhaps U+00A0
> (non breaking space) or U+00E1 (a-acute).  I think this would probably
> work for most encodings but it might still fail in the Far East.  Another
> objection is that the expected/plpython_unicode.out file would contain
> that character in UTF8 form.  In principle that would work, since the test
> sets client_encoding = utf8 explicitly, but I'm worried about accidental
> corruption of the expected file by text editors, file transfers, etc.
> (The current usage of U+0080 doesn't suffer from this risk because psql
> special-cases printing of multibyte UTF8 control characters, so that we
> get exactly "\u0080".)
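
That last point is easy to check from Python's side: U+0080 is a C1
control character, so although it occupies two bytes in UTF-8, there is
no printable glyph for it, which is why psql falls back to the escape
form.  A quick sketch, nothing PostgreSQL-specific:

    # U+0080 takes two bytes in UTF-8 yet is a control character with
    # no glyph; U+00A0 is an ordinary space-class character.
    import unicodedata
    print(repr(u"\u0080".encode("utf-8")))    # b'\xc2\x80' on Python 3
    print(unicodedata.category(u"\u0080"))    # Cc (control)
    print(unicodedata.category(u"\u00a0"))    # Zs (space separator)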

I did a little bit of experimentation and determined that none of the
LATIN1 characters are significantly more portable than what we've got:
for instance a-acute fails to convert into 16 of the 33 supported
server-side encodings (versus 17 failures for U+0080).  However,
non-breaking space is significantly better: it converts into all our
supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
It seems likely that we won't do better than that except with a basic
ASCII character.
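
The experiment is easy to approximate outside the backend, for anyone
who wants to try other candidate characters.  Python's codec tables
only roughly match the server-side conversions, so the failure counts
won't line up exactly with the 16-versus-17 figures above, but the
pattern is the same:

    # Count which sample encodings reject each candidate character.
    # Python codec names stand in for PostgreSQL server encodings here;
    # coverage differs, so treat the output as indicative only.
    CANDIDATES = [(u"\u0080", "U+0080"),
                  (u"\u00e1", "U+00E1 a-acute"),
                  (u"\u00a0", "U+00A0 no-break space")]
    ENCODINGS = ["latin1", "latin2", "cp1250", "cp1251", "cp1252",
                 "koi8-r", "euc_jp", "euc_kr", "gb2312", "big5"]
    for ch, name in CANDIDATES:
        failures = []
        for enc in ENCODINGS:
            try:
                ch.encode(enc)
            except UnicodeEncodeError:
                failures.append(enc)
        print("%-22s fails in: %s" % (name, ", ".join(failures) or "none"))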

In principle we could make the test "pass" even in these encodings
by adding variant expected files, but I doubt it's worth it.  I'd
be inclined to just add a comment to the regression test file indicating
that that's a known failure case, and move on.

regards, tom lane

