Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-02 Thread Tom Lane
Andrew Dunstan <and...@dunslane.net> writes:
> On 06/01/2014 05:35 PM, Tom Lane wrote:
>> I did a little bit of experimentation and determined that none of the
>> LATIN1 characters are significantly more portable than what we've got:
>> for instance a-acute fails to convert into 16 of the 33 supported
>> server-side encodings (versus 17 failures for U+0080).  However,
>> non-breaking space is significantly better: it converts into all our
>> supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
>> It seems likely that we won't do better than that except with a basic
>> ASCII character.

> Yeah, I just looked at the copyright symbol, with similar results.

I'd been hopeful about that one too, but nope :-(

> Let's just stick to ASCII.

The more I think about it, the more I think that using a plain-ASCII
character would defeat most of the purpose of the test.  Non-breaking
space seems like the best bet here, not least because it has several
different representations among the encodings we support.
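
To illustrate those differing representations concretely, here is a quick
Python sketch (the codec names are rough stand-ins for our server encoding
names, and Python has no EUC_TW codec, so treat this as approximate):

    # U+00A0 comes out as a different byte sequence in each of these:
    for enc in ('utf-8', 'latin1', 'cp1250', 'koi8_r'):
        print(enc, u'\xa0'.encode(enc))
    # utf-8  -> b'\xc2\xa0'
    # latin1 -> b'\xa0'
    # cp1250 -> b'\xa0'
    # koi8_r -> b'\x9a'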

regards, tom lane




Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-02 Thread Tom Lane
I wrote:
> Andrew Dunstan <and...@dunslane.net> writes:
>> Let's just stick to ASCII.

> The more I think about it, the more I think that using a plain-ASCII
> character would defeat most of the purpose of the test.  Non-breaking
> space seems like the best bet here, not least because it has several
> different representations among the encodings we support.

I've confirmed that a patch as attached behaves per expectation
(in particular, it passes with WIN1250 database encoding).

I think that the worry I expressed about UTF8 characters in expected-files
is probably overblown: we have such characters in the collate.linux.utf8
test, and we've not heard reports of that one breaking.  (Though of course,
it's not run by default :-(.)  It's still annoying that the test would fail
in EUC_xx encodings, but I see no way around that without largely
lobotomizing the test.

So I propose to apply this, and back-patch to 9.1 (not 9.0, because its
version of this test is different anyway --- so Tomas will have to drop
testing cs_CZ.WIN-1250 in 9.0).

regards, tom lane

diff --git a/src/pl/plpython/expected/plpython_unicode.out b/src/pl/plpython/expected/plpython_unicode.out
index 859edbb..c7546dd 100644
*** a/src/pl/plpython/expected/plpython_unicode.out
--- b/src/pl/plpython/expected/plpython_unicode.out
***************
*** 1,22 ****
  --
  -- Unicode handling
  --
  SET client_encoding TO UTF8;
  CREATE TABLE unicode_test (
  	testvalue  text NOT NULL
  );
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\x80"
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\x80"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  CREATE TRIGGER unicode_test_bi BEFORE INSERT ON unicode_test
FOR EACH ROW EXECUTE PROCEDURE unicode_trigger();
  CREATE FUNCTION unicode_plan1() RETURNS text AS E'
  plan = plpy.prepare("SELECT $1 AS testvalue", ["text"])
! rv = plpy.execute(plan, [u"\\x80"], 1)
  return rv[0]["testvalue"]
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_plan2() RETURNS text AS E'
--- 1,27 ----
  --
  -- Unicode handling
  --
+ -- Note: this test case is known to fail if the database encoding is
+ -- EUC_CN, EUC_JP, EUC_KR, or EUC_TW, for lack of any equivalent to
+ -- U+00A0 (no-break space) in those encodings.  However, testing with
+ -- plain ASCII data would be rather useless, so we must live with that.
+ --
  SET client_encoding TO UTF8;
  CREATE TABLE unicode_test (
  	testvalue  text NOT NULL
  );
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\xA0"
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\xA0"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  CREATE TRIGGER unicode_test_bi BEFORE INSERT ON unicode_test
FOR EACH ROW EXECUTE PROCEDURE unicode_trigger();
  CREATE FUNCTION unicode_plan1() RETURNS text AS E'
  plan = plpy.prepare("SELECT $1 AS testvalue", ["text"])
! rv = plpy.execute(plan, [u"\\xA0"], 1)
  return rv[0]["testvalue"]
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_plan2() RETURNS text AS E'
*************** return rv[0]["testvalue"]
*** 27,46 ****
  SELECT unicode_return();
   unicode_return 
  ----------------
!  \u0080
  (1 row)
  
  INSERT INTO unicode_test (testvalue) VALUES ('test');
  SELECT * FROM unicode_test;
   testvalue 
  -----------
!  \u0080
  (1 row)
  
  SELECT unicode_plan1();
   unicode_plan1 
  ---------------
!  \u0080
  (1 row)
  
  SELECT unicode_plan2();
--- 32,51 ----
  SELECT unicode_return();
   unicode_return 
  ----------------
!   
  (1 row)
  
  INSERT INTO unicode_test (testvalue) VALUES ('test');
  SELECT * FROM unicode_test;
   testvalue 
  -----------
!   
  (1 row)
  
  SELECT unicode_plan1();
   unicode_plan1 
  ---------------
!   
  (1 row)
  
  SELECT unicode_plan2();
diff --git a/src/pl/plpython/sql/plpython_unicode.sql b/src/pl/plpython/sql/plpython_unicode.sql
index bdd40c4..a11e5ee 100644
*** a/src/pl/plpython/sql/plpython_unicode.sql
--- b/src/pl/plpython/sql/plpython_unicode.sql
***************
*** 1,6 ****
--- 1,11 ----
  --
  -- Unicode handling
  --
+ -- Note: this test case is known to fail if the database encoding is
+ -- EUC_CN, EUC_JP, EUC_KR, or EUC_TW, for lack of any equivalent to
+ -- U+00A0 (no-break space) in those encodings.  However, testing with
+ -- plain ASCII data would be rather useless, so we must live with that.
+ --
  
  SET client_encoding TO UTF8;
  
*************** CREATE TABLE unicode_test (
*** 9,19 ****
  );
  
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\x80"
  ' LANGUAGE plpythonu;
  
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\x80"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  
--- 14,24 ----
  );
  
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\xA0"
  ' LANGUAGE plpythonu;
  
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\xA0"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  
*** 

plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-01 Thread Tom Lane
Tomas Vondra <t...@fuzzy.cz> writes:
> On 13.5.2014 20:58, Tom Lane wrote:
>> Tomas Vondra <t...@fuzzy.cz> writes:
>>> Yeah, not really what we were shooting for. I've fixed this by
>>> defining the missing locales, and indeed - magpie now fails in
>>> plpython tests.

>> I saw that earlier today (tho right now the buildfarm server seems
>> to not be responding :-().  Probably we should use some
>> more-widely-used character code in that specific test?

> Any idea what other character could be used in those tests? ISTM fixing
> this universally would mean using ASCII characters - the subset of UTF-8
> common to all the encodings. But I'm afraid that'd contradict the very
> purpose of those tests ...

We really ought to resolve this issue so that we can get rid of some of
the red in the buildfarm.  ISTM there are three possible approaches:

1. Decide that we're not going to support running the plpython regression
tests under weird server encodings, in which case Tomas should just
remove cs_CZ.WIN-1250 from the set of encodings his buildfarm animals
test.  Don't much care for this, but it has the attraction of being
minimal work.

2. Change the plpython_unicode test to use some ASCII character in place
of \u0080.  We could keep on using the \u syntax to create the character,
but as stated above, this still seems like it's losing a significant
amount of test coverage.

3. Try to select some more portable non-ASCII character, perhaps U+00A0
(non-breaking space) or U+00E1 (a-acute).  I think this would probably
work for most encodings but it might still fail in the Far East.  Another
objection is that the expected/plpython_unicode.out file would contain
that character in UTF8 form.  In principle that would work, since the test
sets client_encoding = utf8 explicitly, but I'm worried about accidental
corruption of the expected file by text editors, file transfers, etc.
(The current usage of U+0080 doesn't suffer from this risk because psql
special-cases printing of multibyte UTF8 control characters, so that we
get exactly \u0080.)
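
To make the trade-offs concrete, a small Python sketch (illustrative only;
Python's codecs stand in for our server-side conversion routines):

    # Option 2: an ASCII code point still exercises the \u escape syntax,
    # but the value is plain ASCII, so encoding conversions go untested.
    assert u'\u0041' == 'A'
    # Option 3: both candidates are two bytes in UTF-8, but psql prints
    # the C1 control character U+0080 as the escape \u0080, whereas U+00A0
    # would sit in the expected file as a literal, invisible no-break space.
    print(u'\x80'.encode('utf-8'))   # b'\xc2\x80'
    print(u'\xa0'.encode('utf-8'))   # b'\xc2\xa0'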

Thoughts?

regards, tom lane




Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-01 Thread Tom Lane
I wrote:
> 3. Try to select some more portable non-ASCII character, perhaps U+00A0
> (non-breaking space) or U+00E1 (a-acute).  I think this would probably
> work for most encodings but it might still fail in the Far East.  Another
> objection is that the expected/plpython_unicode.out file would contain
> that character in UTF8 form.  In principle that would work, since the test
> sets client_encoding = utf8 explicitly, but I'm worried about accidental
> corruption of the expected file by text editors, file transfers, etc.
> (The current usage of U+0080 doesn't suffer from this risk because psql
> special-cases printing of multibyte UTF8 control characters, so that we
> get exactly \u0080.)

I did a little bit of experimentation and determined that none of the
LATIN1 characters are significantly more portable than what we've got:
for instance a-acute fails to convert into 16 of the 33 supported
server-side encodings (versus 17 failures for U+0080).  However,
non-breaking space is significantly better: it converts into all our
supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
It seems likely that we won't do better than that except with a basic
ASCII character.
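
For reference, a rough reconstruction of that experiment in Python (the
codec list is illustrative, not the exact set of 33 server encodings --
Python lacks an EUC_TW codec, for one -- so the counts won't match exactly):

    def convertible(ch, enc):
        try:
            ch.encode(enc)
            return True
        except UnicodeEncodeError:
            return False

    encodings = ['latin1', 'latin2', 'cp1250', 'cp1251', 'koi8_r',
                 'euc_jp', 'euc_kr', 'gb2312', 'big5']
    for ch in (u'\x80', u'\xe1', u'\xa0'):   # U+0080, a-acute, NBSP
        bad = [e for e in encodings if not convertible(ch, e)]
        print(u'U+%04X fails in: %s' % (ord(ch), ', '.join(bad) or 'none'))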

In principle we could make the test pass even in these encodings
by adding variant expected files, but I doubt it's worth it.  I'd
be inclined to just add a comment to the regression test file indicating
that that's a known failure case, and move on.

regards, tom lane




Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)

2014-06-01 Thread Andrew Dunstan


On 06/01/2014 05:35 PM, Tom Lane wrote:
> I wrote:
>> 3. Try to select some more portable non-ASCII character, perhaps U+00A0
>> (non-breaking space) or U+00E1 (a-acute).  I think this would probably
>> work for most encodings but it might still fail in the Far East.  Another
>> objection is that the expected/plpython_unicode.out file would contain
>> that character in UTF8 form.  In principle that would work, since the test
>> sets client_encoding = utf8 explicitly, but I'm worried about accidental
>> corruption of the expected file by text editors, file transfers, etc.
>> (The current usage of U+0080 doesn't suffer from this risk because psql
>> special-cases printing of multibyte UTF8 control characters, so that we
>> get exactly \u0080.)

> I did a little bit of experimentation and determined that none of the
> LATIN1 characters are significantly more portable than what we've got:
> for instance a-acute fails to convert into 16 of the 33 supported
> server-side encodings (versus 17 failures for U+0080).  However,
> non-breaking space is significantly better: it converts into all our
> supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
> It seems likely that we won't do better than that except with a basic
> ASCII character.

Yeah, I just looked at the copyright symbol, with similar results.

Let's just stick to ASCII.

cheers

andrew


