Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)
I wrote:
> Andrew Dunstan writes:
>> Let's just stick to ASCII.

> The more I think about it, the more I think that using a plain-ASCII
> character would defeat most of the purpose of the test.  Non-breaking
> space seems like the best bet here, not least because it has several
> different representations among the encodings we support.

I've confirmed that a patch as attached behaves per expectation (in
particular, it passes with WIN1250 database encoding).

I think that the worry I expressed about UTF8 characters in expected-files
is probably overblown: we have such in the collate.linux.utf8 test, and
we've not heard reports of that one breaking.  (Though of course, it's not
run by default :-(.)

It's still annoying that the test would fail in EUC_xx encodings, but I
see no way around that without largely lobotomizing the test.

So I propose to apply this, and back-patch to 9.1 (not 9.0, because its
version of this test is different anyway --- so Tomas will have to drop
testing cs_CZ.WIN-1250 in 9.0).

			regards, tom lane

diff --git a/src/pl/plpython/expected/plpython_unicode.out b/src/pl/plpython/expected/plpython_unicode.out
index 859edbb..c7546dd 100644
*** a/src/pl/plpython/expected/plpython_unicode.out
--- b/src/pl/plpython/expected/plpython_unicode.out
***************
*** 1,22 ****
  --
  -- Unicode handling
  --
  SET client_encoding TO UTF8;
  CREATE TABLE unicode_test (
  	testvalue  text NOT NULL
  );
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\x80"
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\x80"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  CREATE TRIGGER unicode_test_bi BEFORE INSERT ON unicode_test
    FOR EACH ROW EXECUTE PROCEDURE unicode_trigger();
  CREATE FUNCTION unicode_plan1() RETURNS text AS E'
  plan = plpy.prepare("SELECT $1 AS testvalue", ["text"])
! rv = plpy.execute(plan, [u"\\x80"], 1)
  return rv[0]["testvalue"]
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_plan2() RETURNS text AS E'
--- 1,27 ----
  --
  -- Unicode handling
  --
+ -- Note: this test case is known to fail if the database encoding is
+ -- EUC_CN, EUC_JP, EUC_KR, or EUC_TW, for lack of any equivalent to
+ -- U+00A0 (no-break space) in those encodings.  However, testing with
+ -- plain ASCII data would be rather useless, so we must live with that.
+ --
  SET client_encoding TO UTF8;
  CREATE TABLE unicode_test (
  	testvalue  text NOT NULL
  );
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\xA0"
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\xA0"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  CREATE TRIGGER unicode_test_bi BEFORE INSERT ON unicode_test
    FOR EACH ROW EXECUTE PROCEDURE unicode_trigger();
  CREATE FUNCTION unicode_plan1() RETURNS text AS E'
  plan = plpy.prepare("SELECT $1 AS testvalue", ["text"])
! rv = plpy.execute(plan, [u"\\xA0"], 1)
  return rv[0]["testvalue"]
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_plan2() RETURNS text AS E'
*************** return rv[0]["testvalue"]
*** 27,46 ****
  SELECT unicode_return();
   unicode_return
  ----------------
!  \u0080
  (1 row)

  INSERT INTO unicode_test (testvalue) VALUES ('test');
  SELECT * FROM unicode_test;
   testvalue
  -----------
!  \u0080
  (1 row)

  SELECT unicode_plan1();
   unicode_plan1
  ---------------
!  \u0080
  (1 row)

  SELECT unicode_plan2();
--- 32,51 ----
  SELECT unicode_return();
   unicode_return
  ----------------
!   
  (1 row)

  INSERT INTO unicode_test (testvalue) VALUES ('test');
  SELECT * FROM unicode_test;
   testvalue
  -----------
!   
  (1 row)

  SELECT unicode_plan1();
   unicode_plan1
  ---------------
!   
  (1 row)

  SELECT unicode_plan2();
diff --git a/src/pl/plpython/sql/plpython_unicode.sql b/src/pl/plpython/sql/plpython_unicode.sql
index bdd40c4..a11e5ee 100644
*** a/src/pl/plpython/sql/plpython_unicode.sql
--- b/src/pl/plpython/sql/plpython_unicode.sql
***************
*** 1,6 ****
--- 1,11 ----
  --
  -- Unicode handling
  --
+ -- Note: this test case is known to fail if the database encoding is
+ -- EUC_CN, EUC_JP, EUC_KR, or EUC_TW, for lack of any equivalent to
+ -- U+00A0 (no-break space) in those encodings.  However, testing with
+ -- plain ASCII data would be rather useless, so we must live with that.
+ --
  SET client_encoding TO UTF8;
*************** CREATE TABLE unicode_test (
*** 9,19 ****
  );

  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\x80"
  ' LANGUAGE plpythonu;

  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\x80"
  return "MODIFY"
  ' LANGUAGE plpythonu;
--- 14,24 ----
  );

  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\xA0"
  ' LANGUAGE plpythonu;

  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\xA0"
  return "MODIFY"
  '
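[The distinction underlying the patch — why the old expected file showed the escape \u0080 while the new one contains a literal character — can be sketched in plain Python. This is an illustration of the Unicode categories involved, not PostgreSQL's own code: psql special-cases C1 control characters such as U+0080 when printing, whereas U+00A0 is an ordinary graphic character and is emitted as raw UTF-8 bytes.]

```python
import unicodedata

# U+0080 is a C1 control character (category Cc): psql prints it as the
# escape \u0080, so the old expected file stayed pure ASCII.
# U+00A0 (no-break space) is a space separator (category Zs): psql emits
# it literally, so the new expected file contains real UTF-8 bytes.
for ch in ("\u0080", "\u00a0"):
    print("U+%04X" % ord(ch),
          unicodedata.category(ch),
          ch.encode("utf-8"))
```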
Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)
Andrew Dunstan writes:
> On 06/01/2014 05:35 PM, Tom Lane wrote:
>> I did a little bit of experimentation and determined that none of the
>> LATIN1 characters are significantly more portable than what we've got:
>> for instance a-acute fails to convert into 16 of the 33 supported
>> server-side encodings (versus 17 failures for U+0080).  However,
>> non-breaking space is significantly better: it converts into all our
>> supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
>> It seems likely that we won't do better than that except with a basic
>> ASCII character.

> Yeah, I just looked at the copyright symbol, with similar results.

I'd been hopeful about that one too, but nope :-(

> Let's just stick to ASCII.

The more I think about it, the more I think that using a plain-ASCII
character would defeat most of the purpose of the test.  Non-breaking
space seems like the best bet here, not least because it has several
different representations among the encodings we support.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
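[Tom's point that no-break space "has several different representations among the encodings we support" is easy to see with Python's codecs standing in for the corresponding PostgreSQL server encodings — an approximation, since the codec tables are Python's, not PostgreSQL's conversion procs:]

```python
# U+00A0 maps to a different byte sequence depending on the target
# encoding, which is exactly what makes it a useful conversion test
# character (a plain-ASCII character would be the same byte everywhere).
for enc in ("utf-8", "latin-1", "cp1250", "koi8-r"):
    print(enc, "\u00a0".encode(enc))
```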
Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)
On 06/01/2014 05:35 PM, Tom Lane wrote:
> I wrote:
>> 3. Try to select some "more portable" non-ASCII character, perhaps U+00A0
>> (non breaking space) or U+00E1 (a-acute).  I think this would probably
>> work for most encodings but it might still fail in the Far East.  Another
>> objection is that the expected/plpython_unicode.out file would contain
>> that character in UTF8 form.  In principle that would work, since the test
>> sets client_encoding = utf8 explicitly, but I'm worried about accidental
>> corruption of the expected file by text editors, file transfers, etc.
>> (The current usage of U+0080 doesn't suffer from this risk because psql
>> special-cases printing of multibyte UTF8 control characters, so that we
>> get exactly "\u0080".)
>
> I did a little bit of experimentation and determined that none of the
> LATIN1 characters are significantly more portable than what we've got:
> for instance a-acute fails to convert into 16 of the 33 supported
> server-side encodings (versus 17 failures for U+0080).  However,
> non-breaking space is significantly better: it converts into all our
> supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
> It seems likely that we won't do better than that except with a basic
> ASCII character.

Yeah, I just looked at the copyright symbol, with similar results.

Let's just stick to ASCII.

cheers

andrew
Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)
I wrote:
> 3. Try to select some "more portable" non-ASCII character, perhaps U+00A0
> (non breaking space) or U+00E1 (a-acute).  I think this would probably
> work for most encodings but it might still fail in the Far East.  Another
> objection is that the expected/plpython_unicode.out file would contain
> that character in UTF8 form.  In principle that would work, since the test
> sets client_encoding = utf8 explicitly, but I'm worried about accidental
> corruption of the expected file by text editors, file transfers, etc.
> (The current usage of U+0080 doesn't suffer from this risk because psql
> special-cases printing of multibyte UTF8 control characters, so that we
> get exactly "\u0080".)

I did a little bit of experimentation and determined that none of the
LATIN1 characters are significantly more portable than what we've got:
for instance a-acute fails to convert into 16 of the 33 supported
server-side encodings (versus 17 failures for U+0080).  However,
non-breaking space is significantly better: it converts into all our
supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW.
It seems likely that we won't do better than that except with a basic
ASCII character.

In principle we could make the test "pass" even in these encodings by
adding variant expected files, but I doubt it's worth it.  I'd be
inclined to just add a comment to the regression test file indicating
that that's a known failure case, and move on.

			regards, tom lane
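[The portability experiment described above can be approximated outside PostgreSQL with Python's codecs. This is only a sketch under stated assumptions: the encoding list below is a small hypothetical subset of the 33 server-side encodings, the codec tables are Python's rather than PostgreSQL's conversion procs, and the exact failure counts will therefore differ from Tom's figures.]

```python
def convertible(ch, enc):
    """True if the character has an equivalent in the target encoding."""
    try:
        ch.encode(enc)
        return True
    except UnicodeEncodeError:
        return False

# A few codecs standing in for PostgreSQL server-side encodings.
encodings = ["latin-1", "iso8859-2", "cp1250", "cp1251", "koi8-r",
             "euc_jp", "euc_kr", "gb2312", "big5"]

for name, ch in [("U+00A0 no-break space", "\u00a0"),
                 ("U+00E1 a-acute", "\u00e1"),
                 ("U+0080 (the old test character)", "\u0080")]:
    fails = [e for e in encodings if not convertible(ch, e)]
    print(name, "-> fails in:", ", ".join(fails) or "none")
```

Consistent with the thread, the no-break space survives the single-byte encodings here but not the EUC-family codecs.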