Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)
Andrew Dunstan <and...@dunslane.net> writes:
> On 06/01/2014 05:35 PM, Tom Lane wrote:
>> I did a little bit of experimentation and determined that none of the LATIN1 characters are significantly more portable than what we've got: for instance a-acute fails to convert into 16 of the 33 supported server-side encodings (versus 17 failures for U+0080). However, non-breaking space is significantly better: it converts into all our supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW. It seems likely that we won't do better than that except with a basic ASCII character.

> Yeah, I just looked at the copyright symbol, with similar results.

I'd been hopeful about that one too, but nope :-(

> Let's just stick to ASCII.

The more I think about it, the more I think that using a plain-ASCII character would defeat most of the purpose of the test. Non-breaking space seems like the best bet here, not least because it has several different representations among the encodings we support.

			regards, tom lane
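A quick way to see those "several different representations" is to push the candidate character through convert_to() in a UTF8 database. This is a minimal illustrative sketch, not part of the thread; it assumes standard_conforming_strings is on (so the U&'' literal syntax works), and the byte values in the comments are what the standard conversion tables should produce:

    -- U+00A0 maps to different byte sequences depending on the target
    -- server encoding, so round-tripping it exercises real conversion code.
    SELECT convert_to(U&'\00A0', 'UTF8');     -- \xc2a0 (two bytes)
    SELECT convert_to(U&'\00A0', 'LATIN1');   -- \xa0   (one byte)
    SELECT convert_to(U&'\00A0', 'WIN1250');  -- \xa0
    -- ...while the four EUC_* encodings raise a "has no equivalent" error:
    SELECT convert_to(U&'\00A0', 'EUC_JP');   -- ERROR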
Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)
I wrote:
> Andrew Dunstan <and...@dunslane.net> writes:
>> Let's just stick to ASCII.

> The more I think about it, the more I think that using a plain-ASCII character would defeat most of the purpose of the test. Non-breaking space seems like the best bet here, not least because it has several different representations among the encodings we support.

I've confirmed that a patch as attached behaves per expectation (in particular, it passes with WIN1250 database encoding).

I think that the worry I expressed about UTF8 characters in expected-files is probably overblown: we have such in the collate.linux.utf8 test, and we've not heard reports of that one breaking. (Though of course, it's not run by default :-(.)

It's still annoying that the test would fail in EUC_xx encodings, but I see no way around that without largely lobotomizing the test.

So I propose to apply this, and back-patch to 9.1 (not 9.0, because its version of this test is different anyway --- so Tomas will have to drop testing cs_CZ.WIN-1250 in 9.0).

			regards, tom lane

diff --git a/src/pl/plpython/expected/plpython_unicode.out b/src/pl/plpython/expected/plpython_unicode.out
index 859edbb..c7546dd 100644
*** a/src/pl/plpython/expected/plpython_unicode.out
--- b/src/pl/plpython/expected/plpython_unicode.out
***************
*** 1,22 ****
  --
  -- Unicode handling
  --
  SET client_encoding TO UTF8;
  CREATE TABLE unicode_test (
  	testvalue  text NOT NULL
  );
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\x80"
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\x80"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  CREATE TRIGGER unicode_test_bi BEFORE INSERT ON unicode_test
    FOR EACH ROW EXECUTE PROCEDURE unicode_trigger();
  CREATE FUNCTION unicode_plan1() RETURNS text AS E'
  plan = plpy.prepare("SELECT $1 AS testvalue", ["text"])
! rv = plpy.execute(plan, [u"\\x80"], 1)
  return rv[0]["testvalue"]
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_plan2() RETURNS text AS E'
--- 1,27 ----
  --
  -- Unicode handling
  --
+ -- Note: this test case is known to fail if the database encoding is
+ -- EUC_CN, EUC_JP, EUC_KR, or EUC_TW, for lack of any equivalent to
+ -- U+00A0 (no-break space) in those encodings.  However, testing with
+ -- plain ASCII data would be rather useless, so we must live with that.
+ --
  SET client_encoding TO UTF8;
  CREATE TABLE unicode_test (
  	testvalue  text NOT NULL
  );
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\xA0"
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\xA0"
  return "MODIFY"
  ' LANGUAGE plpythonu;
  CREATE TRIGGER unicode_test_bi BEFORE INSERT ON unicode_test
    FOR EACH ROW EXECUTE PROCEDURE unicode_trigger();
  CREATE FUNCTION unicode_plan1() RETURNS text AS E'
  plan = plpy.prepare("SELECT $1 AS testvalue", ["text"])
! rv = plpy.execute(plan, [u"\\xA0"], 1)
  return rv[0]["testvalue"]
  ' LANGUAGE plpythonu;
  CREATE FUNCTION unicode_plan2() RETURNS text AS E'
*************** return rv[0]["testvalue"]
*** 27,46 ****
  SELECT unicode_return();
   unicode_return 
  ----------------
!  \u0080
  (1 row)
  
  INSERT INTO unicode_test (testvalue) VALUES ('test');
  SELECT * FROM unicode_test;
   testvalue 
  -----------
!  \u0080
  (1 row)
  
  SELECT unicode_plan1();
   unicode_plan1 
  ---------------
!  \u0080
  (1 row)
  
  SELECT unicode_plan2();
--- 32,51 ----
  SELECT unicode_return();
   unicode_return 
  ----------------
!   
  (1 row)
  
  INSERT INTO unicode_test (testvalue) VALUES ('test');
  SELECT * FROM unicode_test;
   testvalue 
  -----------
!   
  (1 row)
  
  SELECT unicode_plan1();
   unicode_plan1 
  ---------------
!   
  (1 row)
  
  SELECT unicode_plan2();
diff --git a/src/pl/plpython/sql/plpython_unicode.sql b/src/pl/plpython/sql/plpython_unicode.sql
index bdd40c4..a11e5ee 100644
*** a/src/pl/plpython/sql/plpython_unicode.sql
--- b/src/pl/plpython/sql/plpython_unicode.sql
***************
*** 1,6 ****
--- 1,11 ----
  --
  -- Unicode handling
  --
+ -- Note: this test case is known to fail if the database encoding is
+ -- EUC_CN, EUC_JP, EUC_KR, or EUC_TW, for lack of any equivalent to
+ -- U+00A0 (no-break space) in those encodings.  However, testing with
+ -- plain ASCII data would be rather useless, so we must live with that.
+ --
  
  SET client_encoding TO UTF8;
  
*************** CREATE TABLE unicode_test (
*** 9,19 ****
  );
  
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\x80"
  ' LANGUAGE plpythonu;
  
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\x80"
  return "MODIFY"
  ' LANGUAGE plpythonu;
--- 14,24 ----
  );
  
  CREATE FUNCTION unicode_return() RETURNS text AS E'
! return u"\\xA0"
  ' LANGUAGE plpythonu;
  
  CREATE FUNCTION unicode_trigger() RETURNS trigger AS E'
! TD["new"]["testvalue"] = u"\\xA0"
  return "MODIFY"
  ' LANGUAGE plpythonu;
***************
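For anyone wanting to repeat the WIN1250 verification Tom mentions, here is a hedged sketch of the kind of setup involved; the database name is invented for the example, and it assumes the cs_CZ.WIN-1250 locale from Tomas's buildfarm animals is installed on the host:

    -- Hypothetical setup: a database whose server encoding is WIN1250,
    -- matching the case that used to fail before this patch.
    CREATE DATABASE plpython_win1250_test
        TEMPLATE   template0
        ENCODING   'WIN1250'
        LC_COLLATE 'cs_CZ.WIN-1250'
        LC_CTYPE   'cs_CZ.WIN-1250';
    -- With the patch, the test value U+00A0 has a WIN1250 equivalent
    -- (byte 0xA0); the old test value U+0080 had none, so the
    -- plpython_unicode test failed under this encoding.

The regression driver normally creates its own database, so this only shows the encoding/locale combination involved, not the exact buildfarm procedure.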
plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)
Tomas Vondra <t...@fuzzy.cz> writes:
> On 13.5.2014 20:58, Tom Lane wrote:
>> Tomas Vondra <t...@fuzzy.cz> writes:
>>> Yeah, not really what we were shooting for. I've fixed this by defining the missing locales, and indeed - magpie now fails in plpython tests.

>> I saw that earlier today (tho right now the buildfarm server seems to not be responding :-(). Probably we should use some more-widely-used character code in that specific test?

> Any idea what other character could be used in those tests? ISTM fixing this universally would mean using ASCII characters - the subset of UTF-8 common to all the encodings. But I'm afraid that'd contradict the very purpose of those tests ...

We really ought to resolve this issue so that we can get rid of some of the red in the buildfarm. ISTM there are three possible approaches:

1. Decide that we're not going to support running the plpython regression tests under weird server encodings, in which case Tomas should just remove cs_CZ.WIN-1250 from the set of encodings his buildfarm animals test. Don't much care for this, but it has the attraction of being minimal work.

2. Change the plpython_unicode test to use some ASCII character in place of \u0080. We could keep on using the \u syntax to create the character, but as stated above, this still seems like it's losing a significant amount of test coverage.

3. Try to select some more portable non-ASCII character, perhaps U+00A0 (non-breaking space) or U+00E1 (a-acute). I think this would probably work for most encodings but it might still fail in the Far East. Another objection is that the expected/plpython_unicode.out file would contain that character in UTF8 form. In principle that would work, since the test sets client_encoding = utf8 explicitly, but I'm worried about accidental corruption of the expected file by text editors, file transfers, etc. (The current usage of U+0080 doesn't suffer from this risk because psql special-cases printing of multibyte UTF8 control characters, so that we get exactly \u0080.)

Thoughts?

			regards, tom lane
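To make the parenthetical point in (3) concrete, here is a small illustrative session sketch (assuming a UTF8 client encoding and standard_conforming_strings on; the comments reflect the behavior described above):

    SET client_encoding TO UTF8;
    -- U+0080 is a C1 control character, which psql prints in escaped
    -- form, so the expected file stays pure ASCII:
    SELECT U&'\0080' AS c;   -- displays as \u0080
    -- U+00A0 is printable, so psql emits the raw UTF8 bytes and the
    -- expected file must contain a literal (invisible) no-break space:
    SELECT U&'\00A0' AS c;   -- displays as the character itself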
Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)
I wrote:
> 3. Try to select some more portable non-ASCII character, perhaps U+00A0 (non-breaking space) or U+00E1 (a-acute). I think this would probably work for most encodings but it might still fail in the Far East. Another objection is that the expected/plpython_unicode.out file would contain that character in UTF8 form. In principle that would work, since the test sets client_encoding = utf8 explicitly, but I'm worried about accidental corruption of the expected file by text editors, file transfers, etc. (The current usage of U+0080 doesn't suffer from this risk because psql special-cases printing of multibyte UTF8 control characters, so that we get exactly \u0080.)

I did a little bit of experimentation and determined that none of the LATIN1 characters are significantly more portable than what we've got: for instance a-acute fails to convert into 16 of the 33 supported server-side encodings (versus 17 failures for U+0080). However, non-breaking space is significantly better: it converts into all our supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW. It seems likely that we won't do better than that except with a basic ASCII character.

In principle we could make the test pass even in these encodings by adding variant expected files, but I doubt it's worth it. I'd be inclined to just add a comment to the regression test file indicating that that's a known failure case, and move on.

			regards, tom lane
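Tom doesn't show his harness, but tallies like these are straightforward to reproduce. Here is a hypothetical sketch of such an experiment using the pg_conversion catalog (variable names and output format are mine, not from the original message):

    -- Try to convert a candidate character into every server encoding
    -- reachable from UTF8, counting how many conversions fail.
    DO $$
    DECLARE
        enc  text;
        ok   int := 0;
        bad  int := 0;
    BEGIN
        FOR enc IN
            SELECT DISTINCT pg_encoding_to_char(contoencoding)
            FROM pg_conversion
            WHERE pg_encoding_to_char(conforencoding) = 'UTF8'
        LOOP
            BEGIN
                PERFORM convert_to(U&'\00A0', enc);  -- candidate character
                ok := ok + 1;
            EXCEPTION WHEN OTHERS THEN
                bad := bad + 1;
                RAISE NOTICE 'no equivalent in %', enc;
            END;
        END LOOP;
        RAISE NOTICE '% conversions succeeded, % failed', ok, bad;
    END
    $$;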
Re: plpython_unicode test (was Re: [HACKERS] buildfarm / handling (undefined) locales)
On 06/01/2014 05:35 PM, Tom Lane wrote:
> I wrote:
>> 3. Try to select some more portable non-ASCII character, perhaps U+00A0 (non-breaking space) or U+00E1 (a-acute). I think this would probably work for most encodings but it might still fail in the Far East. Another objection is that the expected/plpython_unicode.out file would contain that character in UTF8 form. In principle that would work, since the test sets client_encoding = utf8 explicitly, but I'm worried about accidental corruption of the expected file by text editors, file transfers, etc. (The current usage of U+0080 doesn't suffer from this risk because psql special-cases printing of multibyte UTF8 control characters, so that we get exactly \u0080.)

> I did a little bit of experimentation and determined that none of the LATIN1 characters are significantly more portable than what we've got: for instance a-acute fails to convert into 16 of the 33 supported server-side encodings (versus 17 failures for U+0080). However, non-breaking space is significantly better: it converts into all our supported server encodings except EUC_CN, EUC_JP, EUC_KR, EUC_TW. It seems likely that we won't do better than that except with a basic ASCII character.

Yeah, I just looked at the copyright symbol, with similar results.

Let's just stick to ASCII.

cheers

andrew
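For the record, Andrew's spot check amounts to a one-liner per target encoding. A hypothetical session (whether each call errors depends on the installed conversion tables):

    -- U+00A9 (copyright sign): convert_to() raises an error for any
    -- server encoding that lacks an equivalent character.
    SELECT convert_to(U&'\00A9', 'EUC_JP');
    SELECT convert_to(U&'\00A9', 'LATIN2');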