Re: Add Unicode Support to the DBI
On Wed, 09 Nov 2011 19:41:33 +, "Martin J. Evans" wrote:
> You're going to have a lot of problems with this test code and DBD::Unify
> as we previously discovered that DBD::Unify does not decode the data
> coming back from the database itself but it can be decoded by any Perl
> script using DBD::Unify into the correct data. Any chance you could
> change the test code to print out the results of type_info_all for
> DBD::Unify and send me them?

Right, I have this:

--8<---
$type_info_all = [
  { TYPE_NAME          => 0,
    DATA_TYPE          => 1,
    COLUMN_SIZE        => 2,
    LITERAL_PREFIX     => 3,
    LITERAL_SUFFIX     => 4,
    CREATE_PARAMS      => 5,
    NULLABLE           => 6,
    CASE_SENSITIVE     => 7,
    SEARCHABLE         => 8,
    UNSIGNED_ATTRIBUTE => 9,
    FIXED_PREC_SCALE   => 10,
    AUTO_UNIQUE_VALUE  => 11,
    LOCAL_TYPE_NAME    => 12,
    MINIMUM_SCALE      => 13,
    MAXIMUM_SCALE      => 14,
    SQL_DATA_TYPE      => 15,
    SQL_DATETIME_SUB   => 16,
    NUM_PREC_RADIX     => 17,
    INTERVAL_PRECISION => 18,
    },

  # TYPE_NAME           DATA_TYPE        SIZE  PFX   SFX   PARAMS             N C S UNSIG FPS AUTO LOCAL MINSC MAXSC SDT SDS RADIX PREC
  [ "UNKNOWN",          0,               undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
 #[ "GIANT AMOUNT",     undef,           undef,undef,undef,"PRECISION,SCALE", 1,0,3,0,undef,undef,undef,2,2,undef,undef,undef,undef, ],
  [ "HUGE AMOUNT",      -207,            undef,undef,undef,"PRECISION,SCALE", 1,0,3,0,undef,undef,undef,2,2,undef,undef,undef,undef, ],
  [ "AMOUNT",           -206,            undef,undef,undef,"PRECISION,SCALE", 1,0,3,0,undef,undef,undef,2,2,undef,undef,undef,undef, ],
  [ "VARBINARY",        SQL_VARBINARY,   undef,"'","'",undef,                 1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
  [ "BINARY",           SQL_BINARY,      undef,undef,undef,undef,             1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "CHAR",             SQL_CHAR,        undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "CURRENCY",         -218,            undef,undef,undef,"PRECISION,SCALE", 1,0,3,0,2,undef,undef,0,8,undef,undef,undef,undef, ],
  [ "TIMESTAMP",        SQL_TIMESTAMP,   undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
  [ "DATE",             SQL_DATE,        undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
  [ "DECIMAL",          SQL_DECIMAL,     undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "DOUBLE PRECISION", SQL_DOUBLE,      undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "FLOAT",            SQL_FLOAT,       undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "HUGE INTEGER",     SQL_BIGINT,      undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,0,undef,undef,undef,undef, ],
  [ "INTEGER",          SQL_INTEGER,     undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "NUMERIC",          SQL_NUMERIC,     undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,0,undef,undef,undef,undef, ],
  [ "REAL",             SQL_REAL,        undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "SMALLINT",         SQL_SMALLINT,    undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,0,undef,undef,undef,undef, ],
  [ "TEXT",             SQL_LONGVARCHAR, undef,"'","'",undef,                 1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
  [ "TIME",             SQL_TIME,        undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
  ];

my %odbc_types = map { ( $_->[0] => $_->[1], $_->[1] => $_->[0] ) }
  [ -5 => "BIGINT"       ], # SQL_BIGINT
  [ -3 => "VARBINARY"    ], # SQL_VARBINARY
  [ -2 => "BINARY"       ], # SQL_BINARY
  [ -1 => "TEXT"         ], # SQL_LONGVARCHAR
  [  0 => "UNKNOWN_TYPE" ], # SQL_UNKNOWN_TYPE
  [  1 => "CHAR"         ], # SQL_CHAR
  [  2 => "NUMERIC"      ], # SQL_NUMERIC
  [  3 => "DECIMAL"      ], # SQL_DECIMAL
  [  4 => "INTEGER"      ], # SQL_INTEGER
  [  5 => "SMALLINT"     ], # SQL_SMALLINT
  [  6 => "FLOAT"        ], # SQL_FLOAT
  [  7 => "REAL"         ], # SQL_REAL
  [  8 => "DOUBLE
Re: Add Unicode Support to the DBI
On Wed, 09 Nov 2011 19:41:33 +, "Martin J. Evans" wrote: tl;dr; > On 09/11/2011 15:49, H.Merijn Brand wrote: > > On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" > > wrote: > > > >> I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. > >> > >> It won't run right now without changing the do_connect sub as you have > >> to specify how to connect to the DB. > >> Also, there is a DBD specific section at the start where you might have > >> to add a DBD it does not know about (anything other than DBD::ODBC, > >> DBD::Oracle, DBD::SQLite, DBD::CSV, DBD::mysql) if it needs something > >> other than the defaults e.g., the name of the length function in SQL, > >> the column type for unicode columns and binary columns, the setting to > >> enable UTF8/Unicode support. It could be a bit of a pain if your DBD > >> does not support type_info_all but I'm around on irc and in this list if > >> anyone wants any help making it work. > >> > >> It needs rather a lot of tidying up so I'm not putting it forward as > >> code-of-the-year but it is a start. > >> > >> BTW, you'll need Test::More::UTF8 and perhaps a couple of other non core > >> modules to run it. > >> > >> Martin > > I'll have some deeper look at both Unify and CSV ... > > Attached is a revised version of the script (first argument is the > > driver to test, some need more work) > > > > $ perl /tmp/unicode_test.pl Unify > > # Driver DBD::Unify-0.78 > > # Using DBMS_NAME 'Unify DataServer' > > # Using DBMS_VER undef > > # Using DRIVER_NAME '/pro/asql/v83I/lib/perl/5.10.1/DBD/Unify.pm' > > # Using DRIVER_VER '00.78.' > > # SQL_IDENTIFIER_CASE 3 > > # LANGDIR = dutch > > print() on closed filehandle $fh at /tmp/unicode_test.pl line 438. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > # Found type (HUGE AMOUNT) size= > > Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > # Found type (AMOUNT) size= > > Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > # Found type (BINARY) size= > > Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > : > > : > > # Found type (TIME) size= > > Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. > > DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for > > Statement "create table fredĀ ( a int)"] at /tmp/unicode_test.pl line 214. > > not ok 1 - unicode table name supported > > # Failed test 'unicode table name supported' > > # at /tmp/unicode_test.pl line 216. > > # died: DBD::Unify::db prepare failed: Syntax error in SQL dynamic > > statement. [for Statement "create table fredÄ ( a int)"] at > > /tmp/unicode_test.pl line 214. > > ok 2 # skip Failed to create unicode table name > > ok 3 # skip Failed to create unicode table name > > DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. 
[for > > Statement "create table fred ( daveĀ int)"] at /tmp/unicode_test.pl line > > 214. > > not ok 4 - unicode column name supported > > Your going to have a lot of problems with this test code and DBD::Unify > as we previously discovered that DBD::Unify does not decode the data I can change that as author and maintainer of this driver (if I want) > coming back from the database itself but it can be decoded by any Perl > script using DBD::Unify into the correct data. Any chance you could > change the test code to print out the results of type_info_all for > DBD::Unify and send me them? # Driver DBD::Unify-0.78 # Using DBMS_NAME 'Unify DataServer' # Using DBMS_VER undef # Using DRIVER_NAME '/pro/asql/v83I/lib/perl/5.10.1/DBD/Unify.pm' # Using DRIVER_VER '00.78.' # SQL_IDENTIFIER_CASE 3 # LANGDIR = dutch [ { AUTO_UNIQUE_VALUE => 11, CASE_SENSITIVE => 7, COLUMN_SIZE => 2, CREATE_PARAMS=> 5, DATA_TYPE=> 1, FIXED_PREC_SCALE => 10, INTERVAL_PRECISION => 18, LITERAL_PREFIX => 3, LITERAL_SUFFIX => 4, LOCAL_TYPE_NAME => 12, MAXIMUM_SCALE=> 14, MINIMUM_SCALE=> 13, NULLABLE => 6, NUM_PREC_RADIX => 17, SEARCHABLE => 8, SQL_DATA_TYPE=> 15, SQL_DATETIME_SUB => 16, TYPE_NAME=> 0, UNSIGNED_ATTRIBUTE => 9 }
Re: Add Unicode Support to the DBI
On 09/11/2011 15:49, H.Merijn Brand wrote: On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" wrote: I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. It won't run right now without changing the do_connect sub as you have to specify how to connect to the DB. Also, there is a DBD specific section at the start where you might have to add a DBD it does not know about (anything other than DBD::ODBC, DBD::Oracle, DBD::SQLite, DBD::CSV, DBD::mysql) if it needs something other than the defaults e.g., the name of the length function in SQL, the column type for unicode columns and binary columns, the setting to enable UTF8/Unicode support. It could be a bit of a pain if your DBD does not support type_info_all but I'm around on irc and in this list if anyone wants any help making it work. It needs rather a lot of tidying up so I'm not putting it forward as code-of-the-year but it is a start. BTW, you'll need Test::More::UTF8 and perhaps a couple of other non core modules to run it. Martin I'll have some deeper look at both Unify and CSV ... Attached is a revised version of the script (first argument is the driver to test, some need more work) $ perl /tmp/unicode_test.pl Unify # Driver DBD::Unify-0.78 # Using DBMS_NAME 'Unify DataServer' # Using DBMS_VER undef # Using DRIVER_NAME '/pro/asql/v83I/lib/perl/5.10.1/DBD/Unify.pm' # Using DRIVER_VER '00.78.' # SQL_IDENTIFIER_CASE 3 # LANGDIR = dutch print() on closed filehandle $fh at /tmp/unicode_test.pl line 438. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (HUGE AMOUNT) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (AMOUNT) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (BINARY) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. : : # Found type (TIME) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fredĀ ( a int)"] at /tmp/unicode_test.pl line 214. not ok 1 - unicode table name supported # Failed test 'unicode table name supported' # at /tmp/unicode_test.pl line 216. # died: DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fredÄ ( a int)"] at /tmp/unicode_test.pl line 214. ok 2 # skip Failed to create unicode table name ok 3 # skip Failed to create unicode table name DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fred ( daveĀ int)"] at /tmp/unicode_test.pl line 214. not ok 4 - unicode column name supported Your going to have a lot of problems with this test code and DBD::Unify as we previously discovered that DBD::Unify does not decode the data coming back from the database itself but it can be decoded by any Perl script using DBD::Unify into the correct data. 
Any chance you could change the test code to print out the results of type_info_all for DBD::Unify and send me them?

I'd like to fix all those "Use of uninitialized value" warnings so I'll look into that. The test code has changed since then, so the line numbers no longer match; I'll have to compare them with your attachment. I'm guessing they are all in the processing of the results from type_info_all.

To be honest, this is not a unicode/encoding issue but one of my biggest annoyances when working with DBDs: finding the right column type for a table is really hit and miss - many DBDs don't support type_info_all and some that do only provide a subset of the columns DBI defines.

Somewhere (I've lost it right now but I'll find it later) I have a document which details my experiences of trying to write DBD-independent code which worked with DBD::mysql, DBD::DB2 and DBD::Oracle. In the end we gave up - it was just too difficult, and it wasn't just the different SQL syntax/support. We ended up putting all the logic into procedures/functions and simply calling them from Perl, which avoided the SQL syntax issues and loads of other differences. Perhaps I'll go back to that if I find some time.

$ perl /tmp/unicode_test.pl
# Driver DBD::SQLite-1.33
# Using DBMS_NAME 'SQLite'
# Using DBMS_VER '3.7.6.3'
# Using DRIVER_NAME undef
# Using DRIVER_VER undef
# SQL_IDENTIFIER_CASE undef
# LANGDIR = dutch
prin
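On the type_info_all complaint above, a minimal sketch of the kind of defensive lookup a driver-independent script ends up doing is shown below. The function name, the preference for SQL_WVARCHAR and the VARCHAR(100) fallback are illustrative choices for this sketch, not something unicode_test.pl is known to do.

use DBI qw(:sql_types);

# Ask the driver which types it has; fall back to a plain VARCHAR when
# type_info_all is missing or reports too few columns to be useful.
sub find_char_type {
    my ($dbh) = @_;
    my $tia = $dbh->type_info_all;
    return 'VARCHAR(100)' unless $tia && @$tia > 1;
    my %idx = %{ $tia->[0] };                        # column name => array index
    my ($name_i, $type_i) = @idx{qw(TYPE_NAME DATA_TYPE)};
    return 'VARCHAR(100)' unless defined $name_i && defined $type_i;
    my %by_type;
    for my $row (@{$tia}[1 .. $#{$tia}]) {
        my $dt = $row->[$type_i];
        $by_type{$dt} //= $row->[$name_i] if defined $dt;
    }
    # prefer a wide varchar if the driver reports one, else a plain varchar
    return $by_type{ SQL_WVARCHAR() } // $by_type{ SQL_VARCHAR() } // 'VARCHAR(100)';
}

A script would then build its test table from whatever name comes back, e.g. "create table fred (a $type)".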
Re: Add Unicode Support to the DBI
On Wed, 9 Nov 2011 16:23:53 +, Tim Bunce wrote: > On Wed, Nov 09, 2011 at 04:50:29PM +0100, H.Merijn Brand wrote: > > On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" > > wrote: > > > > > I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. > > > > So now attached > > Any chance you could rework your changes into the (recently updated) > version in the DBI svn repo? Yes, after I verify why DBD::Unify messes up on this (FWIW, the number of Unify customers is getting smaller and smaller, so the motivation to improve on DBD::Unify is diminishing) > Tim. -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00, 11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Re: Add Unicode Support to the DBI
On Wed, Nov 09, 2011 at 04:50:29PM +0100, H.Merijn Brand wrote: > On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" > wrote: > > > I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. > > So now attached Any chance you could rework your changes into the (recently updated) version in the DBI svn repo? Tim.
Re: Add Unicode Support to the DBI
On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" wrote: > I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. So now attached -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00, 11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/ unicode_test.pl Description: Perl program
Re: Add Unicode Support to the DBI
On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" wrote: > I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. > > It won't run right now without changing the do_connect sub as you have > to specify how to connect to the DB. > Also, there is a DBD specific section at the start where you might have > to add a DBD it does not know about (anything other than DBD::ODBC, > DBD::Oracle, DBD::SQLite, DBD::CSV, DBD::mysql) if it needs something > other than the defaults e.g., the name of the length function in SQL, > the column type for unicode columns and binary columns, the setting to > enable UTF8/Unicode support. It could be a bit of a pain if your DBD > does not support type_info_all but I'm around on irc and in this list if > anyone wants any help making it work. > > It needs rather a lot of tidying up so I'm not putting it forward as > code-of-the-year but it is a start. > > BTW, you'll need Test::More::UTF8 and perhaps a couple of other non core > modules to run it. > > Martin I'll have some deeper look at both Unify and CSV ... Attached is a revised version of the script (first argument is the driver to test, some need more work) $ perl /tmp/unicode_test.pl Unify # Driver DBD::Unify-0.78 # Using DBMS_NAME 'Unify DataServer' # Using DBMS_VER undef # Using DRIVER_NAME '/pro/asql/v83I/lib/perl/5.10.1/DBD/Unify.pm' # Using DRIVER_VER '00.78.' # SQL_IDENTIFIER_CASE 3 # LANGDIR = dutch print() on closed filehandle $fh at /tmp/unicode_test.pl line 438. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (HUGE AMOUNT) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (AMOUNT) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (BINARY) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. : : # Found type (TIME) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fredĀ ( a int)"] at /tmp/unicode_test.pl line 214. not ok 1 - unicode table name supported # Failed test 'unicode table name supported' # at /tmp/unicode_test.pl line 216. # died: DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fredÄ ( a int)"] at /tmp/unicode_test.pl line 214. ok 2 # skip Failed to create unicode table name ok 3 # skip Failed to create unicode table name DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fred ( daveĀ int)"] at /tmp/unicode_test.pl line 214. not ok 4 - unicode column name supported $ perl /tmp/unicode_test.pl # Driver DBD::SQLite-1.33 # Using DBMS_NAME 'SQLite' # Using DBMS_VER '3.7.6.3' # Using DRIVER_NAME undef # Using DRIVER_VER undef # SQL_IDENTIFIER_CASE undef # LANGDIR = dutch print() on closed filehandle $fh at /tmp/unicode_test.pl line 438. 
ok 1 - unicode table name supported
ok 2 - unicode table found in unqualified table_info
ok 3 - unicode table found by qualified table_info
ok 4 - unicode column name supported
ok 5 - unicode column found in unqualified column_info
ok 6 - unicode column found by qualified column_info
ok 7 - table for unicode data
ok 8 - insert unicode data into table
ok 9 - unicode data out = unicode data in, no where
ok 10 - length of output data the same
ok 11 - db length of unicode data correct
ok 12 - select unicode data via parameterised where
ok 13 - select unicode data via inline where
ok 14 - table for unicode data
ok 15 - insert unicode data and blob into table
ok 16 - unicode data out = unicode data in, no where with blob
ok 17 - utf8 flag not set on blob data
Use of uninitialized value in string eq at /tmp/unicode_test.pl line 373.
ok 18 - retrieved blob = inserted blob
ok 19 - test table for unicode parameter markers
DBD::SQLite::st bind_param failed: Unknown named parameter: fred€ [for Statement "insert into fred (a) values (:fred€)"] at /tmp/unicode_test.pl line 390.
not ok 20 - bind parameter with unicode parameter marker
# Failed test 'bind parameter with unicode parameter marker'
# at /tmp/unicode_test.pl line 392.
# died: DBD::SQLite::st bind_param failed: Unknown named parameter: fred€ [for Statement "insert into fred (a) values (:fred€)"] at /tmp/unicode_test.pl line 39
Re: Add Unicode Support to the DBI
On 08/11/2011 17:53, David E. Wheeler wrote: On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote: 1. Focus initially on categorising the capabilities of the databases. Specifically separating those that understand character encodings at one or more of column, table, schema, database level. Answer the questions: what "Unicode support" is this database capable of? [vague] are particular column data types or attributes needed? does the db have a session/connection encoding concept? does the db support binary data types. does the client api identify data encoding? A table summarizing this kind of info would be of great value. I think this is the most important kind of data we need to move forward with this topic. I suspect we'll end up with a few clear levels of "unicode support" by databases that we can then focus on more clearly. +1. Yes, this should make things pretty clear. 2. Try to make a data-driven common test script. It should fetch the length of the stored value, something like: CREATE TABLE t (c VARCHAR(10)); INSERT INTO t VALUES (?)<= $sth->execute("\x{263A}") # simley SELECT LENGTH(c), c FROM t Fetching the LENGTH is important because it tells us if the DB is treating the value as Unicode. The description of DBD::Unify, for example, doesn't clarify if the db itself regards the stored value as unicode or the underlying string of encoded bytes. Also probably best to avoid latin characters for this, I'd use something that always has a multi-byte encoding, like a simley face char. And something that doesn't have a variant that uses combining characters, so that the length should be consistent if it's treated as Unicode. 3. Focus on placeholders initially. We can ponder utf8 in literal SQL later. That's a separate ball of mud. (I'd also ignore unicode table/column/db names. It's a much lower priority and may become clearer when other issues get resolved.) +1, though good to know about. Just as important as placeholders, however, is fetching data. 4. Tests could report local LANG / LC_ALL env var value so when others report their results we'll have that context. Thanks again. I've only given it a quick skim. I'll read it again before LPW. Meanwhile, it would be great if people could contribute the info for #1. Tim. p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff would make the tests shorter. my $sample_string = "\x{263A}"; ... print data_diff($sample_string, $returned_string); Can this be turned into a complete script we can all just run? Thanks, David I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. It won't run right now without changing the do_connect sub as you have to specify how to connect to the DB. Also, there is a DBD specific section at the start where you might have to add a DBD it does not know about (anything other than DBD::ODBC, DBD::Oracle, DBD::SQLite, DBD::CSV, DBD::mysql) if it needs something other than the defaults e.g., the name of the length function in SQL, the column type for unicode columns and binary columns, the setting to enable UTF8/Unicode support. It could be a bit of a pain if your DBD does not support type_info_all but I'm around on irc and in this list if anyone wants any help making it work. It needs rather a lot of tidying up so I'm not putting it forward as code-of-the-year but it is a start. BTW, you'll need Test::More::UTF8 and perhaps a couple of other non core modules to run it. Martin
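To make the "DBD specific section" described above concrete, here is a rough sketch of the shape such a per-driver table can take. The connection attributes shown (mysql_enable_utf8, sqlite_unicode, pg_enable_utf8) are the ones those drivers document, and char_length is the function DBD::CSV and DBD::mysql use as mentioned elsewhere in this thread; the remaining values are illustrative defaults rather than the actual contents of unicode_test.pl.

# Per-driver overrides; anything not listed falls back to the defaults.
my %driver_config = (
    default => {
        length_fn    => 'length',        # SQL function returning a character count
        char_column  => 'varchar(100)',  # column type used for unicode data
        blob_column  => 'blob',          # column type used for binary data
        connect_attr => {},              # attribute(s) enabling unicode support, if any
    },
    mysql  => { length_fn    => 'char_length',
                connect_attr => { mysql_enable_utf8 => 1 } },
    CSV    => { length_fn    => 'char_length' },
    SQLite => { connect_attr => { sqlite_unicode => 1 } },
    Pg     => { connect_attr => { pg_enable_utf8 => 1 } },
);

my $driver = shift(@ARGV) // 'default';  # first argument is the driver to test
my %cfg = ( %{ $driver_config{default} }, %{ $driver_config{$driver} // {} } );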
Re: Add Unicode Support to the DBI
On 08/11/2011 13:16, Tim Bunce wrote: On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote: I didn't think I was going to make LPW but it seems I will now - although it has cost me big time leaving it until the last minute. All your beers at LPW are on me! http://www.martin-evans.me.uk/node/121 Great work Martin. Many thanks. I've some comments and suggestions for you... It says "There is no single way across DBDs to enable Unicode support" but doesn't define what "Unicode support" actually means. Clearly the "Unicode support" of Oracle will be different to that of a CSV file. So it seems that we need to be really clear about what we want. I'd suggest... 1. Focus initially on categorising the capabilities of the databases. Specifically separating those that understand character encodings at one or more of column, table, schema, database level. Answer the questions: what "Unicode support" is this database capable of? [vague] are particular column data types or attributes needed? does the db have a session/connection encoding concept? does the db support binary data types. does the client api identify data encoding? A table summarizing this kind of info would be of great value. I think this is the most important kind of data we need to move forward with this topic. I suspect we'll end up with a few clear levels of "unicode support" by databases that we can then focus on more clearly. this will take some time but I'll work on it. It could be a nightmare for DBD::ODBC as it supports dozens of dbs so I'll have to be a bit cagey about the stuff like "what "Unicode support" is this database capable of". Contributions welcome. 2. Try to make a data-driven common test script. It should fetch the length of the stored value, something like: CREATE TABLE t (c VARCHAR(10)); INSERT INTO t VALUES (?)<= $sth->execute("\x{263A}") # simley SELECT LENGTH(c), c FROM t Fetching the LENGTH is important because it tells us if the DB is treating the value as Unicode. The description of DBD::Unify, for example, doesn't clarify if the db itself regards the stored value as unicode or the underlying string of encoded bytes. Changed to add length(c) test - sorry misread that first time as Perl code length(c) not SQL - I'm not well enough to think so clearly right now. Will take some time to run through the DBDs to check whether it is length()/len() or something else - isn't SQL great e.g., DBD::CSV (SQL::Statement) is char_length and most other SQLs are len or length. BTW, DBD::mysql fails the length test using the length function char_length (documented as Return number of characters in argument) - it returns 3 for the smiley not 1, not investigated why yet. Also probably best to avoid latin characters for this, I'd use something that always has a multi-byte encoding, like a simley face char. changed to use smiley face for data but not for table/column name checks as most dbs have other rules on table/column names. 3. Focus on placeholders initially. We can ponder utf8 in literal SQL later. That's a separate ball of mud. (I'd also ignore unicode table/column/db names. It's a much lower priority and may become clearer when other issues get resolved.) test already does that so left in - it is easy to comment it out - one line change. 4. Tests could report local LANG / LC_ALL env var value so when others report their results we'll have that context. 
added although I stopped short of outputting everything in %ENV as I guessed people would end up having to edit it to remove stuff - it does LANG, LC_* and NLS_* right now. Thanks again. I've only given it a quick skim. I'll read it again before LPW. Meanwhile, it would be great if people could contribute the info for #1. Tim. p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff would make the tests shorter. my $sample_string = "\x{263A}"; ... print data_diff($sample_string, $returned_string); added or data_diff() to tests which fail - is($x, $y, "test") or data_diff($x,$y) Test::More::is output is horrible for unicode on a non UTF-8 enabled terminal. BTW, the smiley character causes me problems as for some reason with MS SQL Server you can do: insert into table values(?) bind_param("\{many_unicode_chars}") select column from table where column = $h->quote("\{many_unicode_chars}") but for some reason the smiley does not work in the select SQL even though the insert works :-( A parameterized select works fine too. I will upload the current script to DBI's subversion tree in ex dir but if anyone is going to do anything major with it I'd rather they mail say dbi-dev or me first and I can at least check that before making any changes myself (at least until after LPW). Having said that as the birthday boy tomorrow I can al
Re: Add Unicode Support to the DBI
On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote: > 1. Focus initially on categorising the capabilities of the databases. >Specifically separating those that understand character encodings >at one or more of column, table, schema, database level. >Answer the questions: >what "Unicode support" is this database capable of? [vague] >are particular column data types or attributes needed? >does the db have a session/connection encoding concept? >does the db support binary data types. >does the client api identify data encoding? >A table summarizing this kind of info would be of great value. >I think this is the most important kind of data we need to move >forward with this topic. I suspect we'll end up with a few clear >levels of "unicode support" by databases that we can then focus on >more clearly. +1. Yes, this should make things pretty clear. > 2. Try to make a data-driven common test script. >It should fetch the length of the stored value, something like: >CREATE TABLE t (c VARCHAR(10)); >INSERT INTO t VALUES (?) <= $sth->execute("\x{263A}") # simley >SELECT LENGTH(c), c FROM t >Fetching the LENGTH is important because it tells us if the DB is >treating the value as Unicode. The description of DBD::Unify, for >example, doesn't clarify if the db itself regards the stored value >as unicode or the underlying string of encoded bytes. >Also probably best to avoid latin characters for this, I'd use >something that always has a multi-byte encoding, like a simley face char. And something that doesn't have a variant that uses combining characters, so that the length should be consistent if it's treated as Unicode. > 3. Focus on placeholders initially. >We can ponder utf8 in literal SQL later. That's a separate ball of mud. >(I'd also ignore unicode table/column/db names. It's a much lower >priority and may become clearer when other issues get resolved.) +1, though good to know about. Just as important as placeholders, however, is fetching data. > 4. Tests could report local LANG / LC_ALL env var value >so when others report their results we'll have that context. > > Thanks again. I've only given it a quick skim. I'll read it again before LPW. > > Meanwhile, it would be great if people could contribute the info for #1. > > Tim. > > p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff > would make the tests shorter. >my $sample_string = "\x{263A}"; >... >print data_diff($sample_string, $returned_string); Can this be turned into a complete script we can all just run? Thanks, David
Re: Add Unicode Support to the DBI
On Tue, Nov 08, 2011 at 02:45:39PM +, Martin J. Evans wrote: > On 08/11/11 13:16, Tim Bunce wrote: > >On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote: > > >2. Try to make a data-driven common test script. > > There is already one attached to the bottom of the post and referred to in > the post - probably was not very clear. Ah, it's in the middle, in the "DBDs" section. Could you put it in github or the DBI repo? I may find some time to hack on it. > > It should fetch the length of the stored value, something like: > > CREATE TABLE t (c VARCHAR(10)); > > INSERT INTO t VALUES (?)<= $sth->execute("\x{263A}") # simley > > SELECT LENGTH(c), c FROM t > > It does that with a euro \x{20ac} > > > Fetching the LENGTH is important because it tells us if the DB is > > treating the value as Unicode. > > It doesn't do that but it checks what went in \x{20ac} is what comes > out which is the same as checking the length since what goes in is a > euro character. For a euro to come back out at the minimum the DBD > would have to decode the data from the database. It's not only important that what goes in comes back out again, but also that what goes in is interpreted by the database in the right way. Otherwise there are lots of more subtle issues like sorting bugs caused by the database 'seeing' encoded bytes instead of unicode characters. (Hence my point about the description for DBD::Unify being incomplete.) It's _possible_ for a db to store a euro as a single byte. So it's possible for the test to yield a false positive. I prefer using a simley because it's not possible to store it in a single byte. So if LENGTH(c) returns 1 we can be very confident that the db is interpreting the data correctly. > >Thanks again. I've only given it a quick skim. I'll read it again before LPW. > > I will try and update it before then but a) it is my birthday tomorrow and b) > I'm a bit under the weather at the moment. HAPPY BIRTHDAY! Get well soon! > >Meanwhile, it would be great if people could contribute the info for #1. > > I am happy to collect any such info and write it up so please at least cc me. > >p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff > >would make the tests shorter. > > my $sample_string = "\x{263A}"; > > ... > > print data_diff($sample_string, $returned_string); > > Yes, the output is a bit screwed because I was running the test code > on Windows in a dos terminal and I've never got that working properly. Ouch. > Data::Dumper produces better output but I forgot data_diff - I will > change where relevant. > > BTW, the example code in each DBD was not really the test, it was just > an example. My aim was to produce one script which ran to any DBD and > that is attached to the end of the blog post. It's just it was too > long to incorporate into a blog posting whereas the short simple > examples were better. Yeap. Good idea. I'd simply missed the link when I skimmed it. Thanks again for driving this forward Martin. Tim.
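Tim's false-positive point is easy to demonstrate outside any database: the euro has a one-byte form in several legacy encodings, while the smiley does not, so LENGTH(c) = 1 for a stored euro proves nothing about Unicode handling. A throwaway snippet (not part of the test script) showing the byte counts:

use Encode ();

my $euro   = "\x{20AC}";   # EURO SIGN
my $smiley = "\x{263A}";   # WHITE SMILING FACE

for my $enc ('cp1252', 'iso-8859-15', 'UTF-8') {
    my $euro_len   = length Encode::encode($enc, $euro);
    my $smiley_len = eval { length Encode::encode($enc, $smiley, Encode::FB_CROAK) };
    printf "%-11s euro = %d byte(s), smiley = %s\n",
        $enc, $euro_len,
        defined $smiley_len ? "$smiley_len byte(s)" : "not representable";
}
# cp1252 and iso-8859-15 hold the euro in a single byte, so a database using
# one of them can report LENGTH = 1 without understanding Unicode at all;
# the smiley has no single-byte representation anywhere.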
Re: Add Unicode Support to the DBI
On 08/11/11 13:16, Tim Bunce wrote: On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote: I didn't think I was going to make LPW but it seems I will now - although it has cost me big time leaving it until the last minute. All your beers at LPW are on me! http://www.martin-evans.me.uk/node/121 Great work Martin. Many thanks. I've some comments and suggestions for you... It says "There is no single way across DBDs to enable Unicode support" but doesn't define what "Unicode support" actually means. Clearly the "Unicode support" of Oracle will be different to that of a CSV file. Point taken. So it seems that we need to be really clear about what we want. I'd suggest... 1. Focus initially on categorising the capabilities of the databases. Specifically separating those that understand character encodings at one or more of column, table, schema, database level. Answer the questions: what "Unicode support" is this database capable of? [vague] are particular column data types or attributes needed? does the db have a session/connection encoding concept? does the db support binary data types. does the client api identify data encoding? A table summarizing this kind of info would be of great value. I think this is the most important kind of data we need to move forward with this topic. I suspect we'll end up with a few clear levels of "unicode support" by databases that we can then focus on more clearly. 2. Try to make a data-driven common test script. There is already one attached to the bottom of the post and referred to in the post - probably was not very clear. It should fetch the length of the stored value, something like: CREATE TABLE t (c VARCHAR(10)); INSERT INTO t VALUES (?)<= $sth->execute("\x{263A}") # simley SELECT LENGTH(c), c FROM t It does that with a euro \x{20ac} Fetching the LENGTH is important because it tells us if the DB is treating the value as Unicode. It doesn't do that but it checks what went in \x{20ac} is what comes out which is the same as checking the length since what goes in is a euro character. For a euro to come back out at the minimum the DBD would have to decode the data from the database. It also does this with table and column names and table_info/column_info. The description of DBD::Unify, for example, doesn't clarify if the db itself regards the stored value as unicode or the underlying string of encoded bytes. Also probably best to avoid latin characters for this, I'd use something that always has a multi-byte encoding, like a simley face char. I mostly do that but had problems creating tables and columns with a euro in the name (which I can sort of understand as I can believe there are more rules to valid table/column names like they must be alphabetic but could still be unicode). The code to create unicode table/column names uses \x{0100} which is fredĀ (LATIN CAPITAL LETTER A WITH MACRON) which I think is ok. 3. Focus on placeholders initially. We can ponder utf8 in literal SQL later. That's a separate ball of mud. (I'd also ignore unicode table/column/db names. It's a much lower priority and may become clearer when other issues get resolved.) Yeah, I only added that because I knew in DBD::ODBC it did not work because the interface is char * and not Perl scalar. 4. Tests could report local LANG / LC_ALL env var value so when others report their results we'll have that context. Yes, forgot that - will add it to the test. Thanks again. I've only given it a quick skim. I'll read it again before LPW. 
I will try and update it before then but a) it is my birthday tomorrow and b) I'm a bit under the weather at the moment. Meanwhile, it would be great if people could contribute the info for #1. I am happy to collect any such info and write it up so please at least cc me. Tim. p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff would make the tests shorter. my $sample_string = "\x{263A}"; ... print data_diff($sample_string, $returned_string); Yes, the output is a bit screwed because I was running the test code on Windows in a dos terminal and I've never got that working properly. Data::Dumper produces better output but I forgot data_diff - I will change where relevant. BTW, the example code in each DBD was not really the test, it was just an example. My aim was to produce one script which ran to any DBD and that is attached to the end of the blog post. It's just it was too long to incorporate into a blog posting whereas the short simple examples were better. Unfortunately the differences between DBDs (and some limitations and bugs) is making the script quite complex in places e.g., one which caught me out was you have to set LongReadLen in DBD::Oracle before calling prepare but all other
Re: Add Unicode Support to the DBI
On Tue, 8 Nov 2011 13:16:17 +, Tim Bunce wrote: > On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote: > > > > > > I didn't think I was going to make LPW but it seems I will now - > > > although it has cost me big time leaving it until the last minute. > > All your beers at LPW are on me! > > > http://www.martin-evans.me.uk/node/121 > > Great work Martin. Many thanks. I more and more regret I cannot be there. Please, beside doing useful stuff, have FUN! -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00, 11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Re: Add Unicode Support to the DBI
On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
>
> I didn't think I was going to make LPW but it seems I will now - although it
> has cost me big time leaving it until the last minute.

All your beers at LPW are on me!

> http://www.martin-evans.me.uk/node/121

Great work Martin. Many thanks. I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support" but doesn't define what "Unicode support" actually means. Clearly the "Unicode support" of Oracle will be different to that of a CSV file. So it seems that we need to be really clear about what we want. I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
   Specifically separating those that understand character encodings
   at one or more of column, table, schema, database level.
   Answer the questions:
   what "Unicode support" is this database capable of? [vague]
   are particular column data types or attributes needed?
   does the db have a session/connection encoding concept?
   does the db support binary data types?
   does the client api identify data encoding?
   A table summarizing this kind of info would be of great value.
   I think this is the most important kind of data we need to move
   forward with this topic. I suspect we'll end up with a few clear
   levels of "unicode support" by databases that we can then focus on
   more clearly.

2. Try to make a data-driven common test script.
   It should fetch the length of the stored value, something like:
   CREATE TABLE t (c VARCHAR(10));
   INSERT INTO t VALUES (?)    <= $sth->execute("\x{263A}") # smiley
   SELECT LENGTH(c), c FROM t
   Fetching the LENGTH is important because it tells us if the DB is
   treating the value as Unicode. The description of DBD::Unify, for
   example, doesn't clarify if the db itself regards the stored value
   as unicode or the underlying string of encoded bytes.
   Also probably best to avoid latin characters for this; I'd use
   something that always has a multi-byte encoding, like a smiley face char.

3. Focus on placeholders initially.
   We can ponder utf8 in literal SQL later. That's a separate ball of mud.
   (I'd also ignore unicode table/column/db names. It's a much lower
   priority and may become clearer when other issues get resolved.)

4. Tests could report local LANG / LC_ALL env var value
   so when others report their results we'll have that context.

Thanks again. I've only given it a quick skim. I'll read it again before LPW.

Meanwhile, it would be great if people could contribute the info for #1.

Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff would make the tests shorter.
   my $sample_string = "\x{263A}";
   ...
   print data_diff($sample_string, $returned_string);
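Items 2 and 4 above, plus the data_diff() suggestion, fit in a few lines of DBI. The sketch below uses an in-memory DBD::SQLite database purely as an example; the DSN, the sqlite_unicode attribute and the exact spelling of the LENGTH function are precisely the per-driver details a common script has to parameterise.

use strict;
use warnings;
use DBI qw(data_diff);
use Test::More;

# item 4: record the locale the results were produced under
diag "LANG=" . ($ENV{LANG} // '') . " LC_ALL=" . ($ENV{LC_ALL} // '');

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, sqlite_unicode => 1 });

my $smiley = "\x{263A}";                     # WHITE SMILING FACE
$dbh->do('CREATE TABLE t (c VARCHAR(10))');
$dbh->do('INSERT INTO t VALUES (?)', undef, $smiley);

my ($len, $c) = $dbh->selectrow_array('SELECT LENGTH(c), c FROM t');
is($len, 1,       'database sees one character, not its encoded bytes');
is($c,   $smiley, 'round trip preserves the character')
    or diag data_diff($smiley, $c);
done_testing();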
Re: Add Unicode Support to the DBI
On 04/11/11 08:39, Martin J. Evans wrote: On 03/11/11 23:25, David E. Wheeler wrote: On Oct 7, 2011, at 5:06 PM, David E. Wheeler wrote: Perhaps we could carve out some time at LPW to sit together and try to progress this. That would be awesome you guys! So gents, do you plan to do this a bit? Martin, do you have the data you wanted to collect on this? Thanks, David Sorry David, I've been snowed under. I will try very hard to publish the research I found this weekend. I didn't think I was going to make LPW but it seems I will now - although it has cost me big time leaving it until the last minute. Martin http://www.martin-evans.me.uk/node/121 Martin -- Martin J. Evans Easysoft Limited http://www.easysoft.com
Re: Add Unicode Support to the DBI
On 05/10/2011 00:06, Jonathan Leffler wrote: On Tue, Oct 4, 2011 at 15:24, Martin J. Evanswrote: On 04/10/2011 22:38, Tim Bunce wrote: I've not had time to devote to this thread. Sorry. I'd be grateful if someone could post a summary of it if/when it approaches some kind of consensus. I don't think there is a "kind of consensus" right now (although some useful discussion which probably will bear fruit) and I'd prefer to work out what unicode support already exists and how it is implemented first. For instance, Pg is very focussed on UTF-8 (as are most DBDs) and yet ODBC uses UCS2 under the hood and CSV can use anything you like. Greg/David/Postgres seem to have an immediate problem with unicode support in Postgres and I can imagine they are keen to resolve it and I'd suggest they do it now in the most appropriate way for DBD::Pg. I don't see why this should necessarily impact on any discussion as to what DBI should_do/should_say as already the DBDs which support unicode mostly do it in different ways. I've started gathering together details of what unicode support there is in DBDs, how it is implemented and what special flags there are to support it. However, this is a massive task. So far I've done ODBC, Oracle, CSV, Unify, mysql, SQLite, Firebird and sort of held off on Pg as I knew Greg was working on it. Some might disagree but DB2 is a main one I no longer have access to (please contact me if you use DBD::DB2 and are prepared to spare half an hour or so to modify examples I have which verify unicode support). Of course, if you use another DBD and can send me info on unicode support I'd love to hear from you. I thought the whole issue was an interesting topic and I had toyed with doing a talk for LPW but to be honest, it is already taking a lot of time and I have personal issues right now (and of course my $work) which mean my time is severely limited so I'm doubtful right now if I could have it ready in time as a talk. I might just post what I have gathered in a weeks time in the hope I get a little more input in the mean time. DBD::Informix has had a couple of UTF-8 patches sent and one has been applied to the code. The other arrived this morning and has to be compared. The attribute names chosen are different, but both contain 'ix_' as a prefix and UTF-8 in some form or another. I couldn't find a respository for Informix so I couldn't see what these changes were. I've added your comments to my document - to be released tonight hopefully. What I'm not sure about is how to test the code. Creating an Informix database that has UTF-8 data in it is trivial (well, nearly trivial). The difficulty is demonstrating where there were problems before and that the problems are gone after. If anyone has suggestions for how to show that UTF8 is working properly - in the form of a fairly simple test case - I'd be very grateful to receive it as guidance. Attached to this post is some code I'm using however: o it tries to use type_info_all to find relevant column types and so far DBD::SQLite is the only one not to support type_info_all. If you've got type_info_all support try and make it work. You'll need to look for the find_type calls (2 of them) and adjust to add informix SQL types. Alternatively, if you have not got type_info_all support you'll need to replicate what I did for DBD::SQLite - see start of script. The code attempts to work around the problem that some DBDs do not report all possible columns back from type_info_all. 
o few DBDs seem to support column_info_all - if you have not got it - that test is skipped o you'll need an in.png PNG image which should preferably be smaller than the size of your blob column type. It does not matter what it is - any old small png will do. o no drivers so far support unicode parameter markers - what a surprise o some drivers have issues with blobs and unicode data in the same table - DBD::CSV notably o some drivers cannot create unicode table names or tables with unicode column names e.g., DBD::mysql. o no driver so far can do a table_info with a specified unicode (and existing) table name and return the table info. I suspect this is because the table_info and column_info calls are in XS using char * instead of Perl scalars. If you make this work for Informix please send it back to me. Same applies to any other DBD maintainers. So, DBD::Informix is endeavouring to move forward, knowing that the underlying database layers (ESQL/C on the client, and the Informix data server) handle UTF-8 OK, so any problems are in persuading Perl (and perhaps DBI) to handle it appropriately too. [...Did I hear a chorus of "nice theory - shame about the practice"?...] Martin # # Test unicode in a DBD - written for DBD::ODBC but should work for other # DBDs if you change the column types at the start of this script. # To run properly it needs an in.png PNG image file in the local working # directory but it does not matter what
Re: Add Unicode Support to the DBI
On Nov 4, 2011, at 10:33 AM, Martin J. Evans wrote: >> Did you ever get any data from DBD::SQLite folks? > > Yes. I found a bug in the process and it was fixed but I have a working > SQLite example. Oh, great. > I'm only really missing DB2 but I have contacts for that on #dbix-class who > I've just not yet poked. Cool, thanks for the update. Best, David
Re: Add Unicode Support to the DBI
On 04/11/11 16:39, David E. Wheeler wrote: On Nov 4, 2011, at 1:39 AM, Martin J. Evans wrote: Sorry David, I've been snowed under. I will try very hard to publish the research I found this weekend. Awesome, thanks. Did you ever get any data from DBD::SQLite folks? Yes. I found a bug in the process and it was fixed but I have a working SQLite example. I didn't think I was going to make LPW but it seems I will now - although it has cost me big time leaving it until the last minute. Your sacrifice is greatly appreciated. :-) Best, David I'm only really missing DB2 but I have contacts for that on #dbix-class who I've just not yet poked. Martin -- Martin J. Evans Easysoft Limited http://www.easysoft.com
Re: Add Unicode Support to the DBI
On Nov 4, 2011, at 1:39 AM, Martin J. Evans wrote: > Sorry David, I've been snowed under. I will try very hard to publish the > research I found this weekend. Awesome, thanks. Did you ever get any data from DBD::SQLite folks? > I didn't think I was going to make LPW but it seems I will now - although it > has cost me big time leaving it until the last minute. Your sacrifice is greatly appreciated. :-) Best, David
Re: Add Unicode Support to the DBI
On 03/11/11 23:25, David E. Wheeler wrote: On Oct 7, 2011, at 5:06 PM, David E. Wheeler wrote: Perhaps we could carve out some time at LPW to sit together and try to progress this. That would be awesome you guys! So gents, do you plan to do this a bit? Martin, do you have the data you wanted to collect on this? Thanks, David Sorry David, I've been snowed under. I will try very hard to publish the research I found this weekend. I didn't think I was going to make LPW but it seems I will now - although it has cost me big time leaving it until the last minute. Martin -- Martin J. Evans Easysoft Limited http://www.easysoft.com
Re: Add Unicode Support to the DBI
On Oct 7, 2011, at 5:06 PM, David E. Wheeler wrote: >> Perhaps we could carve out some time at LPW to sit together and try to >> progress this. > > That would be awesome you guys! So gents, do you plan to do this a bit? Martin, do you have the data you wanted to collect on this? Thanks, David
Re: Add Unicode Support to the DBI
On Oct 13, 2011, at 6:03 AM, Greg Sabino Mullane wrote: >> I think what I haven't said is that we should just use the same >> names that Perl I/O uses. Er, well, for the :raw and :utf8 >> varieties I was, anyway. Perhaps we should adopt it wholesale, >> so you'd use ":encoding(UTF-8)" instead of "UTF-8". > > That's pretty ugly. I don't think we need to adopt the I/O > convention, as there is no direct mapping anyway, it just > confuses the issue. Sure. In that case, I'd say ":utf8", ":raw", or "$encoding". >> For DBD::Pg, at least, if client-encoding is set to Big5, then >> you *have* to encode to send it to the database. Or change the >> client encoding, of course. > > Not sure I'm following this completely. Or rather, why this should > be the DBDs role. By default, yes, the DBD should DTRT here. But I think there also ought to be a way to tell it what to do. >>> How would one map things - just demand that >>> whatever is given must be a literal encoding the particular database >>> can understand? >> >> I think we should standardize on the Perl IO names for these things. >> Some databases may not support them all, of course. > > Hm... I don't know enough about the various DB's encodings to see > how good an idea that is. I assume that it's all over the map, so we should be as general as we can. Specifying an encoding by name should cover everything. >> No, I think just "encoding", and "utf8" would be invalid, >> but ":encoding(UTF-8)" would not. > > Again, ugh. Although a *little* less confusing when contrasting: > > $dbh->{encoding} = ':encoding(utf-8)'; > > $dbh->{encoding} = 'utf8'; Yeah, or we can go with my original suggestion: $dbh->{encoding} = 'UTF-8'; $dbh->{encoding} = ':utf8'; >> Well, I think we might have to have it with the pg_prefix until >> this stuff is finalized here. Not sure, though. > > That's my point - if we can get it finalized here, we can avoif the > pg_prefix entirely, rather than add it now and then deprecate it later. Sure. I suspect this is going to take a while, though. Best, David
Re: Add Unicode Support to the DBI
-BEGIN PGP SIGNED MESSAGE-
Hash: RIPEMD160

David E. Wheeler wrote:
> I think what I haven't said is that we should just use the same
> names that Perl I/O uses. Er, well, for the :raw and :utf8
> varieties I was, anyway. Perhaps we should adopt it wholesale,
> so you'd use ":encoding(UTF-8)" instead of "UTF-8".

That's pretty ugly. I don't think we need to adopt the I/O convention, as there is no direct mapping anyway; it just confuses the issue.

> For DBD::Pg, at least, if client-encoding is set to Big5, then
> you *have* to encode to send it to the database. Or change the
> client encoding, of course.

Not sure I'm following this completely. Or rather, why this should be the DBD's role.

>> How would one map things - just demand that
>> whatever is given must be a literal encoding the particular database
>> can understand?
>
> I think we should standardize on the Perl IO names for these things.
> Some databases may not support them all, of course.

Hm... I don't know enough about the various DBs' encodings to see how good an idea that is.

>> So the above means these two actually behave very differently:
>>
>> $dbh->{encoding} = ':utf8';
>>
>> $dbh->{encoding} = 'utf8';
>>
>> Could be a little confusing, no? Methinks we need some long ugly name, maybe
>> even worse than "perl_native". Perhaps "perl_internal_utf8_flag"? 1/2 :)

> No, I think just "encoding", and "utf8" would be invalid,
> but ":encoding(UTF-8)" would not.

Again, ugh. Although a *little* less confusing when contrasting:
>
> $dbh->{encoding} = ':encoding(utf-8)';
>
> $dbh->{encoding} = 'utf8';

Yeah, or we can go with my original suggestion:

$dbh->{encoding} = 'UTF-8';
$dbh->{encoding} = ':utf8';

>> Well, I think we might have to have it with the pg_ prefix until
>> this stuff is finalized here. Not sure, though.
>
> That's my point - if we can get it finalized here, we can avoid the
> pg_ prefix entirely, rather than add it now and then deprecate it later.

- --
Greg Sabino Mullane g...@turnstep.com
PGP Key: 0x14964AC8 201110130902
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iEYEAREDAAYFAk6W4ZQACgkQvJuQZxSWSsiqUQCgo/icUz0enqn0BWSygNSeNJGW
lDsAoMbjgZrsGJyS7kS60RgNNkpXMIjG
=43Q3
-END PGP SIGNATURE-
Re: Add Unicode Support to the DBI
On Oct 7, 2011, at 1:47 AM, Tim Bunce wrote: > Perhaps we could carve out some time at LPW to sit together and try to > progress this. That would be awesome you guys! D
Re: Add Unicode Support to the DBI
On Tue, Oct 04, 2011 at 11:24:51PM +0100, Martin J. Evans wrote: > On 04/10/2011 22:38, Tim Bunce wrote: > >I've not had time to devote to this thread. Sorry. > > > >I'd be grateful if someone could post a summary of it if/when it > >approaches some kind of consensus. > I don't think there is a "kind of consensus" right now (although > some useful discussion which probably will bear fruit) and I'd > prefer to work out what unicode support already exists and how it is > implemented first. [...] > > I've started gathering together details of what unicode support > there is in DBDs, how it is implemented and what special flags there > are to support it. However, this is a massive task. So far I've done > ODBC, Oracle, CSV, Unify, mysql, SQLite, Firebird That seems like enough to make a start on design issues. There's certainly a wide range of databases in that list :) > examples I have which verify unicode support Are the examples in a repo somewhere? > I thought the whole issue was an interesting topic and I had toyed > with doing a talk for LPW but to be honest, it is already taking a > lot of time and I have personal issues right now (and of course my > $work) which mean my time is severely limited so I'm doubtful right > now if I could have it ready in time as a talk. I might just post > what I have gathered in a weeks time in the hope I get a little more > input in the mean time. Good idea. I'd like to see the examples sometime to see what scope they have and if/how the might be improved. Perhaps we could carve out some time at LPW to sit together and try to progress this. Tim.
Re: Add Unicode Support to the DBI
On Oct 6, 2011, at 8:56 AM, Greg Sabino Mullane wrote:

>> I still prefer an encoding attribute that you can set as follows:
>
>> * undef: Default; same as your A.
>> * ':utf8': Same as your B:
>> * ':raw': Same as your C
>> * $encoding: Encode/decode to/from $encoding
>
> I like that. Although the names are still odd.

I think what I haven't said is that we should just use the same names that Perl I/O uses. Er, well, for the :raw and :utf8 varieties I was, anyway. Perhaps we should adopt it wholesale, so you'd use ":encoding(UTF-8)" instead of "UTF-8". Some details on the PerlIO names are here:

https://metacpan.org/module/open

I'm sure there's a more detailed description somewhere, though I couldn't find it.

> I guess it does map though: raw means no utf8 flag.

Right. And the default would be either :raw or :utf8, and should probably be left to the DBD (depending on the importance of backward compatibility vs. trying to DTRT).

> Still not sure about the encode 'to', but I'll start thinking about how we could implement the 'from' in DBD::Pg.

For DBD::Pg, at least, if client-encoding is set to Big5, then you *have* to encode to send it to the database. Or change the client encoding, of course.

> How would one map things - just demand that whatever is given must be a literal encoding the particular database can understand?

I think we should standardize on the Perl IO names for these things. Some databases may not support them all, of course.

>> With an encoding attribute, you don't need the utf8_flag at all.
>
> Right, +1
>
> So the above means these two actually behave very differently:
>
> $dbh->{encoding} = ':utf8';
>
> $dbh->{encoding} = 'utf8';
>
> Could be a little confusing, no? Methinks we need some long ugly name, maybe even worse than "perl_native". Perhaps "perl_internal_utf8_flag"? 1/2 :)

No, I think just "encoding", and "utf8" would be invalid, but ":encoding(UTF-8)" would not.

> Thanks for plugging away at this. My short term goal is to get this finalized enough that I can release the next version of DBD::Pg without a 'pg_' prefix to control the encoding items.

Well, I think we might have to have it with the pg_ prefix until this stuff is finalized here. Not sure, though.

Best,

David
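To make the Big5 point concrete: a rough sketch of what the application has to do today when the client encoding is Big5 and the driver passes octets through untouched (under the proposal, the driver would do this on the application's behalf). The DSN, table and column names are placeholders.

use strict;
use warnings;
use DBI;
use Encode ();

my $dbh = DBI->connect('dbi:Pg:dbname=test', '', '', { RaiseError => 1 });
$dbh->do(q{SET client_encoding TO 'BIG5'});

my $name   = "\x{4E2D}\x{6587}";                                # Perl character string
my $octets = Encode::encode('big5', $name, Encode::FB_CROAK);   # encode before sending

$dbh->do('INSERT INTO people (name) VALUES (?)', undef, $octets);

my ($raw) = $dbh->selectrow_array('SELECT name FROM people WHERE name = ?',
                                  undef, $octets);
my $chars = Encode::decode('big5', $raw, Encode::FB_CROAK);     # decode on the way back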
Re: Add Unicode Support to the DBI
> Uh, say what? Just as I need to
>
> binmode STDOUT, ':utf8';
>
> Before sending stuff to STDOUT (that is, turn off the flag), I would expect DBDs to do the same before sending data to the database. Unless, of course, it "just works".

I cannot imagine the flag really matters either way. We (Pg) simply dump a bunch of chars to the database, and build it by slurping in the string character by character until we hit a null. I suppose other databases may do things differently, but I can't imagine how/why.

>> Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.
>
> Yeah, better, though it just perpetuates Perl's unfortunate use of the term "utf8" for "internal string representation." Though I suppose that ship has sunk already.

Yep. To paraphrase horribly, "Perl's unicode support is the worst, except for all the other languages".

>> Because it may still need to convert things. See the ODBC discussion.
>
> Oh, so you're saying it will decode and encode between Perl's internal form and UTF-8, rather than just flip the flag on and off?

Yes, that's a possibility.

> Yes, because you were only talking about utf8 and UTF-8, not any other encodings. Unless I missed something. If the data coming back from the DB is Big5, I may well want to have some way to decode it (and to encode it for write statements).

You mean at the DBD level - such that you can say to the database, I don't care what encoding you stored it as, I want it encoded as X when you give it back to me? (update: yes, see below)

>> Well, because utf-8 is pretty much a defacto encoding, or at least way, way more popular than things like ucs2. Also, the Perl utf8 flag encourages us to put everything into UTF-8.
>
> Yeah, but again, that might be some reason to call it something else, like "perl_native" or something. The fact that it happens to be UTF-8 should be irrelevant. Er, except, I guess, you still have to know the encoding of the database.

Well, I wouldn't call it irrelevant, but at the end of the day, we can call it perl_native, but that's just going to cause people to look it up in the docs and then say "aha! that means the utf8 flag is on" and then they have "perl_native -> utf8" burned into their head. Or worse, "perl_native -> unicode". :)

>> * 'A': the default, it means the DBD should do the best thing, which in most cases means setting SvUTF8_on if the data coming back is UTF-8.
>> * 'B': (on). The DBD should make every effort to set SvUTF8_on for returned data, even if it thinks it may not be UTF-8.
>> * 'C': (off). The DBD should not call SvUTF8_on, regardless of what it thinks the data is.
>
> I still prefer an encoding attribute that you can set as follows:
> * undef: Default; same as your A.
> * ':utf8': Same as your B:
> * ':raw': Same as your C
> * $encoding: Encode/decode to/from $encoding

I like that. Although the names are still odd. I guess it does map though: raw means no utf8 flag. Still not sure about the encode 'to', but I'll start thinking about how we could implement the 'from' in DBD::Pg. How would one map things - just demand that whatever is given must be a literal encoding the particular database can understand?

> With an encoding attribute, you don't need the utf8_flag at all.

Right, +1

So the above means these two actually behave very differently:

$dbh->{encoding} = ':utf8';

$dbh->{encoding} = 'utf8';

Could be a little confusing, no? Methinks we need some long ugly name, maybe even worse than "perl_native". Perhaps "perl_internal_utf8_flag"? 1/2 :)

Thanks for plugging away at this. My short term goal is to get this finalized enough that I can release the next version of DBD::Pg without a 'pg_' prefix to control the encoding items.

--
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201110061151
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
Re: Add Unicode Support to the DBI
On Tue, 04 Oct 2011 23:24:51 +0100, "Martin J. Evans" wrote: > Some might disagree but DB2 is a main > one I no longer have access to (please contact me if you use DBD::DB2 > and are prepared to spare half an hour or so to modify examples I have > which verify unicode support). Of course, if you use another DBD and can > send me info on unicode support I'd love to hear from you. There are two DB2 users on PerlMonks, who are rather helpful in testing areas. Ask [Tanktalus] & [talexb] :) $ grep -i -w db2 < "30 FreeNode-#cbstream.log" | grep -i -e unicode -e utf Sep 30 19:28:20 [Tanktalus] /me contemplates how to get unicode strings back out of db2 ... Feb 27 16:29:01 [talexb] /me returns to fiddling with utf-8 in DB2. Mar 06 17:56:26 [talexb] [Tanktalus]: Hey, looks like I solved the utf-8 DB2 problem .. just increase the field to 3n from n and it seems to work. Dec 16 21:19:11 [Tanktalus] The DB does. Thus, I would expect that the DBD *could* request data in whatever encoding it wants. And since it would want utf8 to mesh with Perl properly, and every encoding that DB2 supports has a mapping to utf8, this should be doable. Dec 16 21:25:24 [Tanktalus] [ambrus]: yeah, I would like to see it as "mandatory" for byte strings that are already utf-8 (which the DBD driver should be able to tell trivially, at least on DB2), or for connections/statements/something where said decoding was requested. -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00, 11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
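The "increase the field to 3n from n" remark in that log is the usual byte-versus-character sizing issue: a column sized in bytes needs up to 3n bytes for n BMP characters once the data is UTF-8 encoded. In plain Perl terms (no DB2 required):

use strict;
use warnings;
use Encode ();

my $chars  = "\x{4E2D}\x{6587}\x{8A9E}";        # 3 characters
my $octets = Encode::encode('UTF-8', $chars);

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($octets);
# prints "3 characters, 9 UTF-8 bytes"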
Re: Add Unicode Support to the DBI
On Tue, Oct 4, 2011 at 15:24, Martin J. Evans wrote: > On 04/10/2011 22:38, Tim Bunce wrote: > >> I've not had time to devote to this thread. Sorry. >> >> I'd be grateful if someone could post a summary of it if/when it >> approaches some kind of consensus. >> >> I don't think there is a "kind of consensus" right now (although some > useful discussion which probably will bear fruit) and I'd prefer to work out > what unicode support already exists and how it is implemented first. For > instance, Pg is very focussed on UTF-8 (as are most DBDs) and yet ODBC uses > UCS2 under the hood and CSV can use anything you like. Greg/David/Postgres > seem to have an immediate problem with unicode support in Postgres and I can > imagine they are keen to resolve it and I'd suggest they do it now in the > most appropriate way for DBD::Pg. I don't see why this should necessarily > impact on any discussion as to what DBI should_do/should_say as already the > DBDs which support unicode mostly do it in different ways. > > I've started gathering together details of what unicode support there is in > DBDs, how it is implemented and what special flags there are to support it. > However, this is a massive task. So far I've done ODBC, Oracle, CSV, Unify, > mysql, SQLite, Firebird and sort of held off on Pg as I knew Greg was > working on it. Some might disagree but DB2 is a main one I no longer have > access to (please contact me if you use DBD::DB2 and are prepared to spare > half an hour or so to modify examples I have which verify unicode support). > Of course, if you use another DBD and can send me info on unicode support > I'd love to hear from you. > > I thought the whole issue was an interesting topic and I had toyed with > doing a talk for LPW but to be honest, it is already taking a lot of time > and I have personal issues right now (and of course my $work) which mean my > time is severely limited so I'm doubtful right now if I could have it ready > in time as a talk. I might just post what I have gathered in a weeks time in > the hope I get a little more input in the mean time. DBD::Informix has had a couple of UTF-8 patches sent and one has been applied to the code. The other arrived this morning and has to be compared. The attribute names chosen are different, but both contain 'ix_' as a prefix and UTF-8 in some form or another. What I'm not sure about is how to test the code. Creating an Informix database that has UTF-8 data in it is trivial (well, nearly trivial). The difficulty is demonstrating where there were problems before and that the problems are gone after. If anyone has suggestions for how to show that UTF8 is working properly - in the form of a fairly simple test case - I'd be very grateful to receive it as guidance. So, DBD::Informix is endeavouring to move forward, knowing that the underlying database layers (ESQL/C on the client, and the Informix data server) handle UTF-8 OK, so any problems are in persuading Perl (and perhaps DBI) to handle it appropriately too. [...Did I hear a chorus of "nice theory - shame about the practice"?...] -- Jonathan Leffler #include Guardian of DBD::Informix - v2011.0612 - http://dbi.perl.org "Blessed are we who can laugh at ourselves, for we shall never cease to be amused."
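A generic round-trip sketch along the lines Jonathan asks for might be a starting point. The DSN, table and the ix_EnableUTF8 attribute are placeholders (the real ix_* attribute name was still being decided), and the checks only make sense for a driver that is supposed to hand back decoded character data.

use strict;
use warnings;
use utf8;                      # string literals below are character data
use Test::More;
use DBI;

my $dbh = eval {
    DBI->connect('dbi:Informix:testdb', undef, undef,
                 { RaiseError => 1, PrintError => 0,
                   ix_EnableUTF8 => 1 });       # placeholder attribute name
};
plan skip_all => 'no test database available' unless $dbh;

my $original = "Ärger 中文 naïve";              # mixed Latin-1-range and CJK characters

$dbh->do('CREATE TEMP TABLE utf8_test (c VARCHAR(100))');
$dbh->do('INSERT INTO utf8_test (c) VALUES (?)', undef, $original);
my ($got) = $dbh->selectrow_array('SELECT c FROM utf8_test');

ok(utf8::is_utf8($got), 'driver returned a decoded character string');
is(length($got), length($original), 'length counted in characters, not octets');
is($got, $original, 'round trip preserved the data');

done_testing();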
Re: Add Unicode Support to the DBI
On 04/10/2011 22:38, Tim Bunce wrote: I've not had time to devote to this thread. Sorry. I'd be grateful if someone could post a summary of it if/when it approaches some kind of consensus. Thanks. Tim. I don't think there is a "kind of consensus" right now (although some useful discussion which probably will bear fruit) and I'd prefer to work out what unicode support already exists and how it is implemented first. For instance, Pg is very focussed on UTF-8 (as are most DBDs) and yet ODBC uses UCS2 under the hood and CSV can use anything you like. Greg/David/Postgres seem to have an immediate problem with unicode support in Postgres and I can imagine they are keen to resolve it and I'd suggest they do it now in the most appropriate way for DBD::Pg. I don't see why this should necessarily impact on any discussion as to what DBI should_do/should_say as already the DBDs which support unicode mostly do it in different ways. I've started gathering together details of what unicode support there is in DBDs, how it is implemented and what special flags there are to support it. However, this is a massive task. So far I've done ODBC, Oracle, CSV, Unify, mysql, SQLite, Firebird and sort of held off on Pg as I knew Greg was working on it. Some might disagree but DB2 is a main one I no longer have access to (please contact me if you use DBD::DB2 and are prepared to spare half an hour or so to modify examples I have which verify unicode support). Of course, if you use another DBD and can send me info on unicode support I'd love to hear from you. I thought the whole issue was an interesting topic and I had toyed with doing a talk for LPW but to be honest, it is already taking a lot of time and I have personal issues right now (and of course my $work) which mean my time is severely limited so I'm doubtful right now if I could have it ready in time as a talk. I might just post what I have gathered in a weeks time in the hope I get a little more input in the mean time. Martin
Re: Add Unicode Support to the DBI
I've not had time to devote to this thread. Sorry. I'd be grateful if someone could post a summary of it if/when it approaches some kind of consensus. Thanks. Tim.
Re: Add Unicode Support to the DBI
On Oct 2, 2011, at 8:49 PM, Greg Sabino Mullane wrote:

> DEW> I assume you also mean to say that data sent *to* the database
> DEW> has the flag turned off, yes?
>
> No: that is undefined. I don't see it as the DBD's job to massage data going into the database. Or at least, I cannot imagine a DBI interface for that.

Uh, say what? Just as I need to

binmode STDOUT, ':utf8';

Before sending stuff to STDOUT (that is, turn off the flag), I would expect DBDs to do the same before sending data to the database. Unless, of course, it "just works".

> DEW> Yeah, maybe should be utf8_flag instead.
>
> Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.

Yeah, better, though it just perpetuates Perl's unfortunate use of the term "utf8" for "internal string representation." Though I suppose that ship has sunk already.

> Yeah, that last one is the current Postgres plan. Which I think should be best practice and a default DBI expectation.

Agreed.

> DEW> DBDs will decode the data as needed.
> DEW> I don't understand this sentence. If the flag is
> DEW> flipped, why will it decode?
>
> Because it may still need to convert things. See the ODBC discussion.

Oh, so you're saying it will decode and encode between Perl's internal form and UTF-8, rather than just flip the flag on and off?

> GSM>> If this is set off, the utf8 flag will never be set, and no
> GSM>> decoding will be done on data coming back from the database.
>
> DEW> What if the data coming back from the database
> DEW> is Big5 and I want to decode it?
>
> Eh? You just asked above why would we ever decode it?

Yes, because you were only talking about utf8 and UTF-8, not any other encodings. Unless I missed something. If the data coming back from the DB is Big5, I may well want to have some way to decode it (and to encode it for write statements).

> DEW> You mean never allow it to be flipped when the
> DEW> database encoding is SQL_ASCII?
>
> Yes, basically. But perhaps it does not matter too much. SQL_ASCII is such a bad idea anyway, I feel no need to coddle people using it. :)

+1

> MJE> So is the problem that sometimes a DBD does not know what to encode data
> MJE> being sent to the database or how/whether to decode data coming back from
> MJE> the database? and if that is the case do we need some settings in DBI
> MJE> to tell a DBD?
>
> I think that's one of the things that is being argued for, here.

Yes.

> MJE> I think this was my point above, i.e., why utf8? databases accept and
> MJE> supply a number of encodings so why have a flag called utf8? are we
> MJE> going to have ucs2, utf16, utf32 flags as well. Surely, it makes more
> MJE> sense to have a flag where you can set the encoding in the same form
> MJE> Encode uses.
>
> Well, because utf-8 is pretty much a defacto encoding, or at least way, way more popular than things like ucs2. Also, the Perl utf8 flag encourages us to put everything into UTF-8.

Yeah, but again, that might be some reason to call it something else, like "perl_native" or something. The fact that it happens to be UTF-8 should be irrelevant. Er, except, I guess, you still have to know the encoding of the database.

> MJE> and what about when the DBD knows you are wrong because the database
> MJE> says it is returning data in encoding X but you ask for Y.
>
> I would assume that the DBD should attempt to convert it to Y if that is what the user wants.

And throw exceptions as appropriate (encoding/decoding failure).

> MJE> (examples of DBD flags)
>
> Almost all the examples from DBDs seem to be focusing on the SvUTF8 flag, so perhaps we should start by focusing on that, or at least decoupling that entirely from decoding? If we assume that the default DBI behavior, or more specifically the default behavior for a random DBD someone picks up is "flip the flag on if the data is known to be UTF-8", then we can propose a DBI attribute, call it utf8_flag, that has three states:
>
> * 'A': the default, it means the DBD should do the best thing, which in most cases means setting SvUTF8_on if the data coming back is UTF-8.
> * 'B': (on). The DBD should make every effort to set SvUTF8_on for returned data, even if it thinks it may not be UTF-8.
> * 'C': (off). The DBD should not call SvUTF8_on, regardless of what it thinks the data is.

I still prefer an encoding attribute that you can set as follows:

* undef: Default; same as your A.
* ':utf8': Same as your B:
* ':raw': Same as your C
* $encoding: Encode/decode to/from $encoding

> I presume the other half would be an encoding, such that $h->{encoding} would basically ask the DBD to make any returned data into that encoding, by hook or by crook.

With an encoding attribute, you don't need the utf8_flag at all.

Best,

David
Re: Add Unicode Support to the DBI
From: "David E. Wheeler"

GSM>> * $h->{unicode_flag}
GSM>> If this is set on, data returned from the database is assumed to be UTF-8, and
GSM>> the utf8 flag will be set.

DEW> I assume you also mean to say that data sent *to* the database
DEW> has the flag turned off, yes?

No: that is undefined. I don't see it as the DBD's job to massage data going into the database. Or at least, I cannot imagine a DBI interface for that.

From: "Martin J. Evans"

MJE> There is more than one way to encode unicode - not everyone uses
MJE> UTF-8; although some encodings don't support all of unicode.

Right, but I'm talking utf8 here. There are only two things that can be done with the strings returned from a database: flip the utf8 flag, or convert/decode it to something else. If it's anything but utf-8, the utf-8 flag is useless at best, harmful at worst.

DEW> Yeah, maybe should be utf8_flag instead.

Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.

MJE> 4) don't decode the data, the DBD knows it is say UTF-8 encoded
MJE> and simply sets the UTF-8 flag (which from what I read is horribly
MJE> flawed but seems to work for me).

Yeah, that last one is the current Postgres plan. Which I think should be best practice and a default DBI expectation.

DEW> DBDs will decode the data as needed.
DEW> I don't understand this sentence. If the flag is
DEW> flipped, why will it decode?

Because it may still need to convert things. See the ODBC discussion.

GSM>> If this is set off, the utf8 flag will never be set, and no
GSM>> decoding will be done on data coming back from the database.

DEW> What if the data coming back from the database
DEW> is Big5 and I want to decode it?

Eh? You just asked above why would we ever decode it?

DEW> You mean never allow it to be flipped when the
DEW> database encoding is SQL_ASCII?

Yes, basically. But perhaps it does not matter too much. SQL_ASCII is such a bad idea anyway, I feel no need to coddle people using it. :)

MJE> So is the problem that sometimes a DBD does not know what to encode data
MJE> being sent to the database or how/whether to decode data coming back from
MJE> the database? and if that is the case do we need some settings in DBI
MJE> to tell a DBD?

I think that's one of the things that is being argued for, here.

MJE> I think this was my point above, i.e., why utf8? databases accept and
MJE> supply a number of encodings so why have a flag called utf8? are we
MJE> going to have ucs2, utf16, utf32 flags as well. Surely, it makes more
MJE> sense to have a flag where you can set the encoding in the same form
MJE> Encode uses.

Well, because utf-8 is pretty much a defacto encoding, or at least way, way more popular than things like ucs2. Also, the Perl utf8 flag encourages us to put everything into UTF-8.

MJE> and what about when the DBD knows you are wrong because the database
MJE> says it is returning data in encoding X but you ask for Y.

I would assume that the DBD should attempt to convert it to Y if that is what the user wants.

MJE> DBD::Oracle to my knowledge has no special flags; it just attempts to do
MJE> the right thing but it favours speed so most data that is supposed to be
MJE> UTF-8 encoded has SvUTF8_on set but in one case (error messages) it
MJE> properly and strictly decodes the message so long as your Perl is recent
MJE> enough else it uses SvUTF8_on.

I'm not sure I understand this. It takes UTF-8 errors from the database, changes them to something else, and does NOT set SvUTF8?

MJE> (examples of DBD flags)

Almost all the examples from DBDs seem to be focusing on the SvUTF8 flag, so perhaps we should start by focusing on that, or at least decoupling that entirely from decoding? If we assume that the default DBI behavior, or more specifically the default behavior for a random DBD someone picks up is "flip the flag on if the data is known to be UTF-8", then we can propose a DBI attribute, call it utf8_flag, that has three states:

* 'A': the default, it means the DBD should do the best thing, which in most cases means setting SvUTF8_on if the data coming back is UTF-8.
* 'B': (on). The DBD should make every effort to set SvUTF8_on for returned data, even if it thinks it may not be UTF-8.
* 'C': (off). The DBD should not call SvUTF8_on, regardless of what it thinks the data is.

I presume the other half would be an encoding, such that $h->{encoding} would basically ask the DBD to make any returned data into that encoding, by hook or by crook.

--
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201110022345
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
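The distinction drawn above (flip the flag versus convert/decode) can be shown in a few lines of plain Perl, with no DBI involved; Encode::_utf8_on is roughly the pure-Perl equivalent of SvUTF8_on at the XS level. The byte strings here are made up for illustration.

use strict;
use warnings;
use Encode ();

# Octets as a driver might receive them from the wire.
my $good = "caf\xC3\xA9";    # "café" encoded as UTF-8
my $bad  = "caf\xC3";        # truncated multi-byte sequence

# 1. "Flip the flag": fast, but nothing checks the octets really are UTF-8.
my $flipped = $bad;
Encode::_utf8_on($flipped);
print "flag is on, data unchecked\n" if utf8::is_utf8($flipped);

# 2. Strict decode: malformed input is rejected up front.
my $chars = Encode::decode('UTF-8', $good, Encode::FB_CROAK);
printf "good data decodes to %d characters\n", length $chars;

my $fail = eval { Encode::decode('UTF-8', $bad, Encode::FB_CROAK) };
print "strict decode rejected the bad data: $@" if $@;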
Re: Add Unicode Support to the DBI
On Sep 22, 2011, at 11:57 AM, Martin J. Evans wrote:

> ok except what the oracle client libraries accept does not match with Encode accepted strings so someone would have to come up with some sort of mapping between the two.

Yes. That's one of the consequences of providing a single interface to multiple databases.

>>> and what about when it conflicts with your locale/LANG?
>> So what?
> I'm not so sure this is a "So what" as Perl itself uses locale settings in some cases - just thought it needed mentioning for consideration.

I'm not really concerned about locales at this point. I tend to leave collation, for example, up to the database. Right now I'm strictly concerned about encoding.

>>> and what about PERL_UNICODE flags, do they come into this?
>> What are those?
> See http://perldoc.perl.org/perlrun.html
>
> In particular "UTF-8 is the default PerlIO layer for input streams" of which reading data from a database could be considered one?

That'd be cool, but it's not currently implemented that way, obviously. DBI and PerlIO are completely independent AFAIK, and the DBI doesn't look like a file handle.

> ok, I'm thinking through the ramifications of this.
>
> To add to the list I see DBD::SQLite has sqlite_unicode: "strings coming from the database and passed to the collation function will be properly tagged with the utf8 flag; but this only works if the sqlite_unicode attribute is set before the first call to a perl collation sequence" and "The current FTS3 implementation in SQLite is far from complete with respect to utf8 handling: in particular, variable-length characters are not treated correctly by the builtin functions offsets() and snippet()."
>
> and DBD::CSV has
>
> f_encoding => "utf8",
>
> DBD::mysql has mysql_enable_utf8 which apparently "This attribute determines whether DBD::mysql should assume strings stored in the database are utf8. This feature defaults to off."
>
> I could not find any special flags for DBD::DB2.
>
> DBD::Sybase has syb_enable_utf8 "If this attribute is set then DBD::Sybase will convert UNIVARCHAR, UNICHAR, and UNITEXT data to Perl's internal utf-8 encoding when they are retrieved. Updating a unicode column will cause Sybase to convert any incoming data from utf-8 to its internal utf-16 encoding."

Yeah, so I think that can be generalized.

Best,

David
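Seen from the application side, the fragmentation looks roughly like this: one intent ("give me decoded character data"), spelled differently for each driver. The attribute names are the ones quoted above plus pg_enable_utf8 (the pre-existing DBD::Pg knob); the DSNs are placeholders.

use strict;
use warnings;
use DBI;

my %connect_args = (
    SQLite => [ 'dbi:SQLite:dbname=test.db', '', '', { sqlite_unicode    => 1 } ],
    mysql  => [ 'dbi:mysql:database=test',   '', '', { mysql_enable_utf8 => 1 } ],
    Pg     => [ 'dbi:Pg:dbname=test',        '', '', { pg_enable_utf8    => 1 } ],
    CSV    => [ 'dbi:CSV:',                  '', '', { f_encoding        => 'utf8' } ],
);

for my $driver (sort keys %connect_args) {
    my ($dsn, $user, $pass, $attr) = @{ $connect_args{$driver} };
    my $dbh = eval { DBI->connect($dsn, $user, $pass, { RaiseError => 1, %$attr }) }
        or next;
    # ... identical application code, driver-specific unicode semantics ...
    $dbh->disconnect;
}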
Re: Add Unicode Support to the DBI
On 22/09/2011 19:28, David E. Wheeler wrote:

On Sep 22, 2011, at 11:14 AM, Martin J. Evans wrote:

Right. There needs to be a way to tell the DBI what encoding the server sends and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of useless.

I think this was my point above, i.e., why utf8? databases accept and supply a number of encodings so why have a flag called utf8? are we going to have ucs2, utf16, utf32 flags as well. Surely, it makes more sense to have a flag where you can set the encoding in the same form Encode uses.

Yes, I agreed with you. :-)

progress then ;-)

Unless I'm mistaken as to what you refer to I believe that is a feature of the Oracle client libraries and not one of DBD::Oracle so there is little we can do about that.

Sure you can. I set something via the DBI interface and the DBD sets the environment variable for the Oracle client libraries.

ok except what the oracle client libraries accept does not match with Encode accepted strings so someone would have to come up with some sort of mapping between the two. e.g., my current setting is:

NLS_LANG=AMERICAN_AMERICA.AL32UTF8

So to try and move forward, we'd be talking about a flag or flags which say:

1 encode the data sent to the database like "this" (which could be nothing)
2 decode the data retrieved from the database like "this" (which could be nothing but if not nothing it could be using strict or loose for the UTF-8 and utf-8 case)
3 don't decode but use SvUTF8_on (a specific case since Perl uses that internally and a number of databases return UTF-8) one that seems to work but I worry about.
4 do what the DBD thinks is best - whatever the behaviour is now?

Yes.

and what about when it conflicts with your locale/LANG?

So what?

I'm not so sure this is a "So what" as Perl itself uses locale settings in some cases - just thought it needed mentioning for consideration.

and what about PERL_UNICODE flags, do they come into this?

What are those?

See http://perldoc.perl.org/perlrun.html

In particular "UTF-8 is the default PerlIO layer for input streams" of which reading data from a database could be considered one?

and what about when the DBD knows you are wrong because the database says it is returning data in encoding X but you ask for Y.

Throw an exception or a warning.

ok

and for DBD::ODBC built for unicode API am I expected to try and decode UCS2 as x just because the flag tells me to and I know it will not work? Seems like it only applies to the ANSI API in DBD::ODBC where the data could be UTF-8 encoded in a few (possibly broken see http://www.martin-evans.me.uk/node/20#unicode) cases.

If the user does something that makes no sense, tell them it makes no sense. Die if necessary.

I still think it would help to name some specific cases per DBD of flags in use and why they exist:

DBD::ODBC has a odbc_utf8_on flag to say that data returned by the database when using the ANSI APIs is UTF-8 encoded and currently it calls SvUTF8_on on that data (I've never used or verified it works myself but the person supplying the patch said it had a purpose with a particular Postgres based database).

That's what the new DBD::Pg flag that Greg's working on does, too.

Beyond that DBD::ODBC has no other flags as it knows in the unicode/wide APIs the data is UCS2 encoded and it checks it is valid when decoding it. Similarly when sending data to the database in the wide APIs it takes the Perl scalar and encodes it in UCS2.

Yeah, ideally, by default, if the DBD knows the encoding used by the database, it should just DTRT.

There are backward compatibility issues with that for DBD::Pg, though. So there probably should be a knob to say "don't do any encoding or decoding at all", because a lot of older apps likely expect that.

DBD::Oracle to my knowledge has no special flags; it just attempts to do the right thing but it favours speed so most data that is supposed to be UTF-8 encoded has SvUTF8_on set but in one case (error messages) it properly and strictly decodes the message so long as your Perl is recent enough else it uses SvUTF8_on.

So, what are the other flags in use and what purpose do they fulfill.

I think we could really just start with one flag, "encoding". By default the DBD should just try to do the right thing. If "encoding" is set to ":raw" then it should do no encoding or decoding. If it's set to ":utf8" it should just turn the flag on or off. If it's set to an actual encoding it should encode and decode. I think that would be a good start.

Best, David

ok, I'm thinking through the ramifications of this.

To add to the list I see DBD::SQLite has sqlite_unicode: "strings coming from the database and passed to the collation function will be properly tagged with the utf8 flag; but this only works if the sqlite_unicode attribute is set before the first call to a perl collation sequence" and "The current FTS3 implementation in SQLite is far from complete with respect to utf8 handling: in particular, variable-length characters are not treated correctly by the builtin functions offsets() and snippet()."
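A sketch of the NLS_LANG approach mentioned above: the setting has to reach the Oracle client libraries before the first connection is made, which in practice means setting it in the environment (here via a BEGIN block) rather than through any DBI attribute. The DSN and credentials are placeholders.

use strict;
use warnings;

BEGIN { $ENV{NLS_LANG} = 'AMERICAN_AMERICA.AL32UTF8' }   # client character set

use DBI;

my $dbh = DBI->connect('dbi:Oracle:ORCL', 'scott', 'tiger', { RaiseError => 1 });

# With an AL32UTF8 client character set DBD::Oracle treats returned string
# data as UTF-8 (largely via SvUTF8_on, as discussed in this thread).
my ($ok) = $dbh->selectrow_array(q{SELECT 'ok' FROM dual});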
Re: Add Unicode Support to the DBI
On Sep 22, 2011, at 11:14 AM, Martin J. Evans wrote: >> Right. There needs to be a way to tell the DBI what encoding the server >> sends and expects to be sent. If it's not UTF-8, then the utf8_flag option >> is kind of useless. > I think this was my point above, i.e., why utf8? databases accept and supply > a number of encodings so why have a flag called utf8? are we going to have > ucs2, utf16, utf32 flags as well. Surely, it makes more sense to have a flag > where you can set the encoding in the same form Encode uses. Yes, I agreed with you. :-) > Unless I'm mistaken as to what you refer to I believe that is a feature of > the Oracle client libraries and not one of DBD::Oracle so there is little we > can do about that. Sure you can. I set something via the DBI interface and the DBD sets the environment variable for the Oracle client libraries. > So to try and move forward, we'd we talking about a flag or flags which say: > > 1 encode the data sent to the database like "this" (which could be nothing) > 2 decode the data retrieved from the database like "this" (which could be > nothing but if not nothing it could be using strict or loose for the UTF-8 > and utf-8 case) > 3 don't decode but use SvUTF8_on (a specific case since Perl uses that > internally and a number of database return UTF-8) > one that seems to work but I worry about. > 4 do what the DBD thinks is best - whatever the behaviour is now? Yes. > and what about when it conflicts with your locale/LANG? So what? > and what about PERL_UNICODE flags, do they come into this? What are those? > and what about when the DBD knows you are wrong because the database says it > is returning data in encoding X but you ask for Y. Throw an exception or a warning. > and for DBD::ODBC built for unicode API am I expected to try and decode UCS2 > as x just because the flag tells me to and I know it will not work? Seems > like it only applies to the ANSI API in DBD::ODBC where the data could be > UTF-8 encoded in a few (possibly broken see > http://www.martin-evans.me.uk/node/20#unicode) cases. If the user does something that makes no sense, tell them it makes no sense. Die if necessary. > I still think it would help to name some specific cases per DBD of flags in > use and why they exist: > > DBD::ODBC has a odbc_utf8_on flag to say that data returned by the database > when using the ANSI APIs is UTF-8 encoded and currently it calls SvUTF8_on on > that data (I've never used or verified it works myself but the person > supplying the patch said it had a purpose with a particular Postgres based > database). That's what the new DBD::Pg flag that Greg's working on does, too. > Beyond that DBD::ODBC has no other flags as it knows in the unicode/wide APIs > the data is UCS2 encoded and it checks it is valid when decoding it. > Similarly when sending data to the database in the wide APIs it takes the > Perl scalar and encodes it in UCS2. Yeah, ideally, by default, if the DBD knows the encoding used by the database, it should just DTRT. There are backward compatibility issues with that for DBD::Pg, though. So there probably should be a knob to say "don't do any encoding or decoding at all", because a lot of older apps likely expect that. 
> DBD::Oracle to my knowledge has no special flags; it just attempts to do the > right thing but it favours speed so most data that is supposed to be UTF-8 > encoded has SvUTF8_on set but in one case (error messages) it properly and > strictly decodes the message so long as your Perl is recent enough else it > uses SvUTF8_on. > > So, what are the other flags in use and what purpose do they fulfill. I think we could really just start with one flag, "encoding". By default the DBD should just try to do the right thing. If "encoding" is set to ":raw" then it should do no encoding or decoding. If it's set to ":utf8" it should just turn the flag on or off. If it's set to an actual encoding it should encode and decode. I think that would be a good start. Best, David
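To pin those semantics down, here is an application-level emulation of the proposed attribute on a fetched row. A real implementation would live inside each driver's fetch (and bind) paths; the helper and the policy variable are invented names, and ':raw'/':utf8'/encoding-name values follow the proposal above.

use strict;
use warnings;
use Encode ();

# Emulation of the proposed $dbh->{encoding} semantics on one fetched row.
sub apply_encoding_policy {
    my ($policy, $row) = @_;
    return $row unless defined $policy;        # undef: leave it to the driver
    for my $col (@$row) {
        next unless defined $col;
        if ($policy eq ':raw') {
            # leave the octets exactly as the driver produced them
        }
        elsif ($policy eq ':utf8') {
            Encode::_utf8_on($col);            # just flip the flag, no checking
        }
        else {
            $col = Encode::decode($policy, $col, Encode::FB_CROAK);
        }
    }
    return $row;
}

# while (my $row = $sth->fetchrow_arrayref) {
#     apply_encoding_policy($encoding_policy, $row);   # e.g. 'UTF-8' or ':raw'
# }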
Re: Add Unicode Support to the DBI
On 22/09/2011 17:36, David E. Wheeler wrote: On Sep 22, 2011, at 2:26 AM, Martin J. Evans wrote: There is more than one way to encode unicode - not everyone uses UTF-8; although some encodings don't support all of unicode. Yeah, maybe should be utf8_flag instead. see below. unicode is not encoded as UTF-8 in ODBC using the wide APIs. Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes it. Using the ANSI APIs data is returned as octets and is whatever it is - it may be ASCII, it may be UTF-8 encoded (only in 2 cases I know and I believe they are flawed anyway) it may be something else in which case the application needs to know what it is. In the case of octets which are UTF-8 encoded DBD::ODBC has no idea that is the case unless you tell it and it will then set the UTF-8 flag (but see later). Right. There needs to be a way to tell the DBI what encoding the server sends and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of useless. I think this was my point above, i.e., why utf8? databases accept and supply a number of encodings so why have a flag called utf8? are we going to have ucs2, utf16, utf32 flags as well. Surely, it makes more sense to have a flag where you can set the encoding in the same form Encode uses. I'm not that familiar with Postgres (I've used a few times and not to any great degree) and I used MySQL for a while years ago. I occasionally use SQLite. I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling to see the problem that needs fixing. Is it just that some people would like a DBI flag which tells the DBD: 1) decode any data coming back from the database strictly such that if it is invalid you die 2) decode any data coming back from the database loosely (e.g., utf-8 vs UTF-8) 3) don't decode the data from the database at all 4) don't decode the data, the DBD knows it is say UTF-8 encoded and simply sets the UTF-8 flag (which from what I read is horribly flawed but seems to work for me). and the reverse. Yes, with one API for all drivers, if possible, and guidelines for how it should work (when to encode and decode, what to encode and decode, when to just flip the utf8 flag on and off, etc.). ok DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g. error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it does 4 on most of the column data IF Oracle is sending UTF-8. Yeah, but to enable it *you set a bloody environment variable*. WHAT? Unless I'm mistaken as to what you refer to I believe that is a feature of the Oracle client libraries and not one of DBD::Oracle so there is little we can do about that. My point being, doesn't the DBD know how the data is encoded when it gets it from the database? and it would hopefully know what the database needs when sending data. Perhaps in some conditions the DBD does not know this and needs to be told (I could imagine SQLite reading/writing straight to files for instance might want to know to open the file with UTF-8 layer). Or to turn it off, so you can just pass the encoded UTF-8 through to the file without the decode/encode round-trip. So is the problem that sometimes a DBD does not know what to encode data being sent to the database or how/whether to decode data coming back from the database? and if that is the case do we need some settings in DBI to tell a DBD? 
That's an issue, yes, but the main issue is that all the drivers do it differently, sometimes with different semantics, and lack all the functionality one might want (e.g., your examples 1-4). Best, David

So to try and move forward, we'd be talking about a flag or flags which say:

1 encode the data sent to the database like "this" (which could be nothing)
2 decode the data retrieved from the database like "this" (which could be nothing but if not nothing it could be using strict or loose for the UTF-8 and utf-8 case)
3 don't decode but use SvUTF8_on (a specific case since Perl uses that internally and a number of databases return UTF-8) one that seems to work but I worry about.
4 do what the DBD thinks is best - whatever the behaviour is now?

and what about when it conflicts with your locale/LANG?

and what about PERL_UNICODE flags, do they come into this?

and what about when the DBD knows you are wrong because the database says it is returning data in encoding X but you ask for Y.

and for DBD::ODBC built for unicode API am I expected to try and decode UCS2 as x just because the flag tells me to and I know it will not work? Seems like it only applies to the ANSI API in DBD::ODBC where the data could be UTF-8 encoded in a few (possibly broken see http://www.martin-evans.me.uk/node/20#unicode) cases.

I still think it would help to name some specific cases per DBD of flags in use and why they exist:

DBD::ODBC has a odbc_utf8_on flag to say that data returned by the database when using the ANSI APIs is UTF-8 encoded and currently it calls SvUTF8_on on that data (I've never used or verified it works myself but the person supplying the patch said it had a purpose with a particular Postgres based database).
Re: Add Unicode Support to the DBI
On Sep 22, 2011, at 2:26 AM, Martin J. Evans wrote: > There is more than one way to encode unicode - not everyone uses UTF-8; > although some encodings don't support all of unicode. Yeah, maybe should be utf8_flag instead. > unicode is not encoded as UTF-8 in ODBC using the wide APIs. > > Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes > it. Using the ANSI APIs data is returned as octets and is whatever it is - it > may be ASCII, it may be UTF-8 encoded (only in 2 cases I know and I believe > they are flawed anyway) it may be something else in which case the > application needs to know what it is. In the case of octets which are UTF-8 > encoded DBD::ODBC has no idea that is the case unless you tell it and it will > then set the UTF-8 flag (but see later). Right. There needs to be a way to tell the DBI what encoding the server sends and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of useless. > I'm not that familiar with Postgres (I've used a few times and not to any > great degree) and I used MySQL for a while years ago. I occasionally use > SQLite. I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling > to see the problem that needs fixing. Is it just that some people would like > a DBI flag which tells the DBD: > > 1) decode any data coming back from the database strictly such that if it is > invalid you die > 2) decode any data coming back from the database loosely (e.g., utf-8 vs > UTF-8) > 3) don't decode the data from the database at all > 4) don't decode the data, the DBD knows it is say UTF-8 encoded and simply > sets the UTF-8 flag (which from what I read is horribly flawed but seems to > work for me). > > and the reverse. Yes, with one API for all drivers, if possible, and guidelines for how it should work (when to encode and decode, what to encode and decode, when to just flip the utf8 flag on and off, etc.). > DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g. > error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it > does 4 on most of the column data IF Oracle is sending UTF-8. Yeah, but to enable it *you set a bloody environment variable*. WHAT? > My point being, doesn't the DBD know how the data is encoded when it gets it > from the database? and it would hopefully know what the database needs when > sending data. Perhaps in some conditions the DBD does not know this and needs > to be told (I could imagine SQLite reading/writing straight to files for > instance might want to know to open the file with UTF-8 layer). Or to turn it off, so you can just pass the encoded UTF-8 through to the file without the decode/encode round-trip. > So is the problem that sometimes a DBD does not know what to encode data > being sent to the database or how/whether to decode data coming back from the > database? and if that is the case do we need some settings in DBI to tell a > DBD? That's an issue, yes, but the main issue is that all the drivers do it differently, sometimes with different semantics, and lack all the functionality one might want (e.g., your examples 1-4). Best, David
Re: Add Unicode Support to the DBI
David, I forgot to answer your post first and ended up putting most of my comments in a reply to Greg's posting - sorry, it was a long night last night. Some further comments below: On 21/09/11 19:44, David E. Wheeler wrote: On Sep 10, 2011, at 3:08 AM, Martin J. Evans wrote: I'm not sure any change is required to DBI to support unicode. As far as I'm aware unicode already works with DBI if the DBDs do the right thing. Right, but the problem is that, IME, none of them do "the right thing." As I said, I've submitted encoding-related bug reports for every DBD I've used in production code. And they all have different interfaces for tweaking things. So it is these "different interfaces for tweaking things" you'd like to standardise I presume. If you stick to the rule that all data Perl receives must be decoded and all data Perl exports must be encoded it works (ignoring any issues in Perl itself). Er, was there supposed to be a ", then …" statement there? I bow to Tom's experience but I'm still not sure how that applies to DBI so long as the interface between the database and Perl always encodes and decodes then the issues Tom describes are all Perl ones - no? The trouble is that: 1. They don't always encode or decode 2. When they do, the tend to get subtle bits wrong 3. And they all have different interfaces and philosophies for doing so Surely Oracle should return the data encoded as you asked for it and if it did not Oracle is broken. I'd still like to see this case and then we can see if Oracle is broken and if there is a fix for it. Oh I don't doubt that Oracle is broken. In some places DBD::Oracle does sv_utf8_decode(scalar) or SvUTF8_on(scalar) (depending on your Perl) and in some places it just does SvUTF8_on(scalar). I believe the latter is much quicker as the data is not checked. Many people (myself included) are particularly interested in DBD::Oracle being fast and if all the occurrences were changed to decode I'd patch that out in my copy as I know the data I receive is UTF-8 encoded. IME It needs an "assume Oracle is broken" knob. That is, I should have the option to enface encoding and decoding, rather than just flipping SvUTF8. And I think that such an interface should be standardized in the DBI along with detailed information for driver authors how how to "get it right." ok, I get that. See above. I'd like the chance to go with speed and take the consequences rather than go with slower but know incorrect UTF-8 is spotted. And maybe that's the default. But I should be able to tell it to be pedantic when the data is known to be bad (see, for example data from an SQL_ASCII-encoded PostgreSQL database). I thought UTF-8 when used in Perl used the strict definition and utf-8 used Perl's looser definition - see http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8 That's right. So if I want to ensure that I'm getting strict encoding in my database, It needs to encode and decode, not simply flip SvUTF8. Don't DBDs do this now? I know the encoding of the data I receive in DBD::ODBC and decode it when I get it and encode it when I send it and I believe that is what DBD::Oracle does as well. There is one exception in ODBC for drivers which don't truly abide by ODBC spec and send 8 bit data back UTF-8 encoded (see later). There is no single API for configuring this in the DBI, and I argue there should be. 
I've spent a lot of effort getting unicode working in DBD::ODBC (for UNIX and with patches from Alexander Foken for Windows) which is implemented in an awkward fashion in ODBC. I'd like to hear from DBD authors what support they already have and how it is implemented so we can see what ground is already covered and where the problems were. DBD::Pg's approach is currently broken. Greg is working on fixing it, but for compatibility reasons the fix is non-trivial (an the API might be, too). In a perfect world DBD::Pg would just always do the right thing, as the database tells it what encodings to use when you connect (and *all* data is encoded as such, not just certain data types). But the world is not perfect, there's a lot of legacy stuff. Greg, care to add any other details? as I remain unconvinced a problem exists other than incorrectly coded DBDs. I'm happy to collate that information. As a start I'll describe the DBD::ODBC: 1. ODBC has 2 sets of APIs, SQLxxxA (each chr is 8 bits) and SQLxxxW (each chr is 16 bits and UCS-2). This is how Microsoft did it and yes I know that does not support all of unicode but code pages get involved too. 2. You select which API you are using with a macro when you compile your application so you cannot change your mind. You can in theory call SQLxxxA or SQLxxxW functions directly but if you use SQLxxx you get the A or W depending on what the macro is set to. Problem: DBD::ODBC has to built one way or the other. 3. When using the SQLxxxA functions you can still bind columns/parameters as wide
Re: Add Unicode Support to the DBI
On 21/09/11 21:52, Greg Sabino Mullane wrote: -BEGIN PGP SIGNED MESSAGE- Hash: RIPEMD160 ... And maybe that's the default. But I should be able to tell it to be pedantic when the data is known to be bad (see, for example data from an SQL_ASCII-encoded PostgreSQL database). ... DBD::Pg's approach is currently broken. Greg is working on fixing it, but for compatibility reasons the fix is non-trivial (an the API might be, too). In a perfect world DBD::Pg would just always do the right thing, as the database tells it what encodings to use when you connect (and *all* data is encoded as such, not just certain data types). But the world is not perfect, there's a lot of legacy stuff. Greg, care to add any other details? My thinking on this has changed a bit. See the DBD::Pg in git head for a sample, but basically, DBD::Pg is going to: * Flip the flag on if the client_encoding is UTF-8 (and server_encoding is not SQL_ASCII) * Flip if off if not The single switch will be pg_unicode_flag, which will basiccaly override the automatic choice above, just in case you really want your SQL_ASCII byte soup marked as utf8 for some reason, or (more likely), you want your data unmarked as utf8 despite being so. This does rely on PostgreSQL doing the right thing when it comes to encoding/decoding/storing all the encodings, but I'm pretty sure it's doing well in that regard. ... Since nobody has actally defined a specific interface yet, let me throw out a straw man. It may look familiar :) === * $h->{unicode_flag} If this is set on, data returned from the database is assumed to be UTF-8, and the utf8 flag will be set. DBDs will decode the data as needed. There is more than one way to encode unicode - not everyone uses UTF-8; although some encodings don't support all of unicode. unicode is not encoded as UTF-8 in ODBC using the wide APIs. Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes it. Using the ANSI APIs data is returned as octets and is whatever it is - it may be ASCII, it may be UTF-8 encoded (only in 2 cases I know and I believe they are flawed anyway) it may be something else in which case the application needs to know what it is. In the case of octets which are UTF-8 encoded DBD::ODBC has no idea that is the case unless you tell it and it will then set the UTF-8 flag (but see later). If this is set off, the utf8 flag will never be set, and no decoding will be done on data coming back from the database. If this is not set (undefined), the underlying DBD is responsible for doing the correct thing. In other words, the behaviour is undefined. === I don't think this will fit into DBD::Pgs current implementation perfectly, as we wouldn't want people to simply leave $h->{unicode_flag} on, as that would force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever allow that. I'm not that familiar with Postgres (I've used a few times and not to any great degree) and I used MySQL for a while years ago. I occasionally use SQLite. I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling to see the problem that needs fixing. 
Is it just that some people would like a DBI flag which tells the DBD: 1) decode any data coming back from the database strictly such that if it is invalid you die 2) decode any data coming back from the database loosely (e.g., utf-8 vs UTF-8) 3) don't decode the data from the database at all 4) don't decode the data, the DBD knows it is say UTF-8 encoded and simply sets the UTF-8 flag (which from what I read is horribly flawed but seems to work for me). and the reverse. DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g. error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it does 4 on most of the column data IF Oracle is sending UTF-8. DBD::ODBC does nothing via the ANSI APIs unless the odbc_utf8 flag is turned on in which case it does 4 (and it only does this because there is apparently a version of the Postgres ODBC driver out there somewhere that returns UTF-8 encoded data, I've never seen it, I just accepted the patch). DBD::ODBC does 1 if using the wide APIs and it has little choice since no one would want to accept UCS2 and have to decode it all the time. My point being, doesn't the DBD know how the data is encoded when it gets it from the database? and it would hopefully know what the database needs when sending data. Perhaps in some conditions the DBD does not know this and needs to be told (I could imagine SQLite reading/writing straight to files for instance might want to know to open the file with UTF-8 layer). So is the problem that sometimes a DBD does not know what to encode data being sent to the database or how/whether to decode data coming back from the database? and if that is the case do we need some settings in DBI to tell a DBD? Martin -- Martin J. Evans Easysoft Limited http://www.easysoft.com
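What the wide-API path described above amounts to can be shown without ODBC at all: the octets in a wide (SQL_WCHAR) buffer are UCS-2, shown little-endian here, and Encode handles that directly. The sample bytes are U+4E2D U+6587, chosen just for illustration.

use strict;
use warnings;
use Encode ();

# U+4E2D U+6587 as UCS-2LE octets, as a wide ODBC buffer would hold them.
my $wire = "\x2D\x4E\x87\x65";

my $chars = Encode::decode('UCS-2LE', $wire, Encode::FB_CROAK);
printf "%d characters\n", length $chars;    # prints "2 characters"

# And the reverse direction when binding a parameter:
my $octets = Encode::encode('UCS-2LE', $chars, Encode::FB_CROAK);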
Re: Add Unicode Support to the DBI
On Sep 21, 2011, at 1:52 PM, Greg Sabino Mullane wrote: > Since nobody has actally defined a specific interface yet, let me throw out a > straw man. It may look familiar :) > > === > * $h->{unicode_flag} > > If this is set on, data returned from the database is assumed to be UTF-8, > and > the utf8 flag will be set. I assume you also mean to say that data sent *to* the database has the flag turned off, yes? > DBDs will decode the data as needed. I don't understand this sentence. If the flag is flipped, why will it decode? > If this is set off, the utf8 flag will never be set, and no decoding will be > done > on data coming back from the database. What if the data coming back from the database is Big5 and I want to decode it? > If this is not set (undefined), the underlying DBD is responsible for doing > the > correct thing. In other words, the behaviour is undefined. > === > > I don't think this will fit into DBD::Pgs current implementation perfectly, > as > we wouldn't want people to simply leave $h->{unicode_flag} on, as that would > force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever > allow that. You mean never allow it to be flipped when the database encoding is SQL_ASCII? Best, David
Re: Add Unicode Support to the DBI
...

> And maybe that's the default. But I should be able to tell it to be pedantic when the data is known to be bad (see, for example data from an SQL_ASCII-encoded PostgreSQL database).

...

> DBD::Pg's approach is currently broken. Greg is working on fixing it, but for compatibility reasons the fix is non-trivial (and the API might be, too). In a perfect world DBD::Pg would just always do the right thing, as the database tells it what encodings to use when you connect (and *all* data is encoded as such, not just certain data types). But the world is not perfect, there's a lot of legacy stuff.
>
> Greg, care to add any other details?

My thinking on this has changed a bit. See the DBD::Pg in git head for a sample, but basically, DBD::Pg is going to:

* Flip the flag on if the client_encoding is UTF-8 (and server_encoding is not SQL_ASCII)
* Flip it off if not

The single switch will be pg_unicode_flag, which will basically override the automatic choice above, just in case you really want your SQL_ASCII byte soup marked as utf8 for some reason, or (more likely), you want your data unmarked as utf8 despite being so.

This does rely on PostgreSQL doing the right thing when it comes to encoding/decoding/storing all the encodings, but I'm pretty sure it's doing well in that regard.

...

Since nobody has actually defined a specific interface yet, let me throw out a straw man. It may look familiar :)

===
* $h->{unicode_flag}

If this is set on, data returned from the database is assumed to be UTF-8, and the utf8 flag will be set. DBDs will decode the data as needed.

If this is set off, the utf8 flag will never be set, and no decoding will be done on data coming back from the database.

If this is not set (undefined), the underlying DBD is responsible for doing the correct thing. In other words, the behaviour is undefined.
===

I don't think this will fit into DBD::Pg's current implementation perfectly, as we wouldn't want people to simply leave $h->{unicode_flag} on, as that would force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever allow that.

--
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201109211651
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
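The automatic choice described above is easy to check against a live connection. The SHOW queries are standard PostgreSQL; pg_unicode_flag itself was only proposed at this point, so it appears here only in a comment, and the DSN is a placeholder.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=test', '', '', { RaiseError => 1 });

my ($client) = $dbh->selectrow_array('SHOW client_encoding');
my ($server) = $dbh->selectrow_array('SHOW server_encoding');

# The automatic decision sketched above: mark returned strings as utf8 only
# when the client encoding is UTF8 and the server encoding is not SQL_ASCII.
my $flip_utf8_flag = ($client eq 'UTF8' && $server ne 'SQL_ASCII') ? 1 : 0;

# The proposed single switch would simply override that decision:
# $flip_utf8_flag = $dbh->{pg_unicode_flag} if defined $dbh->{pg_unicode_flag};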
Re: Add Unicode Support to the DBI
On Sep 10, 2011, at 3:08 AM, Martin J. Evans wrote: > I'm not sure any change is required to DBI to support unicode. As far as I'm > aware unicode already works with DBI if the DBDs do the right thing. Right, but the problem is that, IME, none of them do "the right thing." As I said, I've submitted encoding-related bug reports for every DBD I've used in production code. And they all have different interfaces for tweaking things. > If you stick to the rule that all data Perl receives must be decoded and all > data Perl exports must be encoded it works (ignoring any issues in Perl > itself). Er, was there supposed to be a ", then …" statement there? > I bow to Tom's experience but I'm still not sure how that applies to DBI so > long as the interface between the database and Perl always encodes and > decodes then the issues Tom describes are all Perl ones - no? The trouble is that: 1. They don't always encode or decode 2. When they do, the tend to get subtle bits wrong 3. And they all have different interfaces and philosophies for doing so > Surely Oracle should return the data encoded as you asked for it and if it > did not Oracle is broken. > I'd still like to see this case and then we can see if Oracle is broken and > if there is a fix for it. Oh I don't doubt that Oracle is broken. > In some places DBD::Oracle does sv_utf8_decode(scalar) or SvUTF8_on(scalar) > (depending on your Perl) and in some places it just does SvUTF8_on(scalar). I > believe the latter is much quicker as the data is not checked. Many people > (myself included) are particularly interested in DBD::Oracle being fast and > if all the occurrences were changed to decode I'd patch that out in my copy > as I know the data I receive is UTF-8 encoded. IME It needs an "assume Oracle is broken" knob. That is, I should have the option to enface encoding and decoding, rather than just flipping SvUTF8. And I think that such an interface should be standardized in the DBI along with detailed information for driver authors how how to "get it right." > See above. I'd like the chance to go with speed and take the consequences > rather than go with slower but know incorrect UTF-8 is spotted. And maybe that's the default. But I should be able to tell it to be pedantic when the data is known to be bad (see, for example data from an SQL_ASCII-encoded PostgreSQL database). > I thought UTF-8 when used in Perl used the strict definition and utf-8 used > Perl's looser definition - see > http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8 That's right. So if I want to ensure that I'm getting strict encoding in my database, It needs to encode and decode, not simply flip SvUTF8. > Don't DBDs do this now? I know the encoding of the data I receive in > DBD::ODBC and decode it when I get it and encode it when I send it and I > believe that is what DBD::Oracle does as well. There is one exception in ODBC > for drivers which don't truly abide by ODBC spec and send 8 bit data back > UTF-8 encoded (see later). There is no single API for configuring this in the DBI, and I argue there should be. > I've spent a lot of effort getting unicode working in DBD::ODBC (for UNIX and > with patches from Alexander Foken for Windows) which is implemented in an > awkward fashion in ODBC. I'd like to hear from DBD authors what support they > already have and how it is implemented so we can see what ground is already > covered and where the problems were. DBD::Pg's approach is currently broken. 
Greg is working on fixing it, but for compatibility reasons the fix is non-trivial (and the API might be, too). In a perfect world DBD::Pg would just always do the right thing, as the database tells it what encodings to use when you connect (and *all* data is encoded as such, not just certain data types). But the world is not perfect; there's a lot of legacy stuff.

Greg, care to add any other details?

> as I remain unconvinced a problem exists other than incorrectly coded DBDs.
> I'm happy to collate that information. As a start I'll describe DBD::ODBC:
>
> 1. ODBC has 2 sets of APIs, SQLxxxA (each char is 8 bits) and SQLxxxW (each
> char is 16 bits and UCS-2). This is how Microsoft did it, and yes, I know
> that does not support all of unicode, but code pages get involved too.
>
> 2. You select which API you are using with a macro when you compile your
> application, so you cannot change your mind. You can in theory call SQLxxxA
> or SQLxxxW functions directly, but if you use SQLxxx you get the A or W
> depending on what the macro is set to. Problem: DBD::ODBC has to be built
> one way or the other.
>
> 3. When using the SQLxxxA functions you can still bind columns/parameters
> as wide characters, but the ODBC driver needs to support this.
>
> 4. When using SQLxxxW functions all strings are expected in UCS-2. You can
> bind columns and parameters as whatever type you like but obviously if you
> bind a unicode
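The strict-versus-lax distinction being argued above is easy to demonstrate at the Perl level. A small sketch follows; the byte string is invented, and Encode::_utf8_on is used only as the pure-Perl stand-in for the XS-level SvUTF8_on under discussion:

--8<---
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

my $bytes = "caf\xC3\xA9 \xFF";   # valid UTF-8 for "café", then a stray 0xFF

# What SvUTF8_on amounts to: flip the flag with no validation. The scalar
# now *claims* to hold well-formed utf8, stray byte and all, and will
# misbehave later, far from the fetch that produced it.
my $flipped = $bytes;
Encode::_utf8_on($flipped);

# Strict decoding catches the bad byte at the boundary instead.
my $checked = eval { decode('UTF-8', $bytes, FB_CROAK) };
warn "bad data from the database: $@" unless defined $checked;
--8<---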
Re: Add Unicode Support to the DBI
On Sep 10, 2011, at 7:44 AM, Lyle wrote:

>> Right now 5.8 is the required minimum for DBI: should we consider bumping
>> this?
>
> I know a lot of servers in the wild are still running RHEL5 and its
> variants, which are stuck on 5.8 in the standard package management. The
> new RHEL6 only has 5.10... So at this time the impact of such a change
> could be significant.

Yes, which is why we can't just impose a solution on people.

Best,

David
Re: Add Unicode Support to the DBI
DBI peeps,

Sorry for the delayed response; I've been busy. Looking to reply to this thread now.

On Sep 9, 2011, at 8:06 PM, Greg Sabino Mullane wrote:

> One thing I see bandied about a lot is that Perl 5.14 is highly preferred.
> However, it's not clear exactly what the gains are and how bad 5.12 is
> compared to 5.14, how bad 5.10 is, how bad 5.8 is, etc. Right now 5.8 is
> the required minimum for DBI: should we consider bumping this? I know TC
> would be horrified to see us attempting to talk about Unicode support
> with a 5.8.1 requirement, but how much of that will affect database
> drivers? I have no idea myself.

I think I'd just follow TC's recommendations here. DBI should stay compatible as far back as is reasonable without unduly affecting further development and improvement (not that there's much of that right now). So if proper encoding is important to you, use at least 5.12 and prefer 5.14. And if proper encoding is not important to you, well, it is, you just don't know it yet.

> Another aspect to think about that came up during some offline DBD::Pg
> talks was the need to support legacy scripts and legacy data. While the
> *correct* thing is to blaze forward and Do Things Correctly everywhere,
> I think we at least need some prominent knobs so that we can maintain
> backwards compatibility for existing scripts that expect a bunch of
> Latin1, or need the data to come back in the current, undecoded,
> un-utf8-flagged way.

Agreed. I suspect the existing behavior should remain the default, with a knob to make it "do things correctly," and perhaps a deprecation plan to turn the "correctly" knob on by default in a year or so.

Best,

David
Re: Add Unicode Support to the DBI
On 10/09/2011 04:06, Greg Sabino Mullane wrote:

> Right now 5.8 is the required minimum for DBI: should we consider bumping
> this?

I know a lot of servers in the wild are still running RHEL5 and its variants, which are stuck on 5.8 in the standard package management. The new RHEL6 only has 5.10... So at this time the impact of such a change could be significant.

Lyle
Re: Add Unicode Support to the DBI
On 10/09/2011 03:52, David E. Wheeler wrote:

> DBIers,
>
> tl;dr: I think it's time to add proper Unicode support to the DBI. What do
> you think it should look like?

I'm not sure any change is required to DBI to support unicode. As far as I'm aware unicode already works with DBI if the DBDs do the right thing. If you stick to the rule that all data Perl receives must be decoded and all data Perl exports must be encoded, it works (ignoring any issues in Perl itself).

> Background
>
> I've brought this up a time or two in the past, but a number of things have
> happened lately to make me think that it was again time:
>
> First, on the DBD::Pg list, we've been having a discussion about improving
> the DBD::Pg encoding interface.
>
> http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html
>
> That design discussion followed on the extended discussion in this bug
> report:
>
> https://rt.cpan.org/Ticket/Display.html?id=40199
>
> Seems that the pg_enable_utf8 flag that's been in DBD::Pg for a long time
> is rather broken in a few ways. Notably, PostgreSQL sends *all* data back
> to clients in a single encoding -- even binary data (which is usually
> hex-encoded). So it made no sense to only decode certain columns. How to go
> about fixing it, though, and adding a useful interface, has proven a bit
> tricky.
>
> Then there was Tom Christiansen's StackOverflow comment:
>
> stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129

I bow to Tom's experience but I'm still not sure how that applies to DBI. So long as the interface between the database and Perl always encodes and decodes, the issues Tom describes are all Perl ones - no?

> This made me realize that Unicode handling is much trickier than I ever
> realized. But it also emphasized for me how important it is to do
> everything one can to do Unicode right.
>
> Tom followed up with a *lot* more detail in three OSCON presentations this
> year, all of which you can read here:
>
> http://98.245.80.27/tcpc/OSCON2011/index.html
>
> (You're likely gonna want to install the fonts linked at the bottom of that
> page before you read the presentations in HTML.)
>
> And finally, I ran into an issue recently with Oracle, where we have an
> Oracle database that should have only UTF-8 data but some row values are
> actually in other encodings. This was a problem because I told DBD::Oracle
> that the encoding was Unicode, and it just blindly turned on the Perl utf8
> flag. So I got broken data back from the database and then my app crashed
> when I tried to act on a string with the utf8 flag on but containing
> non-unicode bytes. I reported this issue in a DBD::Oracle bug report:
>
> https://rt.cpan.org/Public/Bug/Display.html?id=70819

Surely Oracle should return the data encoded as you asked for it, and if it did not, Oracle is broken. I'd still like to see this case and then we can see if Oracle is broken and if there is a fix for it.

In some places DBD::Oracle does sv_utf8_decode(scalar) or SvUTF8_on(scalar) (depending on your Perl) and in some places it just does SvUTF8_on(scalar). I believe the latter is much quicker as the data is not checked. Many people (myself included) are particularly interested in DBD::Oracle being fast, and if all the occurrences were changed to decode I'd patch that out in my copy, as I know the data I receive is UTF-8 encoded.

> But all this together leads me to believe that it's time to examine adding
> explicit Unicode support to the DBI. But it needs to be designed as
> carefully as possible to account for a few key points:
>
> * The API must be as straightforward as possible without sacrificing
> necessary flexibility. I think it should mostly stay out of users' way and
> have reasonable defaults. But it should be clear what each knob we offer
> does and how it affects things. Side-effects should be avoided.
>
> * Ability to enforce the correctness of encoding and decoding must be given
> priority. Perl has pretty specific ideas about what is and is not Unicode,
> so we should respect that as much as possible. If that means encoding and
> decoding rather than just flipping the utf8 bit, then fine.

See above. I'd like the chance to go with speed and take the consequences, rather than go slower but know incorrect UTF-8 is spotted.

> * The performance impact must be kept as minimal as possible. So if we can
> get away with just flipping the UTF-8 bit on and off, it should be so. I'm
> not entirely clear on that, though, since Perl's internal representation,
> called "utf8", is not the same thing as UTF-8. But if there's an efficient
> way to convert between the two, then it should be adopted. For other
> encodings, obviously a full encode/decode path must be followed.

I thought UTF-8 when used in Perl used the strict definition and utf-8 used Perl's looser definition - see http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8

> * Drivers must be able to adopt the API in a straight-forward way. That is
> to say, we need to
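For reference, the decode-on-input / encode-on-output rule Martin opens with looks like this in application code. Everything here is illustrative: the driver name, table, and credentials are made up, and the driver is assumed to do no encoding work of its own:

--8<---
use strict;
use warnings;
use DBI;
use Encode qw(encode decode);

my $dbh = DBI->connect('dbi:SomeDriver:testdb', 'user', 'pass',
                       { RaiseError => 1 });

# Encode on the way out: Perl characters -> UTF-8 bytes.
my $name = "Zo\x{eb}";    # "Zoë" as a Perl character string
$dbh->do('INSERT INTO people (name) VALUES (?)', undef,
         encode('UTF-8', $name));

# Decode on the way in: UTF-8 bytes -> Perl characters.
my ($raw) = $dbh->selectrow_array(
    'SELECT name FROM people WHERE name = ?', undef,
    encode('UTF-8', $name));
my $person = decode('UTF-8', $raw);
--8<---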
Re: Add Unicode Support to the DBI
On Sat, 10 Sep 2011 03:06:49 -0000, "Greg Sabino Mullane" wrote:

> One thing I see bandied about a lot is that Perl 5.14 is highly preferred.
> However, it's not clear exactly what the gains are and how bad 5.12 is
> compared to 5.14, how bad 5.10 is, how bad 5.8 is, etc. Right now 5.8 is
> the required minimum for DBI: should we consider bumping this? I know TC
> would be horrified to see us attempting to talk about Unicode support
> with a 5.8.1 requirement, but how much of that will affect database
> drivers? I have no idea myself.

Unicode 6.0 and Unicode improvements in general are *THE* reason for me (our company) to plan a 5.10.1 -> 5.14.2 update. I use Unicode a lot, and we require 5.8.4 as an absolute minimum when dealing with Unicode. 5.8.1 is not good enough.

> Another aspect to think about that came up during some offline DBD::Pg
> talks was the need to support legacy scripts and legacy data. While the
> *correct* thing is to blaze forward and Do Things Correctly everywhere,
> I think we at least need some prominent knobs so that we can maintain
> backwards compatibility for existing scripts that expect a bunch of
> Latin1, or need the data to come back in the current, undecoded,
> un-utf8-flagged way.

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00,
11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3.
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Re: Add Unicode Support to the DBI
Another wrinkle to this is the fact that identifiers in the database, such as column names, are also character data, and have an encoding. So for any DBMS that supports Unicode identifiers (as I believe a complete one should, even if they have to be quoted in SQL) or identifiers with trans-ASCII characters, we have to account for those too, making sure that Perl-side code correctly matches (or doesn't match) those identifiers, and so on.

-- Darren Duncan
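The same boundary problem does show up in metadata. A sketch of where an undecoded column name would bite; the driver, table, and column name here are hypothetical:

--8<---
use strict;
use warnings;
use DBI;
use Encode qw(decode);

# Hypothetical: a driver that returns metadata as raw bytes, and a table
# created with a non-ASCII column name, e.g.
#   CREATE TABLE menu ("prix_fixé" NUMERIC)
my $dbh = DBI->connect('dbi:SomeDriver:testdb', 'user', 'pass',
                       { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT * FROM menu');
$sth->execute;

# $sth->{NAME} holds the returned column names. If the driver hands them
# back undecoded, this compares UTF-8 bytes to Perl characters, and the
# match silently fails. (If the driver *did* decode, the second decode
# would be the wrong thing -- which is exactly the undefined-behaviour
# problem being discussed.)
my ($col)  = @{ $sth->{NAME} };
my $wanted = "prix_fix\x{e9}";
print "bytes matched\n"   if $col eq $wanted;                  # likely fails
print "decoded matched\n" if decode('UTF-8', $col) eq $wanted; # boundary fix
--8<---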
Re: Add Unicode Support to the DBI
One thing I see bandied about a lot is that Perl 5.14 is highly preferred. However, it's not clear exactly what the gains are and how bad 5.12 is compared to 5.14, how bad 5.10 is, how bad 5.8 is, etc. Right now 5.8 is the required minimum for DBI: should we consider bumping this? I know TC would be horrified to see us attempting to talk about Unicode support with a 5.8.1 requirement, but how much of that will affect database drivers? I have no idea myself.

Another aspect to think about that came up during some offline DBD::Pg talks was the need to support legacy scripts and legacy data. While the *correct* thing is to blaze forward and Do Things Correctly everywhere, I think we at least need some prominent knobs so that we can maintain backwards compatibility for existing scripts that expect a bunch of Latin1, or need the data to come back in the current, undecoded, un-utf8-flagged way.

--
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
Add Unicode Support to the DBI
DBIers,

tl;dr: I think it's time to add proper Unicode support to the DBI. What do you think it should look like?

Background

I've brought this up a time or two in the past, but a number of things have happened lately to make me think that it was again time:

First, on the DBD::Pg list, we've been having a discussion about improving the DBD::Pg encoding interface.

http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html

That design discussion followed on the extended discussion in this bug report:

https://rt.cpan.org/Ticket/Display.html?id=40199

Seems that the pg_enable_utf8 flag that's been in DBD::Pg for a long time is rather broken in a few ways. Notably, PostgreSQL sends *all* data back to clients in a single encoding -- even binary data (which is usually hex-encoded). So it made no sense to only decode certain columns. How to go about fixing it, though, and adding a useful interface, has proven a bit tricky.

Then there was Tom Christiansen's StackOverflow comment:

stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129

This made me realize that Unicode handling is much trickier than I ever realized. But it also emphasized for me how important it is to do everything one can to do Unicode right. Tom followed up with a *lot* more detail in three OSCON presentations this year, all of which you can read here:

http://98.245.80.27/tcpc/OSCON2011/index.html

(You're likely gonna want to install the fonts linked at the bottom of that page before you read the presentations in HTML.)

And finally, I ran into an issue recently with Oracle, where we have an Oracle database that should have only UTF-8 data but some row values are actually in other encodings. This was a problem because I told DBD::Oracle that the encoding was Unicode, and it just blindly turned on the Perl utf8 flag. So I got broken data back from the database, and then my app crashed when I tried to act on a string with the utf8 flag on but containing non-unicode bytes. I reported this issue in a DBD::Oracle bug report:

https://rt.cpan.org/Public/Bug/Display.html?id=70819

But all this together leads me to believe that it's time to examine adding explicit Unicode support to the DBI. But it needs to be designed as carefully as possible to account for a few key points:

* The API must be as straightforward as possible without sacrificing necessary flexibility. I think it should mostly stay out of users' way and have reasonable defaults. But it should be clear what each knob we offer does and how it affects things. Side-effects should be avoided.

* Ability to enforce the correctness of encoding and decoding must be given priority. Perl has pretty specific ideas about what is and is not Unicode, so we should respect that as much as possible. If that means encoding and decoding rather than just flipping the utf8 bit, then fine.

* The performance impact must be kept as minimal as possible. So if we can get away with just flipping the UTF-8 bit on and off, it should be so. I'm not entirely clear on that, though, since Perl's internal representation, called "utf8", is not the same thing as UTF-8. But if there's an efficient way to convert between the two, then it should be adopted. For other encodings, obviously a full encode/decode path must be followed.

* Drivers must be able to adopt the API in a straight-forward way. That is to say, we need to make sure that the interface covers what most (all?) drivers need. Some, like DBD::Pg, can specify that only one encoding come back from the database. Maybe others (DBD::mysql) can have individual columns in different encodings? It needs to cover that case, too.

* It must be able to give the drivers some flexibility. Where we can't account for everything that all drivers need forever, we should make it possible for them to add what they need without changing the overall API or the meaning of the interfaces provided by the DBI.

I'm not at all clear what such an API should look like. Based on my extensive experience with DBD::Pg, a fair amount of experience with DBD::SQLite, and limited experience with DBD::Oracle and DBD::mysql, I'd say it'd be useful to have at least these knobs (a usage sketch follows the list):

1. An attribute indicating the database encoding. This is the encoding one expects all data coming from the database to be in. When this is set, the DBI or the driver would decode incoming data to Perl's internal format and encode data sent to the database.

2. A fourth param to bind_param() to indicate the encoding in which to send column data to the database. Defaults to the database encoding.

3. A new parameter to prepare() to indicate the encodings of specific columns to be selected.

4. An ENCODING attribute on statement handles that indicates the encoding of each column.

This is just a preliminary proposal, but covers most of the basics, I think. (I'm sure I'm suggesting the wrong places for some things.) It does as
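For concreteness, here is what using the four proposed knobs might look like. None of these attributes or parameters exist in DBI today; every name below (dbi_encoding, ColumnEncodings, the fourth bind_param argument, ENCODING) is a straw-man assumption drawn from the list above:

--8<---
use strict;
use warnings;
use DBI;

# 1. A handle-level attribute naming the database encoding: incoming
#    data is decoded from it, outgoing data encoded to it.
my $dbh = DBI->connect('dbi:SomeDriver:testdb', 'user', 'pass', {
    RaiseError   => 1,
    dbi_encoding => 'UTF-8',      # proposed knob 1 (hypothetical name)
});

# 3. A prepare() attribute giving per-column encodings for the select.
my $sth = $dbh->prepare(
    'SELECT name, blob FROM stuff WHERE id = ?',
    { ColumnEncodings => ['UTF-8', undef] },   # undef = raw bytes
);

# 2. A fourth bind_param() argument naming this parameter's encoding,
#    defaulting to the database encoding. (Real DBI bind_param takes
#    only three arguments; the fourth is the proposal.)
$sth->bind_param(1, 42, undef, 'UTF-8');

$sth->execute;

# 4. A statement-handle ENCODING attribute reporting each column's
#    encoding, parallel to $sth->{NAME}.
my @encodings = @{ $sth->{ENCODING} };
--8<---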