Re: Add Unicode Support to the DBI
On Wed, 09 Nov 2011 19:41:33 +, "Martin J. Evans" wrote:
> You're going to have a lot of problems with this test code and DBD::Unify
> as we previously discovered that DBD::Unify does not decode the data
> coming back from the database itself but it can be decoded by any Perl
> script using DBD::Unify into the correct data. Any chance you could
> change the test code to print out the results of type_info_all for
> DBD::Unify and send me them?

Right, I have this:

--8<---
$type_info_all = [
  { TYPE_NAME          => 0,
    DATA_TYPE          => 1,
    COLUMN_SIZE        => 2,
    LITERAL_PREFIX     => 3,
    LITERAL_SUFFIX     => 4,
    CREATE_PARAMS      => 5,
    NULLABLE           => 6,
    CASE_SENSITIVE     => 7,
    SEARCHABLE         => 8,
    UNSIGNED_ATTRIBUTE => 9,
    FIXED_PREC_SCALE   => 10,
    AUTO_UNIQUE_VALUE  => 11,
    LOCAL_TYPE_NAME    => 12,
    MINIMUM_SCALE      => 13,
    MAXIMUM_SCALE      => 14,
    SQL_DATA_TYPE      => 15,
    SQL_DATETIME_SUB   => 16,
    NUM_PREC_RADIX     => 17,
    INTERVAL_PRECISION => 18,
    },

  # TYPE_NAME           DATA_TYPE        SIZE  PFX   SFX   PARAMS             N C S UNSIG FPS AUTO LOCAL MINSC MAXSC SDT SDS RADIX PREC
  [ "UNKNOWN",          0,               undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
 #[ "GIANT AMOUNT",     undef,           undef,undef,undef,"PRECISION,SCALE", 1,0,3,0,undef,undef,undef,2,2,undef,undef,undef,undef, ],
  [ "HUGE AMOUNT",      -207,            undef,undef,undef,"PRECISION,SCALE", 1,0,3,0,undef,undef,undef,2,2,undef,undef,undef,undef, ],
  [ "AMOUNT",           -206,            undef,undef,undef,"PRECISION,SCALE", 1,0,3,0,undef,undef,undef,2,2,undef,undef,undef,undef, ],
  [ "VARBINARY",        SQL_VARBINARY,   undef,"'","'",undef,                 1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
  [ "BINARY",           SQL_BINARY,      undef,undef,undef,undef,             1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "CHAR",             SQL_CHAR,        undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "CURRENCY",         -218,            undef,undef,undef,"PRECISION,SCALE", 1,0,3,0,2,undef,undef,0,8,undef,undef,undef,undef, ],
  [ "TIMESTAMP",        SQL_TIMESTAMP,   undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
  [ "DATE",             SQL_DATE,        undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
  [ "DECIMAL",          SQL_DECIMAL,     undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "DOUBLE PRECISION", SQL_DOUBLE,      undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "FLOAT",            SQL_FLOAT,       undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "HUGE INTEGER",     SQL_BIGINT,      undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,0,undef,undef,undef,undef, ],
  [ "INTEGER",          SQL_INTEGER,     undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "NUMERIC",          SQL_NUMERIC,     undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,0,undef,undef,undef,undef, ],
  [ "REAL",             SQL_REAL,        undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,undef,undef,undef,undef,undef, ],
  [ "SMALLINT",         SQL_SMALLINT,    undef,undef,undef,"PRECISION",       1,0,3,0,undef,undef,undef,0,0,undef,undef,undef,undef, ],
  [ "TEXT",             SQL_LONGVARCHAR, undef,"'","'",undef,                 1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
  [ "TIME",             SQL_TIME,        undef,undef,undef,undef,             1,0,3,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef, ],
  ];

my %odbc_types = map { ( $_->[0] => $_->[1], $_->[1] => $_->[0] ) }
  [ -5 => "BIGINT"       ], # SQL_BIGINT
  [ -3 => "VARBINARY"    ], # SQL_VARBINARY
  [ -2 => "BINARY"       ], # SQL_BINARY
  [ -1 => "TEXT"         ], # SQL_LONGVARCHAR
  [  0 => "UNKNOWN_TYPE" ], # SQL_UNKNOWN_TYPE
  [  1 => "CHAR"         ], # SQL_CHAR
  [  2 => "NUMERIC"      ], # SQL_NUMERIC
  [  3 => "DECIMAL"      ], # SQL_DECIMAL
  [  4 => "INTEGER"      ], # SQL_INTEGER
  [  5 => "SMALLINT"     ], # SQL_SMALLINT
  [  6 => "FLOAT"        ], # SQL_FLOAT
  [  7 => "REAL"         ], # SQL_REAL
  [  8 => "DOUBLE
Re: Add Unicode Support to the DBI
On Wed, 09 Nov 2011 19:41:33 +, "Martin J. Evans" wrote: tl;dr; > On 09/11/2011 15:49, H.Merijn Brand wrote: > > On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" > > wrote: > > > >> I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. > >> > >> It won't run right now without changing the do_connect sub as you have > >> to specify how to connect to the DB. > >> Also, there is a DBD specific section at the start where you might have > >> to add a DBD it does not know about (anything other than DBD::ODBC, > >> DBD::Oracle, DBD::SQLite, DBD::CSV, DBD::mysql) if it needs something > >> other than the defaults e.g., the name of the length function in SQL, > >> the column type for unicode columns and binary columns, the setting to > >> enable UTF8/Unicode support. It could be a bit of a pain if your DBD > >> does not support type_info_all but I'm around on irc and in this list if > >> anyone wants any help making it work. > >> > >> It needs rather a lot of tidying up so I'm not putting it forward as > >> code-of-the-year but it is a start. > >> > >> BTW, you'll need Test::More::UTF8 and perhaps a couple of other non core > >> modules to run it. > >> > >> Martin > > I'll have some deeper look at both Unify and CSV ... > > Attached is a revised version of the script (first argument is the > > driver to test, some need more work) > > > > $ perl /tmp/unicode_test.pl Unify > > # Driver DBD::Unify-0.78 > > # Using DBMS_NAME 'Unify DataServer' > > # Using DBMS_VER undef > > # Using DRIVER_NAME '/pro/asql/v83I/lib/perl/5.10.1/DBD/Unify.pm' > > # Using DRIVER_VER '00.78.' > > # SQL_IDENTIFIER_CASE 3 > > # LANGDIR = dutch > > print() on closed filehandle $fh at /tmp/unicode_test.pl line 438. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > # Found type (HUGE AMOUNT) size= > > Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > # Found type (AMOUNT) size= > > Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > # Found type (BINARY) size= > > Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. > > Use of uninitialized value in concatenation (.) or string at > > /tmp/unicode_test.pl line 421. > > : > > : > > # Found type (TIME) size= > > Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. > > DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for > > Statement "create table fredĀ ( a int)"] at /tmp/unicode_test.pl line 214. > > not ok 1 - unicode table name supported > > # Failed test 'unicode table name supported' > > # at /tmp/unicode_test.pl line 216. > > # died: DBD::Unify::db prepare failed: Syntax error in SQL dynamic > > statement. [for Statement "create table fredÄ ( a int)"] at > > /tmp/unicode_test.pl line 214. > > ok 2 # skip Failed to create unicode table name > > ok 3 # skip Failed to create unicode table name > > DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. 
[for > > Statement "create table fred ( daveĀ int)"] at /tmp/unicode_test.pl line > > 214. > > not ok 4 - unicode column name supported > > Your going to have a lot of problems with this test code and DBD::Unify > as we previously discovered that DBD::Unify does not decode the data I can change that as author and maintainer of this driver (if I want) > coming back from the database itself but it can be decoded by any Perl > script using DBD::Unify into the correct data. Any chance you could > change the test code to print out the results of type_info_all for > DBD::Unify and send me them? # Driver DBD::Unify-0.78 # Using DBMS_NAME 'Unify DataServer' # Using DBMS_VER undef # Using DRIVER_NAME '/pro/asql/v83I/lib/perl/5.10.1/DBD/Unify.pm' # Using DRIVER_VER '00.78.' # SQL_IDENTIFIER_CASE 3 # LANGDIR = dutch [ { AUTO_UNIQUE_VALUE => 11, CASE_SENSITIVE => 7, COLUMN_SIZE => 2, CREATE_PARAMS=> 5, DATA_TYPE=> 1, FIXED_PREC_SCALE => 10, INTERVAL_PRECISION => 18, LITERAL_PREFIX => 3, LITERAL_SUFFIX => 4, LOCAL_TYPE_NAME => 12, MAXIMUM_SCALE=> 14, MINIMUM_SCALE=> 13, NULLABLE => 6, NUM_PREC_RADIX => 17, SEARCHABLE => 8, SQL_DATA_TYPE=> 15, SQL_DATETIME_SUB => 16, TYPE_NAME=> 0, UNSIGNED_ATTRIBUTE => 9 }
Re: Add Unicode Support to the DBI
On 09/11/2011 15:49, H.Merijn Brand wrote: On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" wrote: I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. It won't run right now without changing the do_connect sub as you have to specify how to connect to the DB. Also, there is a DBD specific section at the start where you might have to add a DBD it does not know about (anything other than DBD::ODBC, DBD::Oracle, DBD::SQLite, DBD::CSV, DBD::mysql) if it needs something other than the defaults e.g., the name of the length function in SQL, the column type for unicode columns and binary columns, the setting to enable UTF8/Unicode support. It could be a bit of a pain if your DBD does not support type_info_all but I'm around on irc and in this list if anyone wants any help making it work. It needs rather a lot of tidying up so I'm not putting it forward as code-of-the-year but it is a start. BTW, you'll need Test::More::UTF8 and perhaps a couple of other non core modules to run it. Martin I'll have some deeper look at both Unify and CSV ... Attached is a revised version of the script (first argument is the driver to test, some need more work) $ perl /tmp/unicode_test.pl Unify # Driver DBD::Unify-0.78 # Using DBMS_NAME 'Unify DataServer' # Using DBMS_VER undef # Using DRIVER_NAME '/pro/asql/v83I/lib/perl/5.10.1/DBD/Unify.pm' # Using DRIVER_VER '00.78.' # SQL_IDENTIFIER_CASE 3 # LANGDIR = dutch print() on closed filehandle $fh at /tmp/unicode_test.pl line 438. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (HUGE AMOUNT) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (AMOUNT) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (BINARY) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. : : # Found type (TIME) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fredĀ ( a int)"] at /tmp/unicode_test.pl line 214. not ok 1 - unicode table name supported # Failed test 'unicode table name supported' # at /tmp/unicode_test.pl line 216. # died: DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fredÄ ( a int)"] at /tmp/unicode_test.pl line 214. ok 2 # skip Failed to create unicode table name ok 3 # skip Failed to create unicode table name DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fred ( daveĀ int)"] at /tmp/unicode_test.pl line 214. not ok 4 - unicode column name supported Your going to have a lot of problems with this test code and DBD::Unify as we previously discovered that DBD::Unify does not decode the data coming back from the database itself but it can be decoded by any Perl script using DBD::Unify into the correct data. 
Any chance you could change the test code to print out the results of type_info_all for DBD::Unify and send me them?

I'd like to fix all those "Use of uninitialized value" warnings so I'll look into that. The test code has changed since then, so the line numbers no longer match; I'll have to compare them with your attachment. I'm guessing they are all in the processing of the results from type_info_all.

To be honest, this is not a unicode/encoding issue but one of my biggest annoyances when working with DBDs: finding the right column type for a table is really hit and miss - many DBDs don't support type_info_all and some that do only provide a subset of the columns DBI defines.

Somewhere (I've lost it right now but I'll find it later) I have a document which details my experiences of trying to write DBD-independent code which worked with DBD::mysql, DBD::DB2 and DBD::Oracle. In the end we gave up - it was just too difficult, and it wasn't just the different SQL syntax/support. We ended up putting all the logic into procedures/functions and simply calling them from Perl, which avoided the SQL syntax issues and loads of other differences. Perhaps I'll go back to that if I find some time.

$ perl /tmp/unicode_test.pl
# Driver DBD::SQLite-1.33
# Using DBMS_NAME 'SQLite'
# Using DBMS_VER '3.7.6.3'
# Using DRIVER_NAME undef
# Using DRIVER_VER undef
# SQL_IDENTIFIER_CASE undef
# LANGDIR = dutch
prin
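On the type_info_all complaint above, a minimal sketch of the kind of defensive lookup a driver-independent script ends up doing is shown below. The function name, the preference for SQL_WVARCHAR and the VARCHAR(100) fallback are illustrative choices for this sketch, not something unicode_test.pl is known to do.

use DBI qw(:sql_types);

# Ask the driver which types it has; fall back to a plain VARCHAR when
# type_info_all is missing or reports too few columns to be useful.
sub find_char_type {
    my ($dbh) = @_;
    my $tia = $dbh->type_info_all;
    return 'VARCHAR(100)' unless $tia && @$tia > 1;
    my %idx = %{ $tia->[0] };                        # column name => array index
    my ($name_i, $type_i) = @idx{qw(TYPE_NAME DATA_TYPE)};
    return 'VARCHAR(100)' unless defined $name_i && defined $type_i;
    my %by_type;
    for my $row (@{$tia}[1 .. $#{$tia}]) {
        my $dt = $row->[$type_i];
        $by_type{$dt} //= $row->[$name_i] if defined $dt;
    }
    # prefer a wide varchar if the driver reports one, else a plain varchar
    return $by_type{ SQL_WVARCHAR() } // $by_type{ SQL_VARCHAR() } // 'VARCHAR(100)';
}

A script would then build its test table from whatever name comes back, e.g. "create table fred (a $type)".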
Re: Add Unicode Support to the DBI
On Wed, 9 Nov 2011 16:23:53 +, Tim Bunce wrote: > On Wed, Nov 09, 2011 at 04:50:29PM +0100, H.Merijn Brand wrote: > > On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" > > wrote: > > > > > I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. > > > > So now attached > > Any chance you could rework your changes into the (recently updated) > version in the DBI svn repo? Yes, after I verify why DBD::Unify messes up on this (FWIW, the number of Unify customers is getting smaller and smaller, so the motivation to improve on DBD::Unify is diminishing) > Tim. -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00, 11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Re: Add Unicode Support to the DBI
On Wed, Nov 09, 2011 at 04:50:29PM +0100, H.Merijn Brand wrote: > On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" > wrote: > > > I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. > > So now attached Any chance you could rework your changes into the (recently updated) version in the DBI svn repo? Tim.
Re: Add Unicode Support to the DBI
On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" wrote: > I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. So now attached -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00, 11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/ unicode_test.pl Description: Perl program
Re: Add Unicode Support to the DBI
On Tue, 08 Nov 2011 21:12:13 +, "Martin J. Evans" wrote: > I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. > > It won't run right now without changing the do_connect sub as you have > to specify how to connect to the DB. > Also, there is a DBD specific section at the start where you might have > to add a DBD it does not know about (anything other than DBD::ODBC, > DBD::Oracle, DBD::SQLite, DBD::CSV, DBD::mysql) if it needs something > other than the defaults e.g., the name of the length function in SQL, > the column type for unicode columns and binary columns, the setting to > enable UTF8/Unicode support. It could be a bit of a pain if your DBD > does not support type_info_all but I'm around on irc and in this list if > anyone wants any help making it work. > > It needs rather a lot of tidying up so I'm not putting it forward as > code-of-the-year but it is a start. > > BTW, you'll need Test::More::UTF8 and perhaps a couple of other non core > modules to run it. > > Martin I'll have some deeper look at both Unify and CSV ... Attached is a revised version of the script (first argument is the driver to test, some need more work) $ perl /tmp/unicode_test.pl Unify # Driver DBD::Unify-0.78 # Using DBMS_NAME 'Unify DataServer' # Using DBMS_VER undef # Using DRIVER_NAME '/pro/asql/v83I/lib/perl/5.10.1/DBD/Unify.pm' # Using DRIVER_VER '00.78.' # SQL_IDENTIFIER_CASE 3 # LANGDIR = dutch print() on closed filehandle $fh at /tmp/unicode_test.pl line 438. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (HUGE AMOUNT) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (AMOUNT) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. # Found type (BINARY) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. Use of uninitialized value in concatenation (.) or string at /tmp/unicode_test.pl line 421. : : # Found type (TIME) size= Use of uninitialized value in string eq at /tmp/unicode_test.pl line 422. DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fredĀ ( a int)"] at /tmp/unicode_test.pl line 214. not ok 1 - unicode table name supported # Failed test 'unicode table name supported' # at /tmp/unicode_test.pl line 216. # died: DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fredÄ ( a int)"] at /tmp/unicode_test.pl line 214. ok 2 # skip Failed to create unicode table name ok 3 # skip Failed to create unicode table name DBD::Unify::db prepare failed: Syntax error in SQL dynamic statement. [for Statement "create table fred ( daveĀ int)"] at /tmp/unicode_test.pl line 214. not ok 4 - unicode column name supported $ perl /tmp/unicode_test.pl # Driver DBD::SQLite-1.33 # Using DBMS_NAME 'SQLite' # Using DBMS_VER '3.7.6.3' # Using DRIVER_NAME undef # Using DRIVER_VER undef # SQL_IDENTIFIER_CASE undef # LANGDIR = dutch print() on closed filehandle $fh at /tmp/unicode_test.pl line 438. 
ok 1 - unicode table name supported
ok 2 - unicode table found in unqualified table_info
ok 3 - unicode table found by qualified table_info
ok 4 - unicode column name supported
ok 5 - unicode column found in unqualified column_info
ok 6 - unicode column found by qualified column_info
ok 7 - table for unicode data
ok 8 - insert unicode data into table
ok 9 - unicode data out = unicode data in, no where
ok 10 - length of output data the same
ok 11 - db length of unicode data correct
ok 12 - select unicode data via parameterised where
ok 13 - select unicode data via inline where
ok 14 - table for unicode data
ok 15 - insert unicode data and blob into table
ok 16 - unicode data out = unicode data in, no where with blob
ok 17 - utf8 flag not set on blob data
Use of uninitialized value in string eq at /tmp/unicode_test.pl line 373.
ok 18 - retrieved blob = inserted blob
ok 19 - test table for unicode parameter markers
DBD::SQLite::st bind_param failed: Unknown named parameter: fred€ [for Statement "insert into fred (a) values (:fred€)"] at /tmp/unicode_test.pl line 390.
not ok 20 - bind parameter with unicode parameter marker
# Failed test 'bind parameter with unicode parameter marker'
# at /tmp/unicode_test.pl line 392.
# died: DBD::SQLite::st bind_param failed: Unknown named parameter: fred€ [for Statement "insert into fred (a) values (:fred€)"] at /tmp/unicode_test.pl line 39
Re: Add Unicode Support to the DBI
On 08/11/2011 17:53, David E. Wheeler wrote: On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote: 1. Focus initially on categorising the capabilities of the databases. Specifically separating those that understand character encodings at one or more of column, table, schema, database level. Answer the questions: what "Unicode support" is this database capable of? [vague] are particular column data types or attributes needed? does the db have a session/connection encoding concept? does the db support binary data types. does the client api identify data encoding? A table summarizing this kind of info would be of great value. I think this is the most important kind of data we need to move forward with this topic. I suspect we'll end up with a few clear levels of "unicode support" by databases that we can then focus on more clearly. +1. Yes, this should make things pretty clear. 2. Try to make a data-driven common test script. It should fetch the length of the stored value, something like: CREATE TABLE t (c VARCHAR(10)); INSERT INTO t VALUES (?)<= $sth->execute("\x{263A}") # simley SELECT LENGTH(c), c FROM t Fetching the LENGTH is important because it tells us if the DB is treating the value as Unicode. The description of DBD::Unify, for example, doesn't clarify if the db itself regards the stored value as unicode or the underlying string of encoded bytes. Also probably best to avoid latin characters for this, I'd use something that always has a multi-byte encoding, like a simley face char. And something that doesn't have a variant that uses combining characters, so that the length should be consistent if it's treated as Unicode. 3. Focus on placeholders initially. We can ponder utf8 in literal SQL later. That's a separate ball of mud. (I'd also ignore unicode table/column/db names. It's a much lower priority and may become clearer when other issues get resolved.) +1, though good to know about. Just as important as placeholders, however, is fetching data. 4. Tests could report local LANG / LC_ALL env var value so when others report their results we'll have that context. Thanks again. I've only given it a quick skim. I'll read it again before LPW. Meanwhile, it would be great if people could contribute the info for #1. Tim. p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff would make the tests shorter. my $sample_string = "\x{263A}"; ... print data_diff($sample_string, $returned_string); Can this be turned into a complete script we can all just run? Thanks, David I've just checked in unicode_test.pl to DBI's subversion trunk in /ex dir. It won't run right now without changing the do_connect sub as you have to specify how to connect to the DB. Also, there is a DBD specific section at the start where you might have to add a DBD it does not know about (anything other than DBD::ODBC, DBD::Oracle, DBD::SQLite, DBD::CSV, DBD::mysql) if it needs something other than the defaults e.g., the name of the length function in SQL, the column type for unicode columns and binary columns, the setting to enable UTF8/Unicode support. It could be a bit of a pain if your DBD does not support type_info_all but I'm around on irc and in this list if anyone wants any help making it work. It needs rather a lot of tidying up so I'm not putting it forward as code-of-the-year but it is a start. BTW, you'll need Test::More::UTF8 and perhaps a couple of other non core modules to run it. Martin
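To make the "DBD specific section" described above concrete, here is a rough sketch of the shape such a per-driver table can take. The connection attributes shown (mysql_enable_utf8, sqlite_unicode, pg_enable_utf8) are the ones those drivers document, and char_length is the function DBD::CSV and DBD::mysql use as mentioned elsewhere in this thread; the remaining values are illustrative defaults rather than the actual contents of unicode_test.pl.

# Per-driver overrides; anything not listed falls back to the defaults.
my %driver_config = (
    default => {
        length_fn    => 'length',        # SQL function returning a character count
        char_column  => 'varchar(100)',  # column type used for unicode data
        blob_column  => 'blob',          # column type used for binary data
        connect_attr => {},              # attribute(s) enabling unicode support, if any
    },
    mysql  => { length_fn    => 'char_length',
                connect_attr => { mysql_enable_utf8 => 1 } },
    CSV    => { length_fn    => 'char_length' },
    SQLite => { connect_attr => { sqlite_unicode => 1 } },
    Pg     => { connect_attr => { pg_enable_utf8 => 1 } },
);

my $driver = shift(@ARGV) // 'default';  # first argument is the driver to test
my %cfg = ( %{ $driver_config{default} }, %{ $driver_config{$driver} // {} } );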
Re: Add Unicode Support to the DBI
On 08/11/2011 13:16, Tim Bunce wrote: On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote: I didn't think I was going to make LPW but it seems I will now - although it has cost me big time leaving it until the last minute. All your beers at LPW are on me! http://www.martin-evans.me.uk/node/121 Great work Martin. Many thanks. I've some comments and suggestions for you... It says "There is no single way across DBDs to enable Unicode support" but doesn't define what "Unicode support" actually means. Clearly the "Unicode support" of Oracle will be different to that of a CSV file. So it seems that we need to be really clear about what we want. I'd suggest... 1. Focus initially on categorising the capabilities of the databases. Specifically separating those that understand character encodings at one or more of column, table, schema, database level. Answer the questions: what "Unicode support" is this database capable of? [vague] are particular column data types or attributes needed? does the db have a session/connection encoding concept? does the db support binary data types. does the client api identify data encoding? A table summarizing this kind of info would be of great value. I think this is the most important kind of data we need to move forward with this topic. I suspect we'll end up with a few clear levels of "unicode support" by databases that we can then focus on more clearly. this will take some time but I'll work on it. It could be a nightmare for DBD::ODBC as it supports dozens of dbs so I'll have to be a bit cagey about the stuff like "what "Unicode support" is this database capable of". Contributions welcome. 2. Try to make a data-driven common test script. It should fetch the length of the stored value, something like: CREATE TABLE t (c VARCHAR(10)); INSERT INTO t VALUES (?)<= $sth->execute("\x{263A}") # simley SELECT LENGTH(c), c FROM t Fetching the LENGTH is important because it tells us if the DB is treating the value as Unicode. The description of DBD::Unify, for example, doesn't clarify if the db itself regards the stored value as unicode or the underlying string of encoded bytes. Changed to add length(c) test - sorry misread that first time as Perl code length(c) not SQL - I'm not well enough to think so clearly right now. Will take some time to run through the DBDs to check whether it is length()/len() or something else - isn't SQL great e.g., DBD::CSV (SQL::Statement) is char_length and most other SQLs are len or length. BTW, DBD::mysql fails the length test using the length function char_length (documented as Return number of characters in argument) - it returns 3 for the smiley not 1, not investigated why yet. Also probably best to avoid latin characters for this, I'd use something that always has a multi-byte encoding, like a simley face char. changed to use smiley face for data but not for table/column name checks as most dbs have other rules on table/column names. 3. Focus on placeholders initially. We can ponder utf8 in literal SQL later. That's a separate ball of mud. (I'd also ignore unicode table/column/db names. It's a much lower priority and may become clearer when other issues get resolved.) test already does that so left in - it is easy to comment it out - one line change. 4. Tests could report local LANG / LC_ALL env var value so when others report their results we'll have that context. 
added although I stopped short of outputting everything in %ENV as I guessed people would end up having to edit it to remove stuff - it does LANG, LC_* and NLS_* right now. Thanks again. I've only given it a quick skim. I'll read it again before LPW. Meanwhile, it would be great if people could contribute the info for #1. Tim. p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff would make the tests shorter. my $sample_string = "\x{263A}"; ... print data_diff($sample_string, $returned_string); added or data_diff() to tests which fail - is($x, $y, "test") or data_diff($x,$y) Test::More::is output is horrible for unicode on a non UTF-8 enabled terminal. BTW, the smiley character causes me problems as for some reason with MS SQL Server you can do: insert into table values(?) bind_param("\{many_unicode_chars}") select column from table where column = $h->quote("\{many_unicode_chars}") but for some reason the smiley does not work in the select SQL even though the insert works :-( A parameterized select works fine too. I will upload the current script to DBI's subversion tree in ex dir but if anyone is going to do anything major with it I'd rather they mail say dbi-dev or me first and I can at least check that before making any changes myself (at least until after LPW). Having said that as the birthday boy tomorrow I can al
Re: Add Unicode Support to the DBI
On Nov 8, 2011, at 5:16 AM, Tim Bunce wrote: > 1. Focus initially on categorising the capabilities of the databases. >Specifically separating those that understand character encodings >at one or more of column, table, schema, database level. >Answer the questions: >what "Unicode support" is this database capable of? [vague] >are particular column data types or attributes needed? >does the db have a session/connection encoding concept? >does the db support binary data types. >does the client api identify data encoding? >A table summarizing this kind of info would be of great value. >I think this is the most important kind of data we need to move >forward with this topic. I suspect we'll end up with a few clear >levels of "unicode support" by databases that we can then focus on >more clearly. +1. Yes, this should make things pretty clear. > 2. Try to make a data-driven common test script. >It should fetch the length of the stored value, something like: >CREATE TABLE t (c VARCHAR(10)); >INSERT INTO t VALUES (?) <= $sth->execute("\x{263A}") # simley >SELECT LENGTH(c), c FROM t >Fetching the LENGTH is important because it tells us if the DB is >treating the value as Unicode. The description of DBD::Unify, for >example, doesn't clarify if the db itself regards the stored value >as unicode or the underlying string of encoded bytes. >Also probably best to avoid latin characters for this, I'd use >something that always has a multi-byte encoding, like a simley face char. And something that doesn't have a variant that uses combining characters, so that the length should be consistent if it's treated as Unicode. > 3. Focus on placeholders initially. >We can ponder utf8 in literal SQL later. That's a separate ball of mud. >(I'd also ignore unicode table/column/db names. It's a much lower >priority and may become clearer when other issues get resolved.) +1, though good to know about. Just as important as placeholders, however, is fetching data. > 4. Tests could report local LANG / LC_ALL env var value >so when others report their results we'll have that context. > > Thanks again. I've only given it a quick skim. I'll read it again before LPW. > > Meanwhile, it would be great if people could contribute the info for #1. > > Tim. > > p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff > would make the tests shorter. >my $sample_string = "\x{263A}"; >... >print data_diff($sample_string, $returned_string); Can this be turned into a complete script we can all just run? Thanks, David
Re: Add Unicode Support to the DBI
On Tue, Nov 08, 2011 at 02:45:39PM +, Martin J. Evans wrote: > On 08/11/11 13:16, Tim Bunce wrote: > >On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote: > > >2. Try to make a data-driven common test script. > > There is already one attached to the bottom of the post and referred to in > the post - probably was not very clear. Ah, it's in the middle, in the "DBDs" section. Could you put it in github or the DBI repo? I may find some time to hack on it. > > It should fetch the length of the stored value, something like: > > CREATE TABLE t (c VARCHAR(10)); > > INSERT INTO t VALUES (?)<= $sth->execute("\x{263A}") # simley > > SELECT LENGTH(c), c FROM t > > It does that with a euro \x{20ac} > > > Fetching the LENGTH is important because it tells us if the DB is > > treating the value as Unicode. > > It doesn't do that but it checks what went in \x{20ac} is what comes > out which is the same as checking the length since what goes in is a > euro character. For a euro to come back out at the minimum the DBD > would have to decode the data from the database. It's not only important that what goes in comes back out again, but also that what goes in is interpreted by the database in the right way. Otherwise there are lots of more subtle issues like sorting bugs caused by the database 'seeing' encoded bytes instead of unicode characters. (Hence my point about the description for DBD::Unify being incomplete.) It's _possible_ for a db to store a euro as a single byte. So it's possible for the test to yield a false positive. I prefer using a simley because it's not possible to store it in a single byte. So if LENGTH(c) returns 1 we can be very confident that the db is interpreting the data correctly. > >Thanks again. I've only given it a quick skim. I'll read it again before LPW. > > I will try and update it before then but a) it is my birthday tomorrow and b) > I'm a bit under the weather at the moment. HAPPY BIRTHDAY! Get well soon! > >Meanwhile, it would be great if people could contribute the info for #1. > > I am happy to collect any such info and write it up so please at least cc me. > >p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff > >would make the tests shorter. > > my $sample_string = "\x{263A}"; > > ... > > print data_diff($sample_string, $returned_string); > > Yes, the output is a bit screwed because I was running the test code > on Windows in a dos terminal and I've never got that working properly. Ouch. > Data::Dumper produces better output but I forgot data_diff - I will > change where relevant. > > BTW, the example code in each DBD was not really the test, it was just > an example. My aim was to produce one script which ran to any DBD and > that is attached to the end of the blog post. It's just it was too > long to incorporate into a blog posting whereas the short simple > examples were better. Yeap. Good idea. I'd simply missed the link when I skimmed it. Thanks again for driving this forward Martin. Tim.
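Tim's false-positive point is easy to demonstrate outside any database: the euro has a one-byte form in several legacy encodings, while the smiley does not, so LENGTH(c) = 1 for a stored euro proves nothing about Unicode handling. A throwaway snippet (not part of the test script) showing the byte counts:

use Encode ();

my $euro   = "\x{20AC}";   # EURO SIGN
my $smiley = "\x{263A}";   # WHITE SMILING FACE

for my $enc ('cp1252', 'iso-8859-15', 'UTF-8') {
    my $euro_len   = length Encode::encode($enc, $euro);
    my $smiley_len = eval { length Encode::encode($enc, $smiley, Encode::FB_CROAK) };
    printf "%-11s euro = %d byte(s), smiley = %s\n",
        $enc, $euro_len,
        defined $smiley_len ? "$smiley_len byte(s)" : "not representable";
}
# cp1252 and iso-8859-15 hold the euro in a single byte, so a database using
# one of them can report LENGTH = 1 without understanding Unicode at all;
# the smiley has no single-byte representation anywhere.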
Re: Add Unicode Support to the DBI
On 08/11/11 13:16, Tim Bunce wrote: On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote: I didn't think I was going to make LPW but it seems I will now - although it has cost me big time leaving it until the last minute. All your beers at LPW are on me! http://www.martin-evans.me.uk/node/121 Great work Martin. Many thanks. I've some comments and suggestions for you... It says "There is no single way across DBDs to enable Unicode support" but doesn't define what "Unicode support" actually means. Clearly the "Unicode support" of Oracle will be different to that of a CSV file. Point taken. So it seems that we need to be really clear about what we want. I'd suggest... 1. Focus initially on categorising the capabilities of the databases. Specifically separating those that understand character encodings at one or more of column, table, schema, database level. Answer the questions: what "Unicode support" is this database capable of? [vague] are particular column data types or attributes needed? does the db have a session/connection encoding concept? does the db support binary data types. does the client api identify data encoding? A table summarizing this kind of info would be of great value. I think this is the most important kind of data we need to move forward with this topic. I suspect we'll end up with a few clear levels of "unicode support" by databases that we can then focus on more clearly. 2. Try to make a data-driven common test script. There is already one attached to the bottom of the post and referred to in the post - probably was not very clear. It should fetch the length of the stored value, something like: CREATE TABLE t (c VARCHAR(10)); INSERT INTO t VALUES (?)<= $sth->execute("\x{263A}") # simley SELECT LENGTH(c), c FROM t It does that with a euro \x{20ac} Fetching the LENGTH is important because it tells us if the DB is treating the value as Unicode. It doesn't do that but it checks what went in \x{20ac} is what comes out which is the same as checking the length since what goes in is a euro character. For a euro to come back out at the minimum the DBD would have to decode the data from the database. It also does this with table and column names and table_info/column_info. The description of DBD::Unify, for example, doesn't clarify if the db itself regards the stored value as unicode or the underlying string of encoded bytes. Also probably best to avoid latin characters for this, I'd use something that always has a multi-byte encoding, like a simley face char. I mostly do that but had problems creating tables and columns with a euro in the name (which I can sort of understand as I can believe there are more rules to valid table/column names like they must be alphabetic but could still be unicode). The code to create unicode table/column names uses \x{0100} which is fredĀ (LATIN CAPITAL LETTER A WITH MACRON) which I think is ok. 3. Focus on placeholders initially. We can ponder utf8 in literal SQL later. That's a separate ball of mud. (I'd also ignore unicode table/column/db names. It's a much lower priority and may become clearer when other issues get resolved.) Yeah, I only added that because I knew in DBD::ODBC it did not work because the interface is char * and not Perl scalar. 4. Tests could report local LANG / LC_ALL env var value so when others report their results we'll have that context. Yes, forgot that - will add it to the test. Thanks again. I've only given it a quick skim. I'll read it again before LPW. 
I will try and update it before then but a) it is my birthday tomorrow and b) I'm a bit under the weather at the moment. Meanwhile, it would be great if people could contribute the info for #1. I am happy to collect any such info and write it up so please at least cc me. Tim. p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff would make the tests shorter. my $sample_string = "\x{263A}"; ... print data_diff($sample_string, $returned_string); Yes, the output is a bit screwed because I was running the test code on Windows in a dos terminal and I've never got that working properly. Data::Dumper produces better output but I forgot data_diff - I will change where relevant. BTW, the example code in each DBD was not really the test, it was just an example. My aim was to produce one script which ran to any DBD and that is attached to the end of the blog post. It's just it was too long to incorporate into a blog posting whereas the short simple examples were better. Unfortunately the differences between DBDs (and some limitations and bugs) is making the script quite complex in places e.g., one which caught me out was you have to set LongReadLen in DBD::Oracle before calling prepare but all other
Re: Add Unicode Support to the DBI
On Tue, 8 Nov 2011 13:16:17 +, Tim Bunce wrote: > On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote: > > > > > > I didn't think I was going to make LPW but it seems I will now - > > > although it has cost me big time leaving it until the last minute. > > All your beers at LPW are on me! > > > http://www.martin-evans.me.uk/node/121 > > Great work Martin. Many thanks. I more and more regret I cannot be there. Please, beside doing useful stuff, have FUN! -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00, 11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Re: Add Unicode Support to the DBI
On Mon, Nov 07, 2011 at 01:37:38PM +, Martin J. Evans wrote:
>
> I didn't think I was going to make LPW but it seems I will now - although it
> has cost me big time leaving it until the last minute.

All your beers at LPW are on me!

> http://www.martin-evans.me.uk/node/121

Great work Martin. Many thanks. I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support" but doesn't define what "Unicode support" actually means. Clearly the "Unicode support" of Oracle will be different to that of a CSV file. So it seems that we need to be really clear about what we want. I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
   Specifically separating those that understand character encodings
   at one or more of column, table, schema, database level.
   Answer the questions:
   what "Unicode support" is this database capable of? [vague]
   are particular column data types or attributes needed?
   does the db have a session/connection encoding concept?
   does the db support binary data types?
   does the client api identify data encoding?
   A table summarizing this kind of info would be of great value.
   I think this is the most important kind of data we need to move
   forward with this topic. I suspect we'll end up with a few clear
   levels of "unicode support" by databases that we can then focus on
   more clearly.

2. Try to make a data-driven common test script.
   It should fetch the length of the stored value, something like:
   CREATE TABLE t (c VARCHAR(10));
   INSERT INTO t VALUES (?)    <= $sth->execute("\x{263A}") # smiley
   SELECT LENGTH(c), c FROM t
   Fetching the LENGTH is important because it tells us if the DB is
   treating the value as Unicode. The description of DBD::Unify, for
   example, doesn't clarify if the db itself regards the stored value
   as unicode or the underlying string of encoded bytes.
   Also probably best to avoid latin characters for this; I'd use
   something that always has a multi-byte encoding, like a smiley face char.

3. Focus on placeholders initially.
   We can ponder utf8 in literal SQL later. That's a separate ball of mud.
   (I'd also ignore unicode table/column/db names. It's a much lower
   priority and may become clearer when other issues get resolved.)

4. Tests could report local LANG / LC_ALL env var value
   so when others report their results we'll have that context.

Thanks again. I've only given it a quick skim. I'll read it again before LPW.

Meanwhile, it would be great if people could contribute the info for #1.

Tim.

p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff would make the tests shorter.
   my $sample_string = "\x{263A}";
   ...
   print data_diff($sample_string, $returned_string);
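Items 2 and 4 above, plus the data_diff() suggestion, fit in a few lines of DBI. The sketch below uses an in-memory DBD::SQLite database purely as an example; the DSN, the sqlite_unicode attribute and the exact spelling of the LENGTH function are precisely the per-driver details a common script has to parameterise.

use strict;
use warnings;
use DBI qw(data_diff);
use Test::More;

# item 4: record the locale the results were produced under
diag "LANG=" . ($ENV{LANG} // '') . " LC_ALL=" . ($ENV{LC_ALL} // '');

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, sqlite_unicode => 1 });

my $smiley = "\x{263A}";                     # WHITE SMILING FACE
$dbh->do('CREATE TABLE t (c VARCHAR(10))');
$dbh->do('INSERT INTO t VALUES (?)', undef, $smiley);

my ($len, $c) = $dbh->selectrow_array('SELECT LENGTH(c), c FROM t');
is($len, 1,       'database sees one character, not its encoded bytes');
is($c,   $smiley, 'round trip preserves the character')
    or diag data_diff($smiley, $c);
done_testing();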
Re: Add Unicode Support to the DBI
On 04/11/11 08:39, Martin J. Evans wrote: On 03/11/11 23:25, David E. Wheeler wrote: On Oct 7, 2011, at 5:06 PM, David E. Wheeler wrote: Perhaps we could carve out some time at LPW to sit together and try to progress this. That would be awesome you guys! So gents, do you plan to do this a bit? Martin, do you have the data you wanted to collect on this? Thanks, David Sorry David, I've been snowed under. I will try very hard to publish the research I found this weekend. I didn't think I was going to make LPW but it seems I will now - although it has cost me big time leaving it until the last minute. Martin http://www.martin-evans.me.uk/node/121 Martin -- Martin J. Evans Easysoft Limited http://www.easysoft.com
Re: Add Unicode Support to the DBI
On 05/10/2011 00:06, Jonathan Leffler wrote: On Tue, Oct 4, 2011 at 15:24, Martin J. Evanswrote: On 04/10/2011 22:38, Tim Bunce wrote: I've not had time to devote to this thread. Sorry. I'd be grateful if someone could post a summary of it if/when it approaches some kind of consensus. I don't think there is a "kind of consensus" right now (although some useful discussion which probably will bear fruit) and I'd prefer to work out what unicode support already exists and how it is implemented first. For instance, Pg is very focussed on UTF-8 (as are most DBDs) and yet ODBC uses UCS2 under the hood and CSV can use anything you like. Greg/David/Postgres seem to have an immediate problem with unicode support in Postgres and I can imagine they are keen to resolve it and I'd suggest they do it now in the most appropriate way for DBD::Pg. I don't see why this should necessarily impact on any discussion as to what DBI should_do/should_say as already the DBDs which support unicode mostly do it in different ways. I've started gathering together details of what unicode support there is in DBDs, how it is implemented and what special flags there are to support it. However, this is a massive task. So far I've done ODBC, Oracle, CSV, Unify, mysql, SQLite, Firebird and sort of held off on Pg as I knew Greg was working on it. Some might disagree but DB2 is a main one I no longer have access to (please contact me if you use DBD::DB2 and are prepared to spare half an hour or so to modify examples I have which verify unicode support). Of course, if you use another DBD and can send me info on unicode support I'd love to hear from you. I thought the whole issue was an interesting topic and I had toyed with doing a talk for LPW but to be honest, it is already taking a lot of time and I have personal issues right now (and of course my $work) which mean my time is severely limited so I'm doubtful right now if I could have it ready in time as a talk. I might just post what I have gathered in a weeks time in the hope I get a little more input in the mean time. DBD::Informix has had a couple of UTF-8 patches sent and one has been applied to the code. The other arrived this morning and has to be compared. The attribute names chosen are different, but both contain 'ix_' as a prefix and UTF-8 in some form or another. I couldn't find a respository for Informix so I couldn't see what these changes were. I've added your comments to my document - to be released tonight hopefully. What I'm not sure about is how to test the code. Creating an Informix database that has UTF-8 data in it is trivial (well, nearly trivial). The difficulty is demonstrating where there were problems before and that the problems are gone after. If anyone has suggestions for how to show that UTF8 is working properly - in the form of a fairly simple test case - I'd be very grateful to receive it as guidance. Attached to this post is some code I'm using however: o it tries to use type_info_all to find relevant column types and so far DBD::SQLite is the only one not to support type_info_all. If you've got type_info_all support try and make it work. You'll need to look for the find_type calls (2 of them) and adjust to add informix SQL types. Alternatively, if you have not got type_info_all support you'll need to replicate what I did for DBD::SQLite - see start of script. The code attempts to work around the problem that some DBDs do not report all possible columns back from type_info_all. 
o few DBDs seem to support column_info_all - if you have not got it - that test is skipped o you'll need an in.png PNG image which should preferably be smaller than the size of your blob column type. It does not matter what it is - any old small png will do. o no drivers so far support unicode parameter markers - what a surprise o some drivers have issues with blobs and unicode data in the same table - DBD::CSV notably o some drivers cannot create unicode table names or tables with unicode column names e.g., DBD::mysql. o no driver so far can do a table_info with a specified unicode (and existing) table name and return the table info. I suspect this is because the table_info and column_info calls are in XS using char * instead of Perl scalars. If you make this work for Informix please send it back to me. Same applies to any other DBD maintainers. So, DBD::Informix is endeavouring to move forward, knowing that the underlying database layers (ESQL/C on the client, and the Informix data server) handle UTF-8 OK, so any problems are in persuading Perl (and perhaps DBI) to handle it appropriately too. [...Did I hear a chorus of "nice theory - shame about the practice"?...] Martin # # Test unicode in a DBD - written for DBD::ODBC but should work for other # DBDs if you change the column types at the start of this script. # To run properly it needs an in.png PNG image file in the local working # directory but it does not matter what
Re: Add Unicode Support to the DBI
On Nov 4, 2011, at 10:33 AM, Martin J. Evans wrote: >> Did you ever get any data from DBD::SQLite folks? > > Yes. I found a bug in the process and it was fixed but I have a working > SQLite example. Oh, great. > I'm only really missing DB2 but I have contacts for that on #dbix-class who > I've just not yet poked. Cool, thanks for the update. Best, David
Re: Add Unicode Support to the DBI
On 04/11/11 16:39, David E. Wheeler wrote: On Nov 4, 2011, at 1:39 AM, Martin J. Evans wrote: Sorry David, I've been snowed under. I will try very hard to publish the research I found this weekend. Awesome, thanks. Did you ever get any data from DBD::SQLite folks? Yes. I found a bug in the process and it was fixed but I have a working SQLite example. I didn't think I was going to make LPW but it seems I will now - although it has cost me big time leaving it until the last minute. Your sacrifice is greatly appreciated. :-) Best, David I'm only really missing DB2 but I have contacts for that on #dbix-class who I've just not yet poked. Martin -- Martin J. Evans Easysoft Limited http://www.easysoft.com
Re: Add Unicode Support to the DBI
On Nov 4, 2011, at 1:39 AM, Martin J. Evans wrote: > Sorry David, I've been snowed under. I will try very hard to publish the > research I found this weekend. Awesome, thanks. Did you ever get any data from DBD::SQLite folks? > I didn't think I was going to make LPW but it seems I will now - although it > has cost me big time leaving it until the last minute. Your sacrifice is greatly appreciated. :-) Best, David
Re: Add Unicode Support to the DBI
On 03/11/11 23:25, David E. Wheeler wrote: On Oct 7, 2011, at 5:06 PM, David E. Wheeler wrote: Perhaps we could carve out some time at LPW to sit together and try to progress this. That would be awesome you guys! So gents, do you plan to do this a bit? Martin, do you have the data you wanted to collect on this? Thanks, David Sorry David, I've been snowed under. I will try very hard to publish the research I found this weekend. I didn't think I was going to make LPW but it seems I will now - although it has cost me big time leaving it until the last minute. Martin -- Martin J. Evans Easysoft Limited http://www.easysoft.com
Re: Add Unicode Support to the DBI
On Oct 7, 2011, at 5:06 PM, David E. Wheeler wrote: >> Perhaps we could carve out some time at LPW to sit together and try to >> progress this. > > That would be awesome you guys! So gents, do you plan to do this a bit? Martin, do you have the data you wanted to collect on this? Thanks, David
Re: Add Unicode Support to the DBI
On Oct 13, 2011, at 6:03 AM, Greg Sabino Mullane wrote: >> I think what I haven't said is that we should just use the same >> names that Perl I/O uses. Er, well, for the :raw and :utf8 >> varieties I was, anyway. Perhaps we should adopt it wholesale, >> so you'd use ":encoding(UTF-8)" instead of "UTF-8". > > That's pretty ugly. I don't think we need to adopt the I/O > convention, as there is no direct mapping anyway, it just > confuses the issue. Sure. In that case, I'd say ":utf8", ":raw", or "$encoding". >> For DBD::Pg, at least, if client-encoding is set to Big5, then >> you *have* to encode to send it to the database. Or change the >> client encoding, of course. > > Not sure I'm following this completely. Or rather, why this should > be the DBDs role. By default, yes, the DBD should DTRT here. But I think there also ought to be a way to tell it what to do. >>> How would one map things - just demand that >>> whatever is given must be a literal encoding the particular database >>> can understand? >> >> I think we should standardize on the Perl IO names for these things. >> Some databases may not support them all, of course. > > Hm... I don't know enough about the various DB's encodings to see > how good an idea that is. I assume that it's all over the map, so we should be as general as we can. Specifying an encoding by name should cover everything. >> No, I think just "encoding", and "utf8" would be invalid, >> but ":encoding(UTF-8)" would not. > > Again, ugh. Although a *little* less confusing when contrasting: > > $dbh->{encoding} = ':encoding(utf-8)'; > > $dbh->{encoding} = 'utf8'; Yeah, or we can go with my original suggestion: $dbh->{encoding} = 'UTF-8'; $dbh->{encoding} = ':utf8'; >> Well, I think we might have to have it with the pg_prefix until >> this stuff is finalized here. Not sure, though. > > That's my point - if we can get it finalized here, we can avoif the > pg_prefix entirely, rather than add it now and then deprecate it later. Sure. I suspect this is going to take a while, though. Best, David
Re: Add Unicode Support to the DBI
-BEGIN PGP SIGNED MESSAGE-
Hash: RIPEMD160

David E. Wheeler wrote:
> I think what I haven't said is that we should just use the same
> names that Perl I/O uses. Er, well, for the :raw and :utf8
> varieties I was, anyway. Perhaps we should adopt it wholesale,
> so you'd use ":encoding(UTF-8)" instead of "UTF-8".

That's pretty ugly. I don't think we need to adopt the I/O convention, as there is no direct mapping anyway; it just confuses the issue.

> For DBD::Pg, at least, if client-encoding is set to Big5, then
> you *have* to encode to send it to the database. Or change the
> client encoding, of course.

Not sure I'm following this completely. Or rather, why this should be the DBD's role.

>> How would one map things - just demand that
>> whatever is given must be a literal encoding the particular database
>> can understand?
>
> I think we should standardize on the Perl IO names for these things.
> Some databases may not support them all, of course.

Hm... I don't know enough about the various DBs' encodings to see how good an idea that is.

>> So the above means these two actually behave very differently:
>>
>> $dbh->{encoding} = ':utf8';
>>
>> $dbh->{encoding} = 'utf8';
>>
>> Could be a little confusing, no? Methinks we need some long ugly name, maybe
>> even worse than "perl_native". Perhaps "perl_internal_utf8_flag"? 1/2 :)

> No, I think just "encoding", and "utf8" would be invalid,
> but ":encoding(UTF-8)" would not.

Again, ugh. Although a *little* less confusing when contrasting:
>
> $dbh->{encoding} = ':encoding(utf-8)';
>
> $dbh->{encoding} = 'utf8';

Yeah, or we can go with my original suggestion:

$dbh->{encoding} = 'UTF-8';
$dbh->{encoding} = ':utf8';

>> Well, I think we might have to have it with the pg_ prefix until
>> this stuff is finalized here. Not sure, though.
>
> That's my point - if we can get it finalized here, we can avoid the
> pg_ prefix entirely, rather than add it now and then deprecate it later.

- --
Greg Sabino Mullane g...@turnstep.com
PGP Key: 0x14964AC8 201110130902
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iEYEAREDAAYFAk6W4ZQACgkQvJuQZxSWSsiqUQCgo/icUz0enqn0BWSygNSeNJGW
lDsAoMbjgZrsGJyS7kS60RgNNkpXMIjG
=43Q3
-END PGP SIGNATURE-
Re: Add Unicode Support to the DBI
On Oct 7, 2011, at 1:47 AM, Tim Bunce wrote: > Perhaps we could carve out some time at LPW to sit together and try to > progress this. That would be awesome you guys! D
Re: Add Unicode Support to the DBI
On Tue, Oct 04, 2011 at 11:24:51PM +0100, Martin J. Evans wrote: > On 04/10/2011 22:38, Tim Bunce wrote: > >I've not had time to devote to this thread. Sorry. > > > >I'd be grateful if someone could post a summary of it if/when it > >approaches some kind of consensus. > I don't think there is a "kind of consensus" right now (although > some useful discussion which probably will bear fruit) and I'd > prefer to work out what unicode support already exists and how it is > implemented first. [...] > > I've started gathering together details of what unicode support > there is in DBDs, how it is implemented and what special flags there > are to support it. However, this is a massive task. So far I've done > ODBC, Oracle, CSV, Unify, mysql, SQLite, Firebird That seems like enough to make a start on design issues. There's certainly a wide range of databases in that list :) > examples I have which verify unicode support Are the examples in a repo somewhere? > I thought the whole issue was an interesting topic and I had toyed > with doing a talk for LPW but to be honest, it is already taking a > lot of time and I have personal issues right now (and of course my > $work) which mean my time is severely limited so I'm doubtful right > now if I could have it ready in time as a talk. I might just post > what I have gathered in a weeks time in the hope I get a little more > input in the mean time. Good idea. I'd like to see the examples sometime to see what scope they have and if/how the might be improved. Perhaps we could carve out some time at LPW to sit together and try to progress this. Tim.
Re: Add Unicode Support to the DBI
On Oct 6, 2011, at 8:56 AM, Greg Sabino Mullane wrote:

>> I still prefer an encoding attribute that you can set as follows:
>
>> * undef: Default; same as your A.
>> * ':utf8': Same as your B:
>> * ':raw': Same as your C
>> * $encoding: Encode/decode to/from $encoding
>
> I like that. Although the names are still odd.

I think what I haven't said is that we should just use the same names that Perl I/O uses. Er, well, for the :raw and :utf8 varieties I was, anyway. Perhaps we should adopt it wholesale, so you'd use ":encoding(UTF-8)" instead of "UTF-8". Some details on the PerlIO names are here:

https://metacpan.org/module/open

I'm sure there's a more detailed description somewhere, though I couldn't find it.

> I guess it does map though: raw means no utf8 flag.

Right. And the default would be either :raw or :utf8, and should probably be left to the DBD (depending on the importance of backward compatibility vs. trying to DTRT).

> Still not sure about the encode 'to', but I'll start thinking about how we could implement the 'from' in DBD::Pg.

For DBD::Pg, at least, if client-encoding is set to Big5, then you *have* to encode to send it to the database. Or change the client encoding, of course.

> How would one map things - just demand that whatever is given must be a literal encoding the particular database can understand?

I think we should standardize on the Perl IO names for these things. Some databases may not support them all, of course.

>> With an encoding attribute, you don't need the utf8_flag at all.
>
> Right, +1
>
> So the above means these two actually behave very differently:
>
> $dbh->{encoding} = ':utf8';
>
> $dbh->{encoding} = 'utf8';
>
> Could be a little confusing, no? Methinks we need some long ugly name, maybe even worse than "perl_native". Perhaps "perl_internal_utf8_flag"? 1/2 :)

No, I think just "encoding", and "utf8" would be invalid, but ":encoding(UTF-8)" would not.

> Thanks for plugging away at this. My short term goal is to get this finalized enough that I can release the next version of DBD::Pg without a 'pg_' prefix to control the encoding items.

Well, I think we might have to have it with the pg_ prefix until this stuff is finalized here. Not sure, though.

Best,

David
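To make the Big5 point concrete: a rough sketch of what the application has to do today when the client encoding is Big5 and the driver passes octets through untouched (under the proposal, the driver would do this on the application's behalf). The DSN, table and column names are placeholders.

use strict;
use warnings;
use DBI;
use Encode ();

my $dbh = DBI->connect('dbi:Pg:dbname=test', '', '', { RaiseError => 1 });
$dbh->do(q{SET client_encoding TO 'BIG5'});

my $name   = "\x{4E2D}\x{6587}";                                # Perl character string
my $octets = Encode::encode('big5', $name, Encode::FB_CROAK);   # encode before sending

$dbh->do('INSERT INTO people (name) VALUES (?)', undef, $octets);

my ($raw) = $dbh->selectrow_array('SELECT name FROM people WHERE name = ?',
                                  undef, $octets);
my $chars = Encode::decode('big5', $raw, Encode::FB_CROAK);     # decode on the way back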
Re: Add Unicode Support to the DBI
> Uh, say what? Just as I need to
>
> binmode STDOUT, ':utf8';
>
> Before sending stuff to STDOUT (that is, turn off the flag), I would expect DBDs to do the same before sending data to the database. Unless, of course, it "just works".

I cannot imagine the flag really matters either way. We (Pg) simply dump a bunch of chars to the database, and build it by slurping in the string character by character until we hit a null. I suppose other databases may do things differently, but I can't imagine how/why.

>> Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.
>
> Yeah, better, though it just perpetuates Perl's unfortunate use of the term "utf8" for "internal string representation." Though I suppose that ship has sunk already.

Yep. To paraphrase horribly, "Perl's unicode support is the worst, except for all the other languages".

>> Because it may still need to convert things. See the ODBC discussion.
>
> Oh, so you're saying it will decode and encode between Perl's internal form and UTF-8, rather than just flip the flag on and off?

Yes, that's a possibility.

> Yes, because you were only talking about utf8 and UTF-8, not any other encodings. Unless I missed something. If the data coming back from the DB is Big5, I may well want to have some way to decode it (and to encode it for write statements).

You mean at the DBD level - such that you can say to the database, I don't care what encoding you stored it as, I want it encoded as X when you give it back to me? (update: yes, see below)

>> Well, because utf-8 is pretty much a defacto encoding, or at least way, way more popular than things like ucs2. Also, the Perl utf8 flag encourages us to put everything into UTF-8.
>
> Yeah, but again, that might be some reason to call it something else, like "perl_native" or something. The fact that it happens to be UTF-8 should be irrelevant. Er, except, I guess, you still have to know the encoding of the database.

Well, I wouldn't call it irrelevant, but at the end of the day, we can call it perl_native, but that's just going to cause people to look it up in the docs and then say "aha! that means the utf8 flag is on" and then they have "perl_native -> utf8" burned into their head. Or worse, "perl_native -> unicode". :)

>> * 'A': the default, it means the DBD should do the best thing, which in most cases means setting SvUTF8_on if the data coming back is UTF-8.
>> * 'B': (on). The DBD should make every effort to set SvUTF8_on for returned data, even if it thinks it may not be UTF-8.
>> * 'C': (off). The DBD should not call SvUTF8_on, regardless of what it thinks the data is.
>
> I still prefer an encoding attribute that you can set as follows:
> * undef: Default; same as your A.
> * ':utf8': Same as your B:
> * ':raw': Same as your C
> * $encoding: Encode/decode to/from $encoding

I like that. Although the names are still odd. I guess it does map though: raw means no utf8 flag. Still not sure about the encode 'to', but I'll start thinking about how we could implement the 'from' in DBD::Pg. How would one map things - just demand that whatever is given must be a literal encoding the particular database can understand?

> With an encoding attribute, you don't need the utf8_flag at all.

Right, +1

So the above means these two actually behave very differently:

$dbh->{encoding} = ':utf8';

$dbh->{encoding} = 'utf8';

Could be a little confusing, no? Methinks we need some long ugly name, maybe even worse than "perl_native". Perhaps "perl_internal_utf8_flag"? 1/2 :)

Thanks for plugging away at this. My short term goal is to get this finalized enough that I can release the next version of DBD::Pg without a 'pg_' prefix to control the encoding items.

--
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201110061151
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
Re: Add Unicode Support to the DBI
On Tue, 04 Oct 2011 23:24:51 +0100, "Martin J. Evans" wrote: > Some might disagree but DB2 is a main > one I no longer have access to (please contact me if you use DBD::DB2 > and are prepared to spare half an hour or so to modify examples I have > which verify unicode support). Of course, if you use another DBD and can > send me info on unicode support I'd love to hear from you. There are two DB2 users on PerlMonks, who are rather helpful in testing areas. Ask [Tanktalus] & [talexb] :) $ grep -i -w db2 < "30 FreeNode-#cbstream.log" | grep -i -e unicode -e utf Sep 30 19:28:20 [Tanktalus] /me contemplates how to get unicode strings back out of db2 ... Feb 27 16:29:01 [talexb] /me returns to fiddling with utf-8 in DB2. Mar 06 17:56:26 [talexb] [Tanktalus]: Hey, looks like I solved the utf-8 DB2 problem .. just increase the field to 3n from n and it seems to work. Dec 16 21:19:11 [Tanktalus] The DB does. Thus, I would expect that the DBD *could* request data in whatever encoding it wants. And since it would want utf8 to mesh with Perl properly, and every encoding that DB2 supports has a mapping to utf8, this should be doable. Dec 16 21:25:24 [Tanktalus] [ambrus]: yeah, I would like to see it as "mandatory" for byte strings that are already utf-8 (which the DBD driver should be able to tell trivially, at least on DB2), or for connections/statements/something where said decoding was requested. -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00, 11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
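The "increase the field to 3n from n" remark in that log is the usual byte-versus-character sizing issue: a column sized in bytes needs up to 3n bytes for n BMP characters once the data is UTF-8 encoded. In plain Perl terms (no DB2 required):

use strict;
use warnings;
use Encode ();

my $chars  = "\x{4E2D}\x{6587}\x{8A9E}";        # 3 characters
my $octets = Encode::encode('UTF-8', $chars);

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($octets);
# prints "3 characters, 9 UTF-8 bytes"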
Re: Add Unicode Support to the DBI
On Tue, Oct 4, 2011 at 15:24, Martin J. Evans wrote: > On 04/10/2011 22:38, Tim Bunce wrote: > >> I've not had time to devote to this thread. Sorry. >> >> I'd be grateful if someone could post a summary of it if/when it >> approaches some kind of consensus. >> >> I don't think there is a "kind of consensus" right now (although some > useful discussion which probably will bear fruit) and I'd prefer to work out > what unicode support already exists and how it is implemented first. For > instance, Pg is very focussed on UTF-8 (as are most DBDs) and yet ODBC uses > UCS2 under the hood and CSV can use anything you like. Greg/David/Postgres > seem to have an immediate problem with unicode support in Postgres and I can > imagine they are keen to resolve it and I'd suggest they do it now in the > most appropriate way for DBD::Pg. I don't see why this should necessarily > impact on any discussion as to what DBI should_do/should_say as already the > DBDs which support unicode mostly do it in different ways. > > I've started gathering together details of what unicode support there is in > DBDs, how it is implemented and what special flags there are to support it. > However, this is a massive task. So far I've done ODBC, Oracle, CSV, Unify, > mysql, SQLite, Firebird and sort of held off on Pg as I knew Greg was > working on it. Some might disagree but DB2 is a main one I no longer have > access to (please contact me if you use DBD::DB2 and are prepared to spare > half an hour or so to modify examples I have which verify unicode support). > Of course, if you use another DBD and can send me info on unicode support > I'd love to hear from you. > > I thought the whole issue was an interesting topic and I had toyed with > doing a talk for LPW but to be honest, it is already taking a lot of time > and I have personal issues right now (and of course my $work) which mean my > time is severely limited so I'm doubtful right now if I could have it ready > in time as a talk. I might just post what I have gathered in a weeks time in > the hope I get a little more input in the mean time. DBD::Informix has had a couple of UTF-8 patches sent and one has been applied to the code. The other arrived this morning and has to be compared. The attribute names chosen are different, but both contain 'ix_' as a prefix and UTF-8 in some form or another. What I'm not sure about is how to test the code. Creating an Informix database that has UTF-8 data in it is trivial (well, nearly trivial). The difficulty is demonstrating where there were problems before and that the problems are gone after. If anyone has suggestions for how to show that UTF8 is working properly - in the form of a fairly simple test case - I'd be very grateful to receive it as guidance. So, DBD::Informix is endeavouring to move forward, knowing that the underlying database layers (ESQL/C on the client, and the Informix data server) handle UTF-8 OK, so any problems are in persuading Perl (and perhaps DBI) to handle it appropriately too. [...Did I hear a chorus of "nice theory - shame about the practice"?...] -- Jonathan Leffler #include Guardian of DBD::Informix - v2011.0612 - http://dbi.perl.org "Blessed are we who can laugh at ourselves, for we shall never cease to be amused."
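A generic round-trip sketch along the lines Jonathan asks for might be a starting point. The DSN, table and the ix_EnableUTF8 attribute are placeholders (the real ix_* attribute name was still being decided), and the checks only make sense for a driver that is supposed to hand back decoded character data.

use strict;
use warnings;
use utf8;                      # string literals below are character data
use Test::More;
use DBI;

my $dbh = eval {
    DBI->connect('dbi:Informix:testdb', undef, undef,
                 { RaiseError => 1, PrintError => 0,
                   ix_EnableUTF8 => 1 });       # placeholder attribute name
};
plan skip_all => 'no test database available' unless $dbh;

my $original = "Ärger 中文 naïve";              # mixed Latin-1-range and CJK characters

$dbh->do('CREATE TEMP TABLE utf8_test (c VARCHAR(100))');
$dbh->do('INSERT INTO utf8_test (c) VALUES (?)', undef, $original);
my ($got) = $dbh->selectrow_array('SELECT c FROM utf8_test');

ok(utf8::is_utf8($got), 'driver returned a decoded character string');
is(length($got), length($original), 'length counted in characters, not octets');
is($got, $original, 'round trip preserved the data');

done_testing();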
Re: Add Unicode Support to the DBI
On 04/10/2011 22:38, Tim Bunce wrote: I've not had time to devote to this thread. Sorry. I'd be grateful if someone could post a summary of it if/when it approaches some kind of consensus. Thanks. Tim. I don't think there is a "kind of consensus" right now (although some useful discussion which probably will bear fruit) and I'd prefer to work out what unicode support already exists and how it is implemented first. For instance, Pg is very focussed on UTF-8 (as are most DBDs) and yet ODBC uses UCS2 under the hood and CSV can use anything you like. Greg/David/Postgres seem to have an immediate problem with unicode support in Postgres and I can imagine they are keen to resolve it and I'd suggest they do it now in the most appropriate way for DBD::Pg. I don't see why this should necessarily impact on any discussion as to what DBI should_do/should_say as already the DBDs which support unicode mostly do it in different ways. I've started gathering together details of what unicode support there is in DBDs, how it is implemented and what special flags there are to support it. However, this is a massive task. So far I've done ODBC, Oracle, CSV, Unify, mysql, SQLite, Firebird and sort of held off on Pg as I knew Greg was working on it. Some might disagree but DB2 is a main one I no longer have access to (please contact me if you use DBD::DB2 and are prepared to spare half an hour or so to modify examples I have which verify unicode support). Of course, if you use another DBD and can send me info on unicode support I'd love to hear from you. I thought the whole issue was an interesting topic and I had toyed with doing a talk for LPW but to be honest, it is already taking a lot of time and I have personal issues right now (and of course my $work) which mean my time is severely limited so I'm doubtful right now if I could have it ready in time as a talk. I might just post what I have gathered in a weeks time in the hope I get a little more input in the mean time. Martin
Re: Add Unicode Support to the DBI
I've not had time to devote to this thread. Sorry. I'd be grateful if someone could post a summary of it if/when it approaches some kind of consensus. Thanks. Tim.
Re: Add Unicode Support to the DBI
On Oct 2, 2011, at 8:49 PM, Greg Sabino Mullane wrote:

> DEW> I assume you also mean to say that data sent *to* the database
> DEW> has the flag turned off, yes?
>
> No: that is undefined. I don't see it as the DBD's job to massage data going into the database. Or at least, I cannot imagine a DBI interface for that.

Uh, say what? Just as I need to

binmode STDOUT, ':utf8';

Before sending stuff to STDOUT (that is, turn off the flag), I would expect DBDs to do the same before sending data to the database. Unless, of course, it "just works".

> DEW> Yeah, maybe should be utf8_flag instead.
>
> Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.

Yeah, better, though it just perpetuates Perl's unfortunate use of the term "utf8" for "internal string representation." Though I suppose that ship has sunk already.

> Yeah, that last one is the current Postgres plan. Which I think should be best practice and a default DBI expectation.

Agreed.

> DEW> DBDs will decode the data as needed.
> DEW> I don't understand this sentence. If the flag is
> DEW> flipped, why will it decode?
>
> Because it may still need to convert things. See the ODBC discussion.

Oh, so you're saying it will decode and encode between Perl's internal form and UTF-8, rather than just flip the flag on and off?

> GSM>> If this is set off, the utf8 flag will never be set, and no
> GSM>> decoding will be done on data coming back from the database.
>
> DEW> What if the data coming back from the database
> DEW> is Big5 and I want to decode it?
>
> Eh? You just asked above why would we ever decode it?

Yes, because you were only talking about utf8 and UTF-8, not any other encodings. Unless I missed something. If the data coming back from the DB is Big5, I may well want to have some way to decode it (and to encode it for write statements).

> DEW> You mean never allow it to be flipped when the
> DEW> database encoding is SQL_ASCII?
>
> Yes, basically. But perhaps it does not matter too much. SQL_ASCII is such a bad idea anyway, I feel no need to coddle people using it. :)

+1

> MJE> So is the problem that sometimes a DBD does not know what to encode data
> MJE> being sent to the database or how/whether to decode data coming back from
> MJE> the database? and if that is the case do we need some settings in DBI
> MJE> to tell a DBD?
>
> I think that's one of the things that is being argued for, here.

Yes.

> MJE> I think this was my point above, i.e., why utf8? databases accept and
> MJE> supply a number of encodings so why have a flag called utf8? are we
> MJE> going to have ucs2, utf16, utf32 flags as well. Surely, it makes more
> MJE> sense to have a flag where you can set the encoding in the same form
> MJE> Encode uses.
>
> Well, because utf-8 is pretty much a defacto encoding, or at least way, way more popular than things like ucs2. Also, the Perl utf8 flag encourages us to put everything into UTF-8.

Yeah, but again, that might be some reason to call it something else, like "perl_native" or something. The fact that it happens to be UTF-8 should be irrelevant. Er, except, I guess, you still have to know the encoding of the database.

> MJE> and what about when the DBD knows you are wrong because the database
> MJE> says it is returning data in encoding X but you ask for Y.
>
> I would assume that the DBD should attempt to convert it to Y if that is what the user wants.

And throw exceptions as appropriate (encoding/decoding failure).

> MJE> (examples of DBD flags)
>
> Almost all the examples from DBDs seem to be focusing on the SvUTF8 flag, so perhaps we should start by focusing on that, or at least decoupling that entirely from decoding? If we assume that the default DBI behavior, or more specifically the default behavior for a random DBD someone picks up is "flip the flag on if the data is known to be UTF-8", then we can propose a DBI attribute, call it utf8_flag, that has three states:
>
> * 'A': the default, it means the DBD should do the best thing, which in most cases means setting SvUTF8_on if the data coming back is UTF-8.
> * 'B': (on). The DBD should make every effort to set SvUTF8_on for returned data, even if it thinks it may not be UTF-8.
> * 'C': (off). The DBD should not call SvUTF8_on, regardless of what it thinks the data is.

I still prefer an encoding attribute that you can set as follows:

* undef: Default; same as your A.
* ':utf8': Same as your B:
* ':raw': Same as your C
* $encoding: Encode/decode to/from $encoding

> I presume the other half would be an encoding, such that $h->{encoding} would basically ask the DBD to make any returned data into that encoding, by hook or by crook.

With an encoding attribute, you don't need the utf8_flag at all.

Best,

David
Re: Add Unicode Support to the DBI
From: "David E. Wheeler"

GSM>> * $h->{unicode_flag}
GSM>> If this is set on, data returned from the database is assumed to be UTF-8, and
GSM>> the utf8 flag will be set.

DEW> I assume you also mean to say that data sent *to* the database
DEW> has the flag turned off, yes?

No: that is undefined. I don't see it as the DBD's job to massage data going into the database. Or at least, I cannot imagine a DBI interface for that.

From: "Martin J. Evans"

MJE> There is more than one way to encode unicode - not everyone uses
MJE> UTF-8; although some encodings don't support all of unicode.

Right, but I'm talking utf8 here. There are only two things that can be done with the strings returned from a database: flip the utf8 flag, or convert/decode it to something else. If it's anything but utf-8, the utf-8 flag is useless at best, harmful at worst.

DEW> Yeah, maybe should be utf8_flag instead.

Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.

MJE> 4) don't decode the data, the DBD knows it is say UTF-8 encoded
MJE> and simply sets the UTF-8 flag (which from what I read is horribly
MJE> flawed but seems to work for me).

Yeah, that last one is the current Postgres plan. Which I think should be best practice and a default DBI expectation.

DEW> DBDs will decode the data as needed.
DEW> I don't understand this sentence. If the flag is
DEW> flipped, why will it decode?

Because it may still need to convert things. See the ODBC discussion.

GSM>> If this is set off, the utf8 flag will never be set, and no
GSM>> decoding will be done on data coming back from the database.

DEW> What if the data coming back from the database
DEW> is Big5 and I want to decode it?

Eh? You just asked above why would we ever decode it?

DEW> You mean never allow it to be flipped when the
DEW> database encoding is SQL_ASCII?

Yes, basically. But perhaps it does not matter too much. SQL_ASCII is such a bad idea anyway, I feel no need to coddle people using it. :)

MJE> So is the problem that sometimes a DBD does not know what to encode data
MJE> being sent to the database or how/whether to decode data coming back from
MJE> the database? and if that is the case do we need some settings in DBI
MJE> to tell a DBD?

I think that's one of the things that is being argued for, here.

MJE> I think this was my point above, i.e., why utf8? databases accept and
MJE> supply a number of encodings so why have a flag called utf8? are we
MJE> going to have ucs2, utf16, utf32 flags as well. Surely, it makes more
MJE> sense to have a flag where you can set the encoding in the same form
MJE> Encode uses.

Well, because utf-8 is pretty much a defacto encoding, or at least way, way more popular than things like ucs2. Also, the Perl utf8 flag encourages us to put everything into UTF-8.

MJE> and what about when the DBD knows you are wrong because the database
MJE> says it is returning data in encoding X but you ask for Y.

I would assume that the DBD should attempt to convert it to Y if that is what the user wants.

MJE> DBD::Oracle to my knowledge has no special flags; it just attempts to do
MJE> the right thing but it favours speed so most data that is supposed to be
MJE> UTF-8 encoded has SvUTF8_on set but in one case (error messages) it
MJE> properly and strictly decodes the message so long as your Perl is recent
MJE> enough else it uses SvUTF8_on.

I'm not sure I understand this. It takes UTF-8 errors from the database, changes them to something else, and does NOT set SvUTF8?

MJE> (examples of DBD flags)

Almost all the examples from DBDs seem to be focusing on the SvUTF8 flag, so perhaps we should start by focusing on that, or at least decoupling that entirely from decoding? If we assume that the default DBI behavior, or more specifically the default behavior for a random DBD someone picks up is "flip the flag on if the data is known to be UTF-8", then we can propose a DBI attribute, call it utf8_flag, that has three states:

* 'A': the default, it means the DBD should do the best thing, which in most cases means setting SvUTF8_on if the data coming back is UTF-8.
* 'B': (on). The DBD should make every effort to set SvUTF8_on for returned data, even if it thinks it may not be UTF-8.
* 'C': (off). The DBD should not call SvUTF8_on, regardless of what it thinks the data is.

I presume the other half would be an encoding, such that $h->{encoding} would basically ask the DBD to make any returned data into that encoding, by hook or by crook.

--
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201110022345
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
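The distinction drawn above (flip the flag versus convert/decode) can be shown in a few lines of plain Perl, with no DBI involved; Encode::_utf8_on is roughly the pure-Perl equivalent of SvUTF8_on at the XS level. The byte strings here are made up for illustration.

use strict;
use warnings;
use Encode ();

# Octets as a driver might receive them from the wire.
my $good = "caf\xC3\xA9";    # "café" encoded as UTF-8
my $bad  = "caf\xC3";        # truncated multi-byte sequence

# 1. "Flip the flag": fast, but nothing checks the octets really are UTF-8.
my $flipped = $bad;
Encode::_utf8_on($flipped);
print "flag is on, data unchecked\n" if utf8::is_utf8($flipped);

# 2. Strict decode: malformed input is rejected up front.
my $chars = Encode::decode('UTF-8', $good, Encode::FB_CROAK);
printf "good data decodes to %d characters\n", length $chars;

my $fail = eval { Encode::decode('UTF-8', $bad, Encode::FB_CROAK) };
print "strict decode rejected the bad data: $@" if $@;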
Re: Add Unicode Support to the DBI
On Sep 22, 2011, at 11:57 AM, Martin J. Evans wrote:

> ok except what the oracle client libraries accept does not match with Encode accepted strings so someone would have to come up with some sort of mapping between the two.

Yes. That's one of the consequences of providing a single interface to multiple databases.

>>> and what about when it conflicts with your locale/LANG?
>> So what?
> I'm not so sure this is a "So what" as Perl itself uses locale settings in some cases - just thought it needed mentioning for consideration.

I'm not really concerned about locales at this point. I tend to leave collation, for example, up to the database. Right now I'm strictly concerned about encoding.

>>> and what about PERL_UNICODE flags, do they come into this?
>> What are those?
> See http://perldoc.perl.org/perlrun.html
>
> In particular "UTF-8 is the default PerlIO layer for input streams" of which reading data from a database could be considered one?

That'd be cool, but it's not currently implemented that way, obviously. DBI and PerlIO are completely independent AFAIK, and the DBI doesn't look like a file handle.

> ok, I'm thinking through the ramifications of this.
>
> To add to the list I see DBD::SQLite has sqlite_unicode: "strings coming from the database and passed to the collation function will be properly tagged with the utf8 flag; but this only works if the sqlite_unicode attribute is set before the first call to a perl collation sequence" and "The current FTS3 implementation in SQLite is far from complete with respect to utf8 handling: in particular, variable-length characters are not treated correctly by the builtin functions offsets() and snippet()."
>
> and DBD::CSV has
>
> f_encoding => "utf8",
>
> DBD::mysql has mysql_enable_utf8 which apparently "This attribute determines whether DBD::mysql should assume strings stored in the database are utf8. This feature defaults to off."
>
> I could not find any special flags for DBD::DB2.
>
> DBD::Sybase has syb_enable_utf8 "If this attribute is set then DBD::Sybase will convert UNIVARCHAR, UNICHAR, and UNITEXT data to Perl's internal utf-8 encoding when they are retrieved. Updating a unicode column will cause Sybase to convert any incoming data from utf-8 to its internal utf-16 encoding."

Yeah, so I think that can be generalized.

Best,

David
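Seen from the application side, the fragmentation looks roughly like this: one intent ("give me decoded character data"), spelled differently for each driver. The attribute names are the ones quoted above plus pg_enable_utf8 (the pre-existing DBD::Pg knob); the DSNs are placeholders.

use strict;
use warnings;
use DBI;

my %connect_args = (
    SQLite => [ 'dbi:SQLite:dbname=test.db', '', '', { sqlite_unicode    => 1 } ],
    mysql  => [ 'dbi:mysql:database=test',   '', '', { mysql_enable_utf8 => 1 } ],
    Pg     => [ 'dbi:Pg:dbname=test',        '', '', { pg_enable_utf8    => 1 } ],
    CSV    => [ 'dbi:CSV:',                  '', '', { f_encoding        => 'utf8' } ],
);

for my $driver (sort keys %connect_args) {
    my ($dsn, $user, $pass, $attr) = @{ $connect_args{$driver} };
    my $dbh = eval { DBI->connect($dsn, $user, $pass, { RaiseError => 1, %$attr }) }
        or next;
    # ... identical application code, driver-specific unicode semantics ...
    $dbh->disconnect;
}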
Re: Add Unicode Support to the DBI
On 22/09/2011 19:28, David E. Wheeler wrote:

On Sep 22, 2011, at 11:14 AM, Martin J. Evans wrote:

Right. There needs to be a way to tell the DBI what encoding the server sends and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of useless.

I think this was my point above, i.e., why utf8? databases accept and supply a number of encodings so why have a flag called utf8? are we going to have ucs2, utf16, utf32 flags as well. Surely, it makes more sense to have a flag where you can set the encoding in the same form Encode uses.

Yes, I agreed with you. :-)

progress then ;-)

Unless I'm mistaken as to what you refer to I believe that is a feature of the Oracle client libraries and not one of DBD::Oracle so there is little we can do about that.

Sure you can. I set something via the DBI interface and the DBD sets the environment variable for the Oracle client libraries.

ok except what the oracle client libraries accept does not match with Encode accepted strings so someone would have to come up with some sort of mapping between the two. e.g., my current setting is:

NLS_LANG=AMERICAN_AMERICA.AL32UTF8

So to try and move forward, we'd be talking about a flag or flags which say:

1 encode the data sent to the database like "this" (which could be nothing)
2 decode the data retrieved from the database like "this" (which could be nothing but if not nothing it could be using strict or loose for the UTF-8 and utf-8 case)
3 don't decode but use SvUTF8_on (a specific case since Perl uses that internally and a number of databases return UTF-8) one that seems to work but I worry about.
4 do what the DBD thinks is best - whatever the behaviour is now?

Yes.

and what about when it conflicts with your locale/LANG?

So what?

I'm not so sure this is a "So what" as Perl itself uses locale settings in some cases - just thought it needed mentioning for consideration.

and what about PERL_UNICODE flags, do they come into this?

What are those?

See http://perldoc.perl.org/perlrun.html

In particular "UTF-8 is the default PerlIO layer for input streams" of which reading data from a database could be considered one?

and what about when the DBD knows you are wrong because the database says it is returning data in encoding X but you ask for Y.

Throw an exception or a warning.

ok

and for DBD::ODBC built for unicode API am I expected to try and decode UCS2 as x just because the flag tells me to and I know it will not work? Seems like it only applies to the ANSI API in DBD::ODBC where the data could be UTF-8 encoded in a few (possibly broken see http://www.martin-evans.me.uk/node/20#unicode) cases.

If the user does something that makes no sense, tell them it makes no sense. Die if necessary.

I still think it would help to name some specific cases per DBD of flags in use and why they exist:

DBD::ODBC has a odbc_utf8_on flag to say that data returned by the database when using the ANSI APIs is UTF-8 encoded and currently it calls SvUTF8_on on that data (I've never used or verified it works myself but the person supplying the patch said it had a purpose with a particular Postgres based database).

That's what the new DBD::Pg flag that Greg's working on does, too.

Beyond that DBD::ODBC has no other flags as it knows in the unicode/wide APIs the data is UCS2 encoded and it checks it is valid when decoding it. Similarly when sending data to the database in the wide APIs it takes the Perl scalar and encodes it in UCS2.

Yeah, ideally, by default, if the DBD knows the encoding used by the database, it should just DTRT.

There are backward compatibility issues with that for DBD::Pg, though. So there probably should be a knob to say "don't do any encoding or decoding at all", because a lot of older apps likely expect that.

DBD::Oracle to my knowledge has no special flags; it just attempts to do the right thing but it favours speed so most data that is supposed to be UTF-8 encoded has SvUTF8_on set but in one case (error messages) it properly and strictly decodes the message so long as your Perl is recent enough else it uses SvUTF8_on.

So, what are the other flags in use and what purpose do they fulfill.

I think we could really just start with one flag, "encoding". By default the DBD should just try to do the right thing. If "encoding" is set to ":raw" then it should do no encoding or decoding. If it's set to ":utf8" it should just turn the flag on or off. If it's set to an actual encoding it should encode and decode. I think that would be a good start.

Best, David

ok, I'm thinking through the ramifications of this.

To add to the list I see DBD::SQLite has sqlite_unicode: "strings coming from the database and passed to the collation function will be properly tagged with the utf8 flag; but this only works if the sqlite_unicode attribute is set before the first call to a perl collation sequence" and "The current FTS3 implementation in SQLite is far from complete with respect to utf8 handling: in particular, variable-length characters are not treated correctly by the builtin functions offsets() and snippet()."
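A sketch of the NLS_LANG approach mentioned above: the setting has to reach the Oracle client libraries before the first connection is made, which in practice means setting it in the environment (here via a BEGIN block) rather than through any DBI attribute. The DSN and credentials are placeholders.

use strict;
use warnings;

BEGIN { $ENV{NLS_LANG} = 'AMERICAN_AMERICA.AL32UTF8' }   # client character set

use DBI;

my $dbh = DBI->connect('dbi:Oracle:ORCL', 'scott', 'tiger', { RaiseError => 1 });

# With an AL32UTF8 client character set DBD::Oracle treats returned string
# data as UTF-8 (largely via SvUTF8_on, as discussed in this thread).
my ($ok) = $dbh->selectrow_array(q{SELECT 'ok' FROM dual});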
Re: Add Unicode Support to the DBI
On Sep 22, 2011, at 11:14 AM, Martin J. Evans wrote: >> Right. There needs to be a way to tell the DBI what encoding the server >> sends and expects to be sent. If it's not UTF-8, then the utf8_flag option >> is kind of useless. > I think this was my point above, i.e., why utf8? databases accept and supply > a number of encodings so why have a flag called utf8? are we going to have > ucs2, utf16, utf32 flags as well. Surely, it makes more sense to have a flag > where you can set the encoding in the same form Encode uses. Yes, I agreed with you. :-) > Unless I'm mistaken as to what you refer to I believe that is a feature of > the Oracle client libraries and not one of DBD::Oracle so there is little we > can do about that. Sure you can. I set something via the DBI interface and the DBD sets the environment variable for the Oracle client libraries. > So to try and move forward, we'd we talking about a flag or flags which say: > > 1 encode the data sent to the database like "this" (which could be nothing) > 2 decode the data retrieved from the database like "this" (which could be > nothing but if not nothing it could be using strict or loose for the UTF-8 > and utf-8 case) > 3 don't decode but use SvUTF8_on (a specific case since Perl uses that > internally and a number of database return UTF-8) > one that seems to work but I worry about. > 4 do what the DBD thinks is best - whatever the behaviour is now? Yes. > and what about when it conflicts with your locale/LANG? So what? > and what about PERL_UNICODE flags, do they come into this? What are those? > and what about when the DBD knows you are wrong because the database says it > is returning data in encoding X but you ask for Y. Throw an exception or a warning. > and for DBD::ODBC built for unicode API am I expected to try and decode UCS2 > as x just because the flag tells me to and I know it will not work? Seems > like it only applies to the ANSI API in DBD::ODBC where the data could be > UTF-8 encoded in a few (possibly broken see > http://www.martin-evans.me.uk/node/20#unicode) cases. If the user does something that makes no sense, tell them it makes no sense. Die if necessary. > I still think it would help to name some specific cases per DBD of flags in > use and why they exist: > > DBD::ODBC has a odbc_utf8_on flag to say that data returned by the database > when using the ANSI APIs is UTF-8 encoded and currently it calls SvUTF8_on on > that data (I've never used or verified it works myself but the person > supplying the patch said it had a purpose with a particular Postgres based > database). That's what the new DBD::Pg flag that Greg's working on does, too. > Beyond that DBD::ODBC has no other flags as it knows in the unicode/wide APIs > the data is UCS2 encoded and it checks it is valid when decoding it. > Similarly when sending data to the database in the wide APIs it takes the > Perl scalar and encodes it in UCS2. Yeah, ideally, by default, if the DBD knows the encoding used by the database, it should just DTRT. There are backward compatibility issues with that for DBD::Pg, though. So there probably should be a knob to say "don't do any encoding or decoding at all", because a lot of older apps likely expect that. 
> DBD::Oracle to my knowledge has no special flags; it just attempts to do the > right thing but it favours speed so most data that is supposed to be UTF-8 > encoded has SvUTF8_on set but in one case (error messages) it properly and > strictly decodes the message so long as your Perl is recent enough else it > uses SvUTF8_on. > > So, what are the other flags in use and what purpose do they fulfill. I think we could really just start with one flag, "encoding". By default the DBD should just try to do the right thing. If "encoding" is set to ":raw" then it should do no encoding or decoding. If it's set to ":utf8" it should just turn the flag on or off. If it's set to an actual encoding it should encode and decode. I think that would be a good start. Best, David
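To pin those semantics down, here is an application-level emulation of the proposed attribute on a fetched row. A real implementation would live inside each driver's fetch (and bind) paths; the helper and the policy variable are invented names, and ':raw'/':utf8'/encoding-name values follow the proposal above.

use strict;
use warnings;
use Encode ();

# Emulation of the proposed $dbh->{encoding} semantics on one fetched row.
sub apply_encoding_policy {
    my ($policy, $row) = @_;
    return $row unless defined $policy;        # undef: leave it to the driver
    for my $col (@$row) {
        next unless defined $col;
        if ($policy eq ':raw') {
            # leave the octets exactly as the driver produced them
        }
        elsif ($policy eq ':utf8') {
            Encode::_utf8_on($col);            # just flip the flag, no checking
        }
        else {
            $col = Encode::decode($policy, $col, Encode::FB_CROAK);
        }
    }
    return $row;
}

# while (my $row = $sth->fetchrow_arrayref) {
#     apply_encoding_policy($encoding_policy, $row);   # e.g. 'UTF-8' or ':raw'
# }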
Re: Add Unicode Support to the DBI
On 22/09/2011 17:36, David E. Wheeler wrote: On Sep 22, 2011, at 2:26 AM, Martin J. Evans wrote: There is more than one way to encode unicode - not everyone uses UTF-8; although some encodings don't support all of unicode. Yeah, maybe should be utf8_flag instead. see below. unicode is not encoded as UTF-8 in ODBC using the wide APIs. Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes it. Using the ANSI APIs data is returned as octets and is whatever it is - it may be ASCII, it may be UTF-8 encoded (only in 2 cases I know and I believe they are flawed anyway) it may be something else in which case the application needs to know what it is. In the case of octets which are UTF-8 encoded DBD::ODBC has no idea that is the case unless you tell it and it will then set the UTF-8 flag (but see later). Right. There needs to be a way to tell the DBI what encoding the server sends and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of useless. I think this was my point above, i.e., why utf8? databases accept and supply a number of encodings so why have a flag called utf8? are we going to have ucs2, utf16, utf32 flags as well. Surely, it makes more sense to have a flag where you can set the encoding in the same form Encode uses. I'm not that familiar with Postgres (I've used a few times and not to any great degree) and I used MySQL for a while years ago. I occasionally use SQLite. I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling to see the problem that needs fixing. Is it just that some people would like a DBI flag which tells the DBD: 1) decode any data coming back from the database strictly such that if it is invalid you die 2) decode any data coming back from the database loosely (e.g., utf-8 vs UTF-8) 3) don't decode the data from the database at all 4) don't decode the data, the DBD knows it is say UTF-8 encoded and simply sets the UTF-8 flag (which from what I read is horribly flawed but seems to work for me). and the reverse. Yes, with one API for all drivers, if possible, and guidelines for how it should work (when to encode and decode, what to encode and decode, when to just flip the utf8 flag on and off, etc.). ok DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g. error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it does 4 on most of the column data IF Oracle is sending UTF-8. Yeah, but to enable it *you set a bloody environment variable*. WHAT? Unless I'm mistaken as to what you refer to I believe that is a feature of the Oracle client libraries and not one of DBD::Oracle so there is little we can do about that. My point being, doesn't the DBD know how the data is encoded when it gets it from the database? and it would hopefully know what the database needs when sending data. Perhaps in some conditions the DBD does not know this and needs to be told (I could imagine SQLite reading/writing straight to files for instance might want to know to open the file with UTF-8 layer). Or to turn it off, so you can just pass the encoded UTF-8 through to the file without the decode/encode round-trip. So is the problem that sometimes a DBD does not know what to encode data being sent to the database or how/whether to decode data coming back from the database? and if that is the case do we need some settings in DBI to tell a DBD? 
That's an issue, yes, but the main issue is that all the drivers do it differently, sometimes with different semantics, and lack all the functionality one might want (e.g., your examples 1-4). Best, David

So to try and move forward, we'd be talking about a flag or flags which say:

1 encode the data sent to the database like "this" (which could be nothing)
2 decode the data retrieved from the database like "this" (which could be nothing but if not nothing it could be using strict or loose for the UTF-8 and utf-8 case)
3 don't decode but use SvUTF8_on (a specific case since Perl uses that internally and a number of databases return UTF-8) one that seems to work but I worry about.
4 do what the DBD thinks is best - whatever the behaviour is now?

and what about when it conflicts with your locale/LANG?

and what about PERL_UNICODE flags, do they come into this?

and what about when the DBD knows you are wrong because the database says it is returning data in encoding X but you ask for Y.

and for DBD::ODBC built for unicode API am I expected to try and decode UCS2 as x just because the flag tells me to and I know it will not work? Seems like it only applies to the ANSI API in DBD::ODBC where the data could be UTF-8 encoded in a few (possibly broken see http://www.martin-evans.me.uk/node/20#unicode) cases.

I still think it would help to name some specific cases per DBD of flags in use and why they exist:

DBD::ODBC has a odbc_utf8_on flag to say that data returned by the database when using the ANSI APIs is UTF-8 encoded and currently it calls SvUTF8_on on that data (I've never used or verified it works myself but the person supplying the patch said it had a purpose with a particular Postgres based database).
Re: Add Unicode Support to the DBI
On Sep 22, 2011, at 2:26 AM, Martin J. Evans wrote: > There is more than one way to encode unicode - not everyone uses UTF-8; > although some encodings don't support all of unicode. Yeah, maybe should be utf8_flag instead. > unicode is not encoded as UTF-8 in ODBC using the wide APIs. > > Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes > it. Using the ANSI APIs data is returned as octets and is whatever it is - it > may be ASCII, it may be UTF-8 encoded (only in 2 cases I know and I believe > they are flawed anyway) it may be something else in which case the > application needs to know what it is. In the case of octets which are UTF-8 > encoded DBD::ODBC has no idea that is the case unless you tell it and it will > then set the UTF-8 flag (but see later). Right. There needs to be a way to tell the DBI what encoding the server sends and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of useless. > I'm not that familiar with Postgres (I've used a few times and not to any > great degree) and I used MySQL for a while years ago. I occasionally use > SQLite. I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling > to see the problem that needs fixing. Is it just that some people would like > a DBI flag which tells the DBD: > > 1) decode any data coming back from the database strictly such that if it is > invalid you die > 2) decode any data coming back from the database loosely (e.g., utf-8 vs > UTF-8) > 3) don't decode the data from the database at all > 4) don't decode the data, the DBD knows it is say UTF-8 encoded and simply > sets the UTF-8 flag (which from what I read is horribly flawed but seems to > work for me). > > and the reverse. Yes, with one API for all drivers, if possible, and guidelines for how it should work (when to encode and decode, what to encode and decode, when to just flip the utf8 flag on and off, etc.). > DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g. > error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it > does 4 on most of the column data IF Oracle is sending UTF-8. Yeah, but to enable it *you set a bloody environment variable*. WHAT? > My point being, doesn't the DBD know how the data is encoded when it gets it > from the database? and it would hopefully know what the database needs when > sending data. Perhaps in some conditions the DBD does not know this and needs > to be told (I could imagine SQLite reading/writing straight to files for > instance might want to know to open the file with UTF-8 layer). Or to turn it off, so you can just pass the encoded UTF-8 through to the file without the decode/encode round-trip. > So is the problem that sometimes a DBD does not know what to encode data > being sent to the database or how/whether to decode data coming back from the > database? and if that is the case do we need some settings in DBI to tell a > DBD? That's an issue, yes, but the main issue is that all the drivers do it differently, sometimes with different semantics, and lack all the functionality one might want (e.g., your examples 1-4). Best, David
Re: Add Unicode Support to the DBI
David, I forgot to answer your post first and ended up putting most of my comments in a reply to Greg's posting - sorry, it was a long night last night. Some further comments below: On 21/09/11 19:44, David E. Wheeler wrote: On Sep 10, 2011, at 3:08 AM, Martin J. Evans wrote: I'm not sure any change is required to DBI to support unicode. As far as I'm aware unicode already works with DBI if the DBDs do the right thing. Right, but the problem is that, IME, none of them do "the right thing." As I said, I've submitted encoding-related bug reports for every DBD I've used in production code. And they all have different interfaces for tweaking things. So it is these "different interfaces for tweaking things" you'd like to standardise I presume. If you stick to the rule that all data Perl receives must be decoded and all data Perl exports must be encoded it works (ignoring any issues in Perl itself). Er, was there supposed to be a ", then …" statement there? I bow to Tom's experience but I'm still not sure how that applies to DBI so long as the interface between the database and Perl always encodes and decodes then the issues Tom describes are all Perl ones - no? The trouble is that: 1. They don't always encode or decode 2. When they do, the tend to get subtle bits wrong 3. And they all have different interfaces and philosophies for doing so Surely Oracle should return the data encoded as you asked for it and if it did not Oracle is broken. I'd still like to see this case and then we can see if Oracle is broken and if there is a fix for it. Oh I don't doubt that Oracle is broken. In some places DBD::Oracle does sv_utf8_decode(scalar) or SvUTF8_on(scalar) (depending on your Perl) and in some places it just does SvUTF8_on(scalar). I believe the latter is much quicker as the data is not checked. Many people (myself included) are particularly interested in DBD::Oracle being fast and if all the occurrences were changed to decode I'd patch that out in my copy as I know the data I receive is UTF-8 encoded. IME It needs an "assume Oracle is broken" knob. That is, I should have the option to enface encoding and decoding, rather than just flipping SvUTF8. And I think that such an interface should be standardized in the DBI along with detailed information for driver authors how how to "get it right." ok, I get that. See above. I'd like the chance to go with speed and take the consequences rather than go with slower but know incorrect UTF-8 is spotted. And maybe that's the default. But I should be able to tell it to be pedantic when the data is known to be bad (see, for example data from an SQL_ASCII-encoded PostgreSQL database). I thought UTF-8 when used in Perl used the strict definition and utf-8 used Perl's looser definition - see http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8 That's right. So if I want to ensure that I'm getting strict encoding in my database, It needs to encode and decode, not simply flip SvUTF8. Don't DBDs do this now? I know the encoding of the data I receive in DBD::ODBC and decode it when I get it and encode it when I send it and I believe that is what DBD::Oracle does as well. There is one exception in ODBC for drivers which don't truly abide by ODBC spec and send 8 bit data back UTF-8 encoded (see later). There is no single API for configuring this in the DBI, and I argue there should be. 
I've spent a lot of effort getting unicode working in DBD::ODBC (for UNIX and with patches from Alexander Foken for Windows) which is implemented in an awkward fashion in ODBC. I'd like to hear from DBD authors what support they already have and how it is implemented so we can see what ground is already covered and where the problems were. DBD::Pg's approach is currently broken. Greg is working on fixing it, but for compatibility reasons the fix is non-trivial (an the API might be, too). In a perfect world DBD::Pg would just always do the right thing, as the database tells it what encodings to use when you connect (and *all* data is encoded as such, not just certain data types). But the world is not perfect, there's a lot of legacy stuff. Greg, care to add any other details? as I remain unconvinced a problem exists other than incorrectly coded DBDs. I'm happy to collate that information. As a start I'll describe the DBD::ODBC: 1. ODBC has 2 sets of APIs, SQLxxxA (each chr is 8 bits) and SQLxxxW (each chr is 16 bits and UCS-2). This is how Microsoft did it and yes I know that does not support all of unicode but code pages get involved too. 2. You select which API you are using with a macro when you compile your application so you cannot change your mind. You can in theory call SQLxxxA or SQLxxxW functions directly but if you use SQLxxx you get the A or W depending on what the macro is set to. Problem: DBD::ODBC has to built one way or the other. 3. When using the SQLxxxA functions you can still bind columns/parameters as wide
Re: Add Unicode Support to the DBI
On 21/09/11 21:52, Greg Sabino Mullane wrote: -BEGIN PGP SIGNED MESSAGE- Hash: RIPEMD160 ... And maybe that's the default. But I should be able to tell it to be pedantic when the data is known to be bad (see, for example data from an SQL_ASCII-encoded PostgreSQL database). ... DBD::Pg's approach is currently broken. Greg is working on fixing it, but for compatibility reasons the fix is non-trivial (an the API might be, too). In a perfect world DBD::Pg would just always do the right thing, as the database tells it what encodings to use when you connect (and *all* data is encoded as such, not just certain data types). But the world is not perfect, there's a lot of legacy stuff. Greg, care to add any other details? My thinking on this has changed a bit. See the DBD::Pg in git head for a sample, but basically, DBD::Pg is going to: * Flip the flag on if the client_encoding is UTF-8 (and server_encoding is not SQL_ASCII) * Flip if off if not The single switch will be pg_unicode_flag, which will basiccaly override the automatic choice above, just in case you really want your SQL_ASCII byte soup marked as utf8 for some reason, or (more likely), you want your data unmarked as utf8 despite being so. This does rely on PostgreSQL doing the right thing when it comes to encoding/decoding/storing all the encodings, but I'm pretty sure it's doing well in that regard. ... Since nobody has actally defined a specific interface yet, let me throw out a straw man. It may look familiar :) === * $h->{unicode_flag} If this is set on, data returned from the database is assumed to be UTF-8, and the utf8 flag will be set. DBDs will decode the data as needed. There is more than one way to encode unicode - not everyone uses UTF-8; although some encodings don't support all of unicode. unicode is not encoded as UTF-8 in ODBC using the wide APIs. Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes it. Using the ANSI APIs data is returned as octets and is whatever it is - it may be ASCII, it may be UTF-8 encoded (only in 2 cases I know and I believe they are flawed anyway) it may be something else in which case the application needs to know what it is. In the case of octets which are UTF-8 encoded DBD::ODBC has no idea that is the case unless you tell it and it will then set the UTF-8 flag (but see later). If this is set off, the utf8 flag will never be set, and no decoding will be done on data coming back from the database. If this is not set (undefined), the underlying DBD is responsible for doing the correct thing. In other words, the behaviour is undefined. === I don't think this will fit into DBD::Pgs current implementation perfectly, as we wouldn't want people to simply leave $h->{unicode_flag} on, as that would force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever allow that. I'm not that familiar with Postgres (I've used a few times and not to any great degree) and I used MySQL for a while years ago. I occasionally use SQLite. I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling to see the problem that needs fixing. 
Is it just that some people would like a DBI flag which tells the DBD: 1) decode any data coming back from the database strictly such that if it is invalid you die 2) decode any data coming back from the database loosely (e.g., utf-8 vs UTF-8) 3) don't decode the data from the database at all 4) don't decode the data, the DBD knows it is say UTF-8 encoded and simply sets the UTF-8 flag (which from what I read is horribly flawed but seems to work for me). and the reverse. DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g. error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it does 4 on most of the column data IF Oracle is sending UTF-8. DBD::ODBC does nothing via the ANSI APIs unless the odbc_utf8 flag is turned on in which case it does 4 (and it only does this because there is apparently a version of the Postgres ODBC driver out there somewhere that returns UTF-8 encoded data, I've never seen it, I just accepted the patch). DBD::ODBC does 1 if using the wide APIs and it has little choice since no one would want to accept UCS2 and have to decode it all the time. My point being, doesn't the DBD know how the data is encoded when it gets it from the database? and it would hopefully know what the database needs when sending data. Perhaps in some conditions the DBD does not know this and needs to be told (I could imagine SQLite reading/writing straight to files for instance might want to know to open the file with UTF-8 layer). So is the problem that sometimes a DBD does not know what to encode data being sent to the database or how/whether to decode data coming back from the database? and if that is the case do we need some settings in DBI to tell a DBD? Martin -- Martin J. Evans Easysoft Limited http://www.easysoft.com
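What the wide-API path described above amounts to can be shown without ODBC at all: the octets in a wide (SQL_WCHAR) buffer are UCS-2, shown little-endian here, and Encode handles that directly. The sample bytes are U+4E2D U+6587, chosen just for illustration.

use strict;
use warnings;
use Encode ();

# U+4E2D U+6587 as UCS-2LE octets, as a wide ODBC buffer would hold them.
my $wire = "\x2D\x4E\x87\x65";

my $chars = Encode::decode('UCS-2LE', $wire, Encode::FB_CROAK);
printf "%d characters\n", length $chars;    # prints "2 characters"

# And the reverse direction when binding a parameter:
my $octets = Encode::encode('UCS-2LE', $chars, Encode::FB_CROAK);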
Re: Add Unicode Support to the DBI
On Sep 21, 2011, at 1:52 PM, Greg Sabino Mullane wrote: > Since nobody has actally defined a specific interface yet, let me throw out a > straw man. It may look familiar :) > > === > * $h->{unicode_flag} > > If this is set on, data returned from the database is assumed to be UTF-8, > and > the utf8 flag will be set. I assume you also mean to say that data sent *to* the database has the flag turned off, yes? > DBDs will decode the data as needed. I don't understand this sentence. If the flag is flipped, why will it decode? > If this is set off, the utf8 flag will never be set, and no decoding will be > done > on data coming back from the database. What if the data coming back from the database is Big5 and I want to decode it? > If this is not set (undefined), the underlying DBD is responsible for doing > the > correct thing. In other words, the behaviour is undefined. > === > > I don't think this will fit into DBD::Pgs current implementation perfectly, > as > we wouldn't want people to simply leave $h->{unicode_flag} on, as that would > force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever > allow that. You mean never allow it to be flipped when the database encoding is SQL_ASCII? Best, David
Re: Add Unicode Support to the DBI
...

> And maybe that's the default. But I should be able to tell it to be pedantic when the data is known to be bad (see, for example data from an SQL_ASCII-encoded PostgreSQL database).

...

> DBD::Pg's approach is currently broken. Greg is working on fixing it, but for compatibility reasons the fix is non-trivial (and the API might be, too). In a perfect world DBD::Pg would just always do the right thing, as the database tells it what encodings to use when you connect (and *all* data is encoded as such, not just certain data types). But the world is not perfect, there's a lot of legacy stuff.
>
> Greg, care to add any other details?

My thinking on this has changed a bit. See the DBD::Pg in git head for a sample, but basically, DBD::Pg is going to:

* Flip the flag on if the client_encoding is UTF-8 (and server_encoding is not SQL_ASCII)
* Flip it off if not

The single switch will be pg_unicode_flag, which will basically override the automatic choice above, just in case you really want your SQL_ASCII byte soup marked as utf8 for some reason, or (more likely), you want your data unmarked as utf8 despite being so.

This does rely on PostgreSQL doing the right thing when it comes to encoding/decoding/storing all the encodings, but I'm pretty sure it's doing well in that regard.

...

Since nobody has actually defined a specific interface yet, let me throw out a straw man. It may look familiar :)

===
* $h->{unicode_flag}

If this is set on, data returned from the database is assumed to be UTF-8, and the utf8 flag will be set. DBDs will decode the data as needed.

If this is set off, the utf8 flag will never be set, and no decoding will be done on data coming back from the database.

If this is not set (undefined), the underlying DBD is responsible for doing the correct thing. In other words, the behaviour is undefined.
===

I don't think this will fit into DBD::Pg's current implementation perfectly, as we wouldn't want people to simply leave $h->{unicode_flag} on, as that would force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever allow that.

--
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201109211651
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
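The automatic choice described above is easy to check against a live connection. The SHOW queries are standard PostgreSQL; pg_unicode_flag itself was only proposed at this point, so it appears here only in a comment, and the DSN is a placeholder.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=test', '', '', { RaiseError => 1 });

my ($client) = $dbh->selectrow_array('SHOW client_encoding');
my ($server) = $dbh->selectrow_array('SHOW server_encoding');

# The automatic decision sketched above: mark returned strings as utf8 only
# when the client encoding is UTF8 and the server encoding is not SQL_ASCII.
my $flip_utf8_flag = ($client eq 'UTF8' && $server ne 'SQL_ASCII') ? 1 : 0;

# The proposed single switch would simply override that decision:
# $flip_utf8_flag = $dbh->{pg_unicode_flag} if defined $dbh->{pg_unicode_flag};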
Re: Add Unicode Support to the DBI
On Sep 10, 2011, at 3:08 AM, Martin J. Evans wrote: > I'm not sure any change is required to DBI to support unicode. As far as I'm > aware unicode already works with DBI if the DBDs do the right thing. Right, but the problem is that, IME, none of them do "the right thing." As I said, I've submitted encoding-related bug reports for every DBD I've used in production code. And they all have different interfaces for tweaking things. > If you stick to the rule that all data Perl receives must be decoded and all > data Perl exports must be encoded it works (ignoring any issues in Perl > itself). Er, was there supposed to be a ", then …" statement there? > I bow to Tom's experience but I'm still not sure how that applies to DBI so > long as the interface between the database and Perl always encodes and > decodes then the issues Tom describes are all Perl ones - no? The trouble is that: 1. They don't always encode or decode 2. When they do, the tend to get subtle bits wrong 3. And they all have different interfaces and philosophies for doing so > Surely Oracle should return the data encoded as you asked for it and if it > did not Oracle is broken. > I'd still like to see this case and then we can see if Oracle is broken and > if there is a fix for it. Oh I don't doubt that Oracle is broken. > In some places DBD::Oracle does sv_utf8_decode(scalar) or SvUTF8_on(scalar) > (depending on your Perl) and in some places it just does SvUTF8_on(scalar). I > believe the latter is much quicker as the data is not checked. Many people > (myself included) are particularly interested in DBD::Oracle being fast and > if all the occurrences were changed to decode I'd patch that out in my copy > as I know the data I receive is UTF-8 encoded. IME It needs an "assume Oracle is broken" knob. That is, I should have the option to enface encoding and decoding, rather than just flipping SvUTF8. And I think that such an interface should be standardized in the DBI along with detailed information for driver authors how how to "get it right." > See above. I'd like the chance to go with speed and take the consequences > rather than go with slower but know incorrect UTF-8 is spotted. And maybe that's the default. But I should be able to tell it to be pedantic when the data is known to be bad (see, for example data from an SQL_ASCII-encoded PostgreSQL database). > I thought UTF-8 when used in Perl used the strict definition and utf-8 used > Perl's looser definition - see > http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8 That's right. So if I want to ensure that I'm getting strict encoding in my database, It needs to encode and decode, not simply flip SvUTF8. > Don't DBDs do this now? I know the encoding of the data I receive in > DBD::ODBC and decode it when I get it and encode it when I send it and I > believe that is what DBD::Oracle does as well. There is one exception in ODBC > for drivers which don't truly abide by ODBC spec and send 8 bit data back > UTF-8 encoded (see later). There is no single API for configuring this in the DBI, and I argue there should be. > I've spent a lot of effort getting unicode working in DBD::ODBC (for UNIX and > with patches from Alexander Foken for Windows) which is implemented in an > awkward fashion in ODBC. I'd like to hear from DBD authors what support they > already have and how it is implemented so we can see what ground is already > covered and where the problems were. DBD::Pg's approach is currently broken. 
Greg is working on fixing it, but for compatibility reasons the fix is non-trivial (and the API might be, too). In a perfect world DBD::Pg would just always do the right thing, as the database tells it what encodings to use when you connect (and *all* data is encoded as such, not just certain data types). But the world is not perfect; there's a lot of legacy stuff.

Greg, care to add any other details?

> as I remain unconvinced a problem exists other than incorrectly coded DBDs.
> I'm happy to collate that information. As a start I'll describe DBD::ODBC:
>
> 1. ODBC has 2 sets of APIs, SQLxxxA (each char is 8 bits) and SQLxxxW (each
> char is 16 bits and UCS-2). This is how Microsoft did it, and yes, I know
> that does not support all of unicode, but code pages get involved too.
>
> 2. You select which API you are using with a macro when you compile your
> application, so you cannot change your mind. You can in theory call SQLxxxA
> or SQLxxxW functions directly, but if you use SQLxxx you get the A or W
> depending on what the macro is set to. Problem: DBD::ODBC has to be built
> one way or the other.
>
> 3. When using the SQLxxxA functions you can still bind columns/parameters
> as wide characters, but the ODBC driver needs to support this.
>
> 4. When using SQLxxxW functions all strings are expected in UCS-2. You can
> bind columns and parameters as whatever type you like but obviously if you
> bind a unicode
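The strict-versus-lax distinction being argued above is easy to demonstrate at the Perl level. A small sketch follows; the byte string is invented, and Encode::_utf8_on is used only as the pure-Perl stand-in for the XS-level SvUTF8_on under discussion:

--8<---
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

my $bytes = "caf\xC3\xA9 \xFF";   # valid UTF-8 for "café", then a stray 0xFF

# What SvUTF8_on amounts to: flip the flag with no validation. The scalar
# now *claims* to hold well-formed utf8, stray byte and all, and will
# misbehave later, far from the fetch that produced it.
my $flipped = $bytes;
Encode::_utf8_on($flipped);

# Strict decoding catches the bad byte at the boundary instead.
my $checked = eval { decode('UTF-8', $bytes, FB_CROAK) };
warn "bad data from the database: $@" unless defined $checked;
--8<---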
Re: Add Unicode Support to the DBI
On Sep 10, 2011, at 7:44 AM, Lyle wrote:

>> Right now 5.8 is the required minimum for DBI: should we consider bumping
>> this?
>
> I know a lot of servers in the wild are still running RHEL5 and its
> variants, which are stuck on 5.8 in the standard package management. The
> new RHEL6 only has 5.10... So at this time the impact of such a change
> could be significant.

Yes, which is why we can't just impose a solution on people.

Best,

David
Re: Add Unicode Support to the DBI
DBI peeps,

Sorry for the delayed response; I've been busy. Looking to reply to this thread now.

On Sep 9, 2011, at 8:06 PM, Greg Sabino Mullane wrote:

> One thing I see bandied about a lot is that Perl 5.14 is highly preferred.
> However, it's not clear exactly what the gains are and how bad 5.12 is
> compared to 5.14, how bad 5.10 is, how bad 5.8 is, etc. Right now 5.8 is
> the required minimum for DBI: should we consider bumping this? I know TC
> would be horrified to see us attempting to talk about Unicode support
> with a 5.8.1 requirement, but how much of that will affect database
> drivers? I have no idea myself.

I think I'd just follow TC's recommendations here. DBI should stay compatible as far back as is reasonable without unduly affecting further development and improvement (not that there's much of that right now). So if proper encoding is important to you, use at least 5.12 and prefer 5.14. And if proper encoding is not important to you, well, it is, you just don't know it yet.

> Another aspect to think about that came up during some offline DBD::Pg
> talks was the need to support legacy scripts and legacy data. While the
> *correct* thing is to blaze forward and Do Things Correctly everywhere,
> I think we at least need some prominent knobs so that we can maintain
> backwards compatibility for existing scripts that expect a bunch of
> Latin1, or need the data to come back in the current, undecoded,
> un-utf8-flagged way.

Agreed. I suspect the existing behavior should remain the default, with a knob to make it "do things correctly," and perhaps a deprecation plan to turn the "correctly" knob on by default in a year or so.

Best,

David
Re: Add Unicode Support to the DBI
On 10/09/2011 04:06, Greg Sabino Mullane wrote:

> Right now 5.8 is the required minimum for DBI: should we consider bumping
> this?

I know a lot of servers in the wild are still running RHEL5 and its variants, which are stuck on 5.8 in the standard package management. The new RHEL6 only has 5.10... So at this time the impact of such a change could be significant.

Lyle
Re: Add Unicode Support to the DBI
On 10/09/2011 03:52, David E. Wheeler wrote:

> DBIers,
>
> tl;dr: I think it's time to add proper Unicode support to the DBI. What do
> you think it should look like?

I'm not sure any change is required to DBI to support unicode. As far as I'm aware unicode already works with DBI if the DBDs do the right thing. If you stick to the rule that all data Perl receives must be decoded and all data Perl exports must be encoded, it works (ignoring any issues in Perl itself).

> Background
>
> I've brought this up a time or two in the past, but a number of things have
> happened lately to make me think that it was again time:
>
> First, on the DBD::Pg list, we've been having a discussion about improving
> the DBD::Pg encoding interface.
>
> http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html
>
> That design discussion followed on the extended discussion in this bug
> report:
>
> https://rt.cpan.org/Ticket/Display.html?id=40199
>
> Seems that the pg_enable_utf8 flag that's been in DBD::Pg for a long time
> is rather broken in a few ways. Notably, PostgreSQL sends *all* data back
> to clients in a single encoding -- even binary data (which is usually
> hex-encoded). So it made no sense to only decode certain columns. How to go
> about fixing it, though, and adding a useful interface, has proven a bit
> tricky.
>
> Then there was Tom Christiansen's StackOverflow comment:
>
> stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129

I bow to Tom's experience but I'm still not sure how that applies to DBI. So long as the interface between the database and Perl always encodes and decodes, the issues Tom describes are all Perl ones - no?

> This made me realize that Unicode handling is much trickier than I ever
> realized. But it also emphasized for me how important it is to do
> everything one can to do Unicode right.
>
> Tom followed up with a *lot* more detail in three OSCON presentations this
> year, all of which you can read here:
>
> http://98.245.80.27/tcpc/OSCON2011/index.html
>
> (You're likely gonna want to install the fonts linked at the bottom of that
> page before you read the presentations in HTML.)
>
> And finally, I ran into an issue recently with Oracle, where we have an
> Oracle database that should have only UTF-8 data but some row values are
> actually in other encodings. This was a problem because I told DBD::Oracle
> that the encoding was Unicode, and it just blindly turned on the Perl utf8
> flag. So I got broken data back from the database and then my app crashed
> when I tried to act on a string with the utf8 flag on but containing
> non-unicode bytes. I reported this issue in a DBD::Oracle bug report:
>
> https://rt.cpan.org/Public/Bug/Display.html?id=70819

Surely Oracle should return the data encoded as you asked for it, and if it did not, Oracle is broken. I'd still like to see this case and then we can see if Oracle is broken and if there is a fix for it.

In some places DBD::Oracle does sv_utf8_decode(scalar) or SvUTF8_on(scalar) (depending on your Perl) and in some places it just does SvUTF8_on(scalar). I believe the latter is much quicker as the data is not checked. Many people (myself included) are particularly interested in DBD::Oracle being fast, and if all the occurrences were changed to decode I'd patch that out in my copy, as I know the data I receive is UTF-8 encoded.

> But all this together leads me to believe that it's time to examine adding
> explicit Unicode support to the DBI. But it needs to be designed as
> carefully as possible to account for a few key points:
>
> * The API must be as straightforward as possible without sacrificing
> necessary flexibility. I think it should mostly stay out of users' way and
> have reasonable defaults. But it should be clear what each knob we offer
> does and how it affects things. Side-effects should be avoided.
>
> * Ability to enforce the correctness of encoding and decoding must be given
> priority. Perl has pretty specific ideas about what is and is not Unicode,
> so we should respect that as much as possible. If that means encoding and
> decoding rather than just flipping the utf8 bit, then fine.

See above. I'd like the chance to go with speed and take the consequences, rather than go slower but know incorrect UTF-8 is spotted.

> * The performance impact must be kept as minimal as possible. So if we can
> get away with just flipping the UTF-8 bit on and off, it should be so. I'm
> not entirely clear on that, though, since Perl's internal representation,
> called "utf8", is not the same thing as UTF-8. But if there's an efficient
> way to convert between the two, then it should be adopted. For other
> encodings, obviously a full encode/decode path must be followed.

I thought UTF-8 when used in Perl used the strict definition and utf-8 used Perl's looser definition - see http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8

> * Drivers must be able to adopt the API in a straight-forward way. That is
> to say, we need to
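For reference, the decode-on-input / encode-on-output rule Martin opens with looks like this in application code. Everything here is illustrative: the driver name, table, and credentials are made up, and the driver is assumed to do no encoding work of its own:

--8<---
use strict;
use warnings;
use DBI;
use Encode qw(encode decode);

my $dbh = DBI->connect('dbi:SomeDriver:testdb', 'user', 'pass',
                       { RaiseError => 1 });

# Encode on the way out: Perl characters -> UTF-8 bytes.
my $name = "Zo\x{eb}";    # "Zoë" as a Perl character string
$dbh->do('INSERT INTO people (name) VALUES (?)', undef,
         encode('UTF-8', $name));

# Decode on the way in: UTF-8 bytes -> Perl characters.
my ($raw) = $dbh->selectrow_array(
    'SELECT name FROM people WHERE name = ?', undef,
    encode('UTF-8', $name));
my $person = decode('UTF-8', $raw);
--8<---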
Re: Add Unicode Support to the DBI
On Sat, 10 Sep 2011 03:06:49 -0000, "Greg Sabino Mullane" wrote:

> One thing I see bandied about a lot is that Perl 5.14 is highly preferred.
> However, it's not clear exactly what the gains are and how bad 5.12 is
> compared to 5.14, how bad 5.10 is, how bad 5.8 is, etc. Right now 5.8 is
> the required minimum for DBI: should we consider bumping this? I know TC
> would be horrified to see us attempting to talk about Unicode support
> with a 5.8.1 requirement, but how much of that will affect database
> drivers? I have no idea myself.

Unicode 6.0 and Unicode improvements in general are *THE* reason for me (our company) to plan a 5.10.1 -> 5.14.2 update. I use Unicode a lot, and we require 5.8.4 as an absolute minimum when dealing with Unicode. 5.8.1 is not good enough.

> Another aspect to think about that came up during some offline DBD::Pg
> talks was the need to support legacy scripts and legacy data. While the
> *correct* thing is to blaze forward and Do Things Correctly everywhere,
> I think we at least need some prominent knobs so that we can maintain
> backwards compatibility for existing scripts that expect a bunch of
> Latin1, or need the data to come back in the current, undecoded,
> un-utf8-flagged way.

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00,
11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3.
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Re: Add Unicode Support to the DBI
Another wrinkle to this is the fact that identifiers in the database, such as column names, are also character data, and have an encoding. So for any DBMS that supports Unicode identifiers (as I believe a complete one should, even if they have to be quoted in SQL) or identifiers with trans-ASCII characters, we have to account for those too, making sure that Perl-side code correctly matches (or doesn't match) those identifiers, and so on.

-- Darren Duncan
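The same boundary problem does show up in metadata. A sketch of where an undecoded column name would bite; the driver, table, and column name here are hypothetical:

--8<---
use strict;
use warnings;
use DBI;
use Encode qw(decode);

# Hypothetical: a driver that returns metadata as raw bytes, and a table
# created with a non-ASCII column name, e.g.
#   CREATE TABLE menu ("prix_fixé" NUMERIC)
my $dbh = DBI->connect('dbi:SomeDriver:testdb', 'user', 'pass',
                       { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT * FROM menu');
$sth->execute;

# $sth->{NAME} holds the returned column names. If the driver hands them
# back undecoded, this compares UTF-8 bytes to Perl characters, and the
# match silently fails. (If the driver *did* decode, the second decode
# would be the wrong thing -- which is exactly the undefined-behaviour
# problem being discussed.)
my ($col)  = @{ $sth->{NAME} };
my $wanted = "prix_fix\x{e9}";
print "bytes matched\n"   if $col eq $wanted;                  # likely fails
print "decoded matched\n" if decode('UTF-8', $col) eq $wanted; # boundary fix
--8<---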
Re: Add Unicode Support to the DBI
One thing I see bandied about a lot is that Perl 5.14 is highly preferred. However, it's not clear exactly what the gains are and how bad 5.12 is compared to 5.14, how bad 5.10 is, how bad 5.8 is, etc. Right now 5.8 is the required minimum for DBI: should we consider bumping this? I know TC would be horrified to see us attempting to talk about Unicode support with a 5.8.1 requirement, but how much of that will affect database drivers? I have no idea myself.

Another aspect to think about that came up during some offline DBD::Pg talks was the need to support legacy scripts and legacy data. While the *correct* thing is to blaze forward and Do Things Correctly everywhere, I think we at least need some prominent knobs so that we can maintain backwards compatibility for existing scripts that expect a bunch of Latin1, or need the data to come back in the current, undecoded, un-utf8-flagged way.

--
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
Add Unicode Support to the DBI
DBIers,

tl;dr: I think it's time to add proper Unicode support to the DBI. What do you think it should look like?

Background

I've brought this up a time or two in the past, but a number of things have happened lately to make me think that it was again time:

First, on the DBD::Pg list, we've been having a discussion about improving the DBD::Pg encoding interface.

http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html

That design discussion followed on the extended discussion in this bug report:

https://rt.cpan.org/Ticket/Display.html?id=40199

Seems that the pg_enable_utf8 flag that's been in DBD::Pg for a long time is rather broken in a few ways. Notably, PostgreSQL sends *all* data back to clients in a single encoding -- even binary data (which is usually hex-encoded). So it made no sense to only decode certain columns. How to go about fixing it, though, and adding a useful interface, has proven a bit tricky.

Then there was Tom Christiansen's StackOverflow comment:

stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129

This made me realize that Unicode handling is much trickier than I ever realized. But it also emphasized for me how important it is to do everything one can to do Unicode right. Tom followed up with a *lot* more detail in three OSCON presentations this year, all of which you can read here:

http://98.245.80.27/tcpc/OSCON2011/index.html

(You're likely gonna want to install the fonts linked at the bottom of that page before you read the presentations in HTML.)

And finally, I ran into an issue recently with Oracle, where we have an Oracle database that should have only UTF-8 data but some row values are actually in other encodings. This was a problem because I told DBD::Oracle that the encoding was Unicode, and it just blindly turned on the Perl utf8 flag. So I got broken data back from the database, and then my app crashed when I tried to act on a string with the utf8 flag on but containing non-unicode bytes. I reported this issue in a DBD::Oracle bug report:

https://rt.cpan.org/Public/Bug/Display.html?id=70819

But all this together leads me to believe that it's time to examine adding explicit Unicode support to the DBI. But it needs to be designed as carefully as possible to account for a few key points:

* The API must be as straightforward as possible without sacrificing necessary flexibility. I think it should mostly stay out of users' way and have reasonable defaults. But it should be clear what each knob we offer does and how it affects things. Side-effects should be avoided.

* Ability to enforce the correctness of encoding and decoding must be given priority. Perl has pretty specific ideas about what is and is not Unicode, so we should respect that as much as possible. If that means encoding and decoding rather than just flipping the utf8 bit, then fine.

* The performance impact must be kept as minimal as possible. So if we can get away with just flipping the UTF-8 bit on and off, it should be so. I'm not entirely clear on that, though, since Perl's internal representation, called "utf8", is not the same thing as UTF-8. But if there's an efficient way to convert between the two, then it should be adopted. For other encodings, obviously a full encode/decode path must be followed.

* Drivers must be able to adopt the API in a straight-forward way. That is to say, we need to make sure that the interface covers what most (all?) drivers need. Some, like DBD::Pg, can specify that only one encoding come back from the database. Maybe others (DBD::mysql) can have individual columns in different encodings? It needs to cover that case, too.

* It must be able to give the drivers some flexibility. Where we can't account for everything that all drivers need forever, we should make it possible for them to add what they need without changing the overall API or the meaning of the interfaces provided by the DBI.

I'm not at all clear what such an API should look like. Based on my extensive experience with DBD::Pg, a fair amount of experience with DBD::SQLite, and limited experience with DBD::Oracle and DBD::mysql, I'd say it'd be useful to have at least these knobs (a usage sketch follows the list):

1. An attribute indicating the database encoding. This is the encoding one expects all data coming from the database to be in. When this is set, the DBI or the driver would decode incoming data to Perl's internal format and encode data sent to the database.

2. A fourth param to bind_param() to indicate the encoding in which to send column data to the database. Defaults to the database encoding.

3. A new parameter to prepare() to indicate the encodings of specific columns to be selected.

4. An ENCODING attribute on statement handles that indicates the encoding of each column.

This is just a preliminary proposal, but covers most of the basics, I think. (I'm sure I'm suggesting the wrong places for some things.) It does as
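For concreteness, here is what using the four proposed knobs might look like. None of these attributes or parameters exist in DBI today; every name below (dbi_encoding, ColumnEncodings, the fourth bind_param argument, ENCODING) is a straw-man assumption drawn from the list above:

--8<---
use strict;
use warnings;
use DBI;

# 1. A handle-level attribute naming the database encoding: incoming
#    data is decoded from it, outgoing data encoded to it.
my $dbh = DBI->connect('dbi:SomeDriver:testdb', 'user', 'pass', {
    RaiseError   => 1,
    dbi_encoding => 'UTF-8',      # proposed knob 1 (hypothetical name)
});

# 3. A prepare() attribute giving per-column encodings for the select.
my $sth = $dbh->prepare(
    'SELECT name, blob FROM stuff WHERE id = ?',
    { ColumnEncodings => ['UTF-8', undef] },   # undef = raw bytes
);

# 2. A fourth bind_param() argument naming this parameter's encoding,
#    defaulting to the database encoding. (Real DBI bind_param takes
#    only three arguments; the fourth is the proposal.)
$sth->bind_param(1, 42, undef, 'UTF-8');

$sth->execute;

# 4. A statement-handle ENCODING attribute reporting each column's
#    encoding, parallel to $sth->{NAME}.
my @encodings = @{ $sth->{ENCODING} };
--8<---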