Re: DBD::ODBC tricky unicode problem

2013-10-28 Thread Tim Bunce
On Sun, Oct 27, 2013 at 05:45:50PM +, Martin J. Evans wrote:
 On 27/10/2013 14:49, Tim Bunce wrote:
 
 So, the next question is what are the implications of fixing it for
 existing applications? Do you need a deprecation cycle with warnings etc?

Might be worth nailing that down before moving on to output issues.

 Now we've got to this point we /can/ consider what happens when you
 read the data back. By default, in a unicode build of DBD::ODBC ALL
 string data is bound as SQL_WCHARs and I'm of the opinion that this
 is right and should continue (apart from anything else the only
 alternative is to bind as SQL_CHAR when the column is char/varchar
 and this doesn't help at all as the server will have to map server
 codepage chrs to client codepage chrs and this cannot always work).
 So let's consider what happens for each example:
 
 Inserting a unicode euro, utf8 flag on:

I wouldn't involve input issues when discussing output issues.
Input and output are separate issues best kept separate. Otherwise
there are too many variables and distractions. I'd suggest
constructing the string on the server using suitable SQL statements
and checking that that string is received by the app.
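For example, something along these lines (a rough, untested sketch: it
assumes MS SQL Server's NCHAR() function, DBI's data_string_desc() from
the :utils tag, and an already-connected $dbh) takes the input path out
of the picture entirely:

  use DBI qw(:utils);   # for data_string_desc()

  # the server builds the euro itself, so only the output path is exercised
  my ($str) = $dbh->selectrow_array(q/SELECT NCHAR(8364)/);   # 8364 == U+20AC
  print "fetched: ", data_string_desc($str), "\n";
  printf "ord: %x\n", ord($str);    # expect 20ac if output is working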

   input string: €
   data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
   ords of input string: 20ac,
   bytes of input string: e2,82,ac,
   database length: 3

database length is ambiguous, best to clarify as database character length
(or database octet length).
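In the SQL Server case you can report both from the server, e.g. (sketch
only; test_varchar and s are made-up table and column names):

  # character length vs octet length as the server sees them
  my ($chars, $octets) = $dbh->selectrow_array(
      q/SELECT LEN(s), DATALENGTH(s) FROM test_varchar/
  );
  print "server char length: $chars, server octet length: $octets\n";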

   data_string_desc of output string: UTF8 on, non-ASCII, 3
 characters 7 bytes
   ords of output string:e2,201a,ac,
 
 So, here you didn't get back what you put in and the database didn't
 see the inserted data as a single euro chr but as 3 individual chrs
 in the client codepage which were mapped to the column codepage. When
 read back they were mapped from the column code page to unicode because
 the select bound chrs as SQL_WCHAR. I'm struggling to see the value to
 anyone of this behaviour but I'd love to hear from you if you don't
 agree.

Forget the input, focus on what the server had. The key question for
output is: did the application get the same characters as the server?
In this case the server had 3 chars and when they reached the app
one of the chars was corrupted. It's broken, but with this test you
can't be sure where the breakage is (input or output) because you've not
reported what the server thought the three characters were.
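Something like this (again only a sketch, using SQL Server's UNICODE()
and SUBSTRING() with made-up table/column names) would report what the
server thinks each stored character is:

  my ($len) = $dbh->selectrow_array(q/SELECT LEN(s) FROM test_varchar/);
  for my $pos (1 .. $len) {
      # code point of the character at position $pos, as the server sees it
      my ($cp) = $dbh->selectrow_array(
          q/SELECT UNICODE(SUBSTRING(s, ?, 1)) FROM test_varchar/, undef, $pos
      );
      printf "char %d on the server: U+%04X\n", $pos, $cp;
  }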


   database length: 3
   data_string_desc of output string: UTF8 on, non-ASCII, 3
 characters 7 bytes
   ords of output string:e2,201a,ac,
 
 This is just the same as the first example and as such I cannot see
 the usefulness of it.

It's the same server state (3 chars) so the same output state.
Like I said, forget the input when talking about the output mechanism.

[Sorry to keep banging the same drum!]

 As it turns out I was about to release a 1.45 official release at
 the end of the 1.44 development cycle. I plan to release this any
 day now before considering any of this.

Perhaps add a warning message to the README and docs.

 My suggestion is that at the start of the 1.46_xx dev cycle:
 
 o I apply the switch as described, advertise it widely, and hound
 everyone I know to try it out. As I see the current behaviour as
 broken, and badly so, I just cannot see right now how anyone could have
 used it as it is and be adversely affected when it is fixed.
 
 o add test cases for round tripping of unicode data to varchars (of
 which there are none right now, as they all go to nvarchar, which
 already works as discussed).

round tripping will just work _if_ input works and output works.
But testing _only_ round tripping won't prove that the server has the
correct interpretation of what's been sent. It's prone to false
positives. Also failures won't tell you if the failure was due to the
input or output. So I suggest focusing on testing input and output
separately.
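Roughly this shape (a Test::More sketch, not tested, with made-up
table/column names), where neither test depends on the other direction
working:

  use Test::More;

  # input test: send one euro as a parameter, then ask the server what it stored
  $dbh->do(q/INSERT INTO test_varchar (s) VALUES (?)/, undef, "\x{20ac}");
  my ($srv_len) = $dbh->selectrow_array(q/SELECT LEN(s) FROM test_varchar/);
  is($srv_len, 1, 'server sees a single character');

  # output test: the server builds the euro itself, the app only fetches it
  my ($out) = $dbh->selectrow_array(q/SELECT NCHAR(8364)/);   # 8364 == U+20AC
  is($out, "\x{20ac}", 'app receives a single euro character');

  done_testing();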

 So, the only remaining issues are:
 
 1) my belief that binding output data as unicode/wide_chars always
 in the unicode build is the right one. If you see any problem in
 that it could turn things up-side-down but it would be useful to
 know.

What about people wanting to fetch binary data, like images?
Presumably binding output data as unicode/wide_chars always
depends on the data type on the server: varchar/nvarchar/blob/clob.
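e.g. (a sketch; test_blob/img are invented names, I'm assuming a
varbinary(max) column, and $image_bytes stands for the raw bytes)
something like this has to keep working without any character
conversion in either direction:

  use DBI qw(:sql_types);

  # force a binary bind on insert so nothing treats the bytes as characters
  my $sth = $dbh->prepare(q/INSERT INTO test_blob (img) VALUES (?)/);
  $sth->bind_param(1, $image_bytes, { TYPE => SQL_LONGVARBINARY });
  $sth->execute;

  # on fetch the column should be described as binary and left alone,
  # i.e. not bound as SQL_WCHAR
  $dbh->{LongReadLen} = 10 * 1024 * 1024;   # long columns need this in DBI
  my ($img) = $dbh->selectrow_array(q/SELECT img FROM test_blob/);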

 2) if someone specifies a bind type on bind_param I follow that no
 matter what?

They'll get what they ask for :)

 3) SQLDescribeParam is not always supported (e.g. freeTDS) and even
 if it is, it can fail (because rearranging the SQL into a select
 which most drivers do for SQLDescribeParam can fail). In this case
 D:O has a fallback binding which you can override. I suggest that
 unless it is overridden D:O looks at the input param and binds it as
 unicode if it is unicode (SvUTF8), otherwise it binds it as
 SQL_CHAR. In other words, the result from SQLDescribeParam call
 becomes irrelevant for char data as we look at the 

Re: DBD::ODBC tricky unicode problem

2013-10-27 Thread Martin J. Evans
I've snipped my original email and Tim's response and started again 
based on Tim's comments in the email and on IRC. So:


o this concentrates on input to the database and ignores output.
o only looks so far at inserting into varchars (and not nvarchars which 
I currently believe works but we'll come back to that).
o includes examples of different inputs, lengths in perl and what you 
get back.
o test code and all discussion is windows only for now as ODBC drivers 
on Unix can work in very different ways. Also, the test code was run 
against MS SQL Server, whose driver is in my experience one of the best 
ODBC drivers for Windows.
o you cannot correctly translate chr data from a client to a server 
using the SQL Server ODBC driver if the client code page differs from 
the server code page - see http://support.microsoft.com/kb/234748.
o MS quite clearly state "If your server must store data from multiple 
code pages, the supported solution is to store the data in Unicode 
columns (NCHAR/NVARCHAR/NTEXT)." See 
http://support.microsoft.com/kb/234748 again.

o test code included at end of mail

A few things you should know:

1. D:O does not know what codepage you are in and there is nothing in 
ODBC which allows it to find that out or even says that it matters.
2. I believe the TDS protocol (what MS SQL Server uses) allows a client 
code page to be sent; however, I believe that is irrelevant now as the 
SQL Server driver converts your client code page chrs into unicode 
before sending them to SQL Server.
3. You cannot expect to get back the chrs you inserted into char/varchar 
columns if AutoTranslate is not turned on in the driver (it defaults to 
on) - there is a connect sketch after this list.
4. D:O converts Perl unicode scalars to wide characters (UTF-16) on 
Windows using MultiByteToWideChar with CP_UTF8.
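For reference, a connect sketch (DSN and credentials are placeholders;
AutoTranslate is the SQL Server ODBC driver keyword mentioned in 3, and
odbc_has_unicode reports whether this is a unicode build):

  use DBI;

  my $dbh = DBI->connect(
      'dbi:ODBC:DSN=asus2;AutoTranslate=yes',   # yes is the default anyway
      $user, $pass,
      { RaiseError => 1 },
  );
  print "DBD::ODBC build for unicode: ", $dbh->{odbc_has_unicode}, "\n";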


OK, so diving into the example, here is hopefully fairly self-explanatory 
output from the test code (code at end of email). I've interleaved my 
comments starting with #.


Current active console code page: 1252
DBD::ODBC build for unicode:1
Output connstr: DSN=asus2;UID=sa;PWD=easysoft;APP=Strawberry Perl 
(64-bit);WSID=ASUS2;Network=DBMSSOCN;Address=ASUS2\SQLEXPRESS,1433


Inserting a unicode euro, utf8 flag on:
  input string: €
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 
bytes

  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 
bytes

  ords of output string:e2,201a,ac,
# what happened here is SQLDescribeParam described the parameter as 
SQL_CHAR and that is what D:O bound it as. The driver converted 82 in 
code page 1252 to 201a (a magic quote) and the database sees this as 3 chrs.
# the probable mistake here is that D:O should have looked at the perl 
data, seen it was unicode and not bound it as SQL_CHAR (but SQL_WCHAR) 
even though SQLDescribeParam said it was


Inserting a UTF-8 encoded unicode euro, utf8 flag off:
\x{0082} does not map to cp1252 at test_code.pl line 36.
# the above resulted from a print to stdout in windows-1252
# you'd expect that since we were outputting bytes to the terminal
  input string: â\x{0082}¬
  data_string_desc of input string: UTF8 off, non-ASCII, 3 characters 3 
bytes

  ords of input string: e2,82,ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 
bytes

  ords of output string:e2,201a,ac,
# same thing as first case happened here except D:O couldn't have done 
anything else as Perl said the scalar was not unicode and SQL Server 
said the parameter was SQL_CHAR


Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
  input string: €
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 
bytes

  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 1
  data_string_desc of output string: UTF8 on, non-ASCII, 1 characters 3 
bytes

  ords of output string:20ac,
# the above is arguably what should have happened in the first test case 
i.e., D:O should have bound as SQL_WVARCHAR because the perl data was 
unicode. I included this one to show you what happens if you do it right.
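The forced binding above amounts to something like this (a sketch; the 
table/column names are placeholders):

  use DBI qw(:sql_types);

  my $sth = $dbh->prepare(q/INSERT INTO test_varchar (s) VALUES (?)/);
  # override whatever SQLDescribeParam said and bind as a wide (unicode) type
  $sth->bind_param(1, "\x{20ac}", { TYPE => SQL_WVARCHAR });
  $sth->execute;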


Inserting a unicode U+187 which is not in the current code page:
\x{0187} does not map to cp1252 at test_code.pl line 36.
# you'd expect this warning here on printing to stdout
  input string: \x{0187}
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 2 
bytes

  ords of input string: 187,
  bytes of input string: c6,87,
  database length: 2
  data_string_desc of output string: UTF8 on, non-ASCII, 2 characters 5 
bytes

  ords of output string:c6,2021,
# here again, down to D:O binding as SQL_CHAR, SQL Server interpreted 
this as 2 chrs in the client code page, and converted them to chr c6 and 
a double dagger


Inserting a unicode U+187 which is not in the current code page with 
forced binding:

\x{0187} does not map to cp1252 at test_code.pl line 36.
  input string: \x{0187}
  

Re: DBD::ODBC tricky unicode problem

2013-10-27 Thread Tim Bunce
On Sun, Oct 27, 2013 at 12:18:53PM +, Martin J. Evans wrote:
 
 Inserting a unicode euro, utf8 flag on:
   input string: €
   data_string_desc of input string: UTF8 on, non-ASCII, 1 characters
 3 bytes
   ords of input string: 20ac,
   bytes of input string: e2,82,ac,
   database length: 3
   data_string_desc of output string: UTF8 on, non-ASCII, 3
 characters 7 bytes
   ords of output string:e2,201a,ac,
 # what happened here is SQLDescribeParam described the parameter as
 SQL_CHAR and that is what D:O bound it as. The driver converted 82
 in code page 1252 to 201a (a magic quote) and the database sees this
 as 3 chrs
 # the probable mistake here is that D:O should have looked at the
 perl data, seen it was unicode and not bound it as SQL_CHAR (but
 SQL_WCHAR) even though SQLDescribeParam said it was

Agreed.

 Inserting a UTF-8 encoded unicode euro, utf8 flag off:
 \x{0082} does not map to cp1252 at test_code.pl line 36.

I'd caution against using phrases like "UTF-8 encoded unicode euro, utf8
flag off". From the application's perspective it's not a euro, it's just
a sequence of bytes (that just happens to match what a euro unicode
codepoint would look like when UTF-8 encoded).

To put it another way, if the application has that string of bytes and
thinks it's a euro then the application is almost certainly broken.

 Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
 ...
 # the above is arguably what should have happened in the first test
 case i.e., D:O should have bound as SQL_WVARCHAR because the perl
 data was unicode. I included this one to show you what happens if
 you do it right.

Agreed.

 So, I'm thinking the problem above is D:O ignores utf8 flag on
 parameters when they are bound and uses whatever SQLDescribeParam
 says instead (SQL_CHAR in case of varchar columns). If it didn't
 ignore the utf8 flag here, it would have to rebind on every execute
 (which it may already do, I didn't check).

Agreed.

Great. Progress! :)

So, the next question is what are the implications of fixing it for
existing applications? Do you need a deprecation cycle with warnings etc?

Tim.


Re: DBD::ODBC tricky unicode problem

2013-10-27 Thread Martin J. Evans

On 27/10/2013 14:49, Tim Bunce wrote:

Thanks to Tim for being (as usual) a good sounding board.


On Sun, Oct 27, 2013 at 12:18:53PM +, Martin J. Evans wrote:


Inserting a unicode euro, utf8 flag on:
   input string: €
   data_string_desc of input string: UTF8 on, non-ASCII, 1 characters
3 bytes
   ords of input string: 20ac,
   bytes of input string: e2,82,ac,
   database length: 3
   data_string_desc of output string: UTF8 on, non-ASCII, 3
characters 7 bytes
   ords of output string:e2,201a,ac,
# what happened here is SQLDescribeParam described the parameter as
SQL_CHAR and that is what D:O bound it as. The driver converted 82
in code page 1252 to 201a (a magic quote) and the database sees this
as 3 chrs
# the probable mistake here is that D:O should have looked at the
perl data, seen it was unicode and not bound it as SQL_CHAR (but
SQL_WCHAR) even though SQLDescribeParam said it was


Agreed.


Good. That is also easily done assuming I check params are rebound in 
case someone tries to insert code page chrs followed by unicode.
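i.e. the case I need to handle is something like this (sketch):

  my $sth = $dbh->prepare(q/INSERT INTO test_varchar (s) VALUES (?)/);

  # first execute: a bytes/codepage string, utf8 flag off
  $sth->execute("\xe9");          # e-acute in the client code page

  # second execute, same handle: a unicode string, utf8 flag on, so the
  # parameter must be (re)bound as a wide type this time around
  $sth->execute("\x{20ac}");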



Inserting a UTF-8 encoded unicode euro, utf8 flag off:
\x{0082} does not map to cp1252 at test_code.pl line 36.


I'd caution against using phrases like "UTF-8 encoded unicode euro, utf8
flag off". From the application's perspective it's not a euro, it's just
a sequence of bytes (that just happens to match what a euro unicode
codepoint would look like when UTF-8 encoded).


Yeah, I realised that; I only threw that one in because I've come across 
people doing that, i.e. encoding unicode and inserting the encoded 
octets, and I wanted to cover all bases.



To put it another way, if the application has that string of bytes and
thinks it's a euro then the application is almost certainly broken.


see above.


Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
...
# the above is arguably what should have happened in the first test
case i.e., D:O should have bound as SQL_WVARCHAR because the perl
data was unicode. I included this one to show you what happens if
you do it right.


Agreed.


So, I'm thinking the problem above is D:O ignores utf8 flag on
parameters when they are bound and uses whatever SQLDescribeParam
says instead (SQL_CHAR in case of varchar columns). If it didn't
ignore the utf8 flag here, it would have to rebind on every execute
(which it may already do, I didn't check).


Agreed.

Great. Progress! :)

So, the next question is what are the implications of fixing it for
existing applications? Do you need a deprecation cycle with warnings etc?

Tim.



Now we've got to this point we /can/ consider what happens when you read 
the data back. By default, in a unicode build of DBD::ODBC ALL string 
data is bound as SQL_WCHARs and I'm of the opinion that this is right 
and should continue (apart from anything else the only alternative is to 
bind as SQL_CHAR when the column is char/varchar and this doesn't help 
at all as the server will have to map server codepage chrs to client 
codepage chrs and this cannot always work). So let's consider what 
happens for each example:


Inserting a unicode euro, utf8 flag on:
  input string: €
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 
bytes

  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 
bytes

  ords of output string:e2,201a,ac,

So, here you didn't get back what you put in and the database didn't see 
the inserted data as a single euro chr but as 3 individual chrs in the 
client codepage which were mapped to the column codepage. When read back 
they were mapped from the column code page to unicode because the select 
bound chrs as SQL_WCHAR. I'm struggling to see the value to anyone of 
this behaviour but I'd love to hear from you if you don't agree.


Inserting a UTF-8 encoded unicode euro, utf8 flag off:
  In other words a stream of octets
\x{0082} does not map to cp1252 at test_code.pl line 36.
  input string: â\x{0082}¬
  data_string_desc of input string: UTF8 off, non-ASCII, 3 characters 3 
bytes

  ords of input string: e2,82,ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 
bytes

  ords of output string:e2,201a,ac,

This is just the same as the first example and as such I cannot see the 
usefulness of it. However, now that I know this, I am at a loss as to why 
no one told me this before I discovered it for myself. I've spoken 
to a few people actively using DBD::ODBC on Windows and mostly they are 
not using unicode for inserts/updates. If they restrict their inserts to 
ASCII there will be no change for them as what they get back now is 
already unicode (just the first 127 chrs of unicode) because DBD::ODBC 
always binds as SQL_WCHAR. If you insert chrs in the upper part of your 
codepage then what you get back is not what you inserted, i.e. it is not 
chrs in your codepage, it is unicode and the 

Re: DBD::ODBC tricky unicode problem

2013-10-26 Thread Tim Bunce
On Thu, Oct 24, 2013 at 12:00:27AM +0100, Martin J. Evans wrote:

 I may interchange bytes with chrs in a codepage in this.
 A codepage has 255 characters and each one is a byte.

I don't understand that first sentence.

 Before unicode support was added to DBD::ODBC, it used to bind all
 char and varchar columns as SQL_CHAR (and hence not unicode BUT
 could be a character in a windows code page which has a unique
 unicode codepoint). This meant you could store chrs in your current
 code page and it also meant you could just ignore code pages and
 store, say UTF-8 encoded data in a char/varchar (and some did and
 decoded it themselves when they read it back).

There's a whole bunch of stuff in there that I'm not clear on.

By "your current code page" I presume you mean the client's current code
page, not the server's idea of what the client code page is.

What I think you're saying here is that client code could store a
sequence of bytes on the server and get back the same bytes later,
*but* the server may interpret the sequence of bytes differently.

I find it really helpful to be very clear about how each part of
a system views the bytes that are passing through it:

- terminal
- application reading input
- application internal logic
- DBD
- DB client library
- DB server
---later---
- DB server
- DB client library
- DBD
- application internal logic
- etc. etc.

There are many actors here. If they have different views of what the
byte stream represents then you can have problems. But those problems
might only appear in certain cases (like surrogate characters).
Also different views at multiple different levels can cancel each other
out, making detection of problems difficult.

It's quite possible to have correct unicode input at a terminal, that's
then processed by some code and output as correct unicode, while the code
that did the processing doesn't view the bytes as unicode.

The DBD::Oracle test suite goes to great lengths to check that
the server's view of what it's been sent matches what we want.
Specifically that when a single unicode character which encodes to
multiple bytes gets to the server it agrees that it's only a single
character.

Forgive me if all this is familiar to you. I repeat it here partly for
others and partly because it wasn't clear to me from how you'd expressed
the above.

 When I discovered if you bound char/varchar columns as SQL_WCHAR
 (unicode) (with a unicode build of DBD::ODBC) you got unicode back
 (in other words sql server converts chrs in the current codepage to
 unicode) it seemed like a good idea to change the default to bind as
 SQL_WCHAR. With hindsight that was a mistake. Because this would
 break people storing UTF-8 encoded data in char/varchar I issued a
 change warning, eventually changed to SQL_WCHAR and added an
 odbc_old_unicode attribute to return it to old behaviour. No one has
 complained so far.

I read this and I'm left wondering about input vs storage vs output.
It's not clear to me what the server's view is of what it's been sent.
What would selectrow_array("SELECT CHARACTER_LENGTH(?) ...", undef, $foo)
return?

Or, to look at it another way, if the value is inserted into a table
and the table is sorted, is it sorted correctly? (That's a more complex
question though because it involves collations and possibly per-table
or even per-column character sets - so even more actors in the plot!)

 As it turns out this was probably a bad idea and breaks things.
 
 The problem I've created is horrible.
 
 Let's say you want to store chrs in the current code page in a char
 or varchar or you want to store data already UTF-8 encoded in a
 char/varchar (i.e. bytes) and decode it when you read it back.

Argh. Too complicated! :)

 Let's say the data you want to store is either:
 
 codepage chrs: 0xe2, 0x82, 0xac
 UTF-8 encoded euro symbol: 0xe2, 0x82, 0xac

What does perl's length() function report for those?
It's really important to focus on surface representation
when trying to work out what the right thing to do is.
I presume the perl length() of the first string is 3
and the second is 1.  Right?
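i.e. (quick sketch):

  my $bytes = "\xe2\x82\xac";   # three separate bytes, utf8 flag off
  my $euro  = "\x{20ac}";       # one character, utf8 flag on

  print length($bytes), "\n";               # 3
  print length($euro), "\n";                # 1
  print utf8::is_utf8($bytes) ? 1 : 0, "\n";   # 0
  print utf8::is_utf8($euro)  ? 1 : 0, "\n";   # 1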

 When you insert it, DBD::ODBC calls SQLDescribeParam and the driver
 describes these parameters as SQL_CHAR, they are bound as SQL_CHAR,
 so SQL Server takes this sequence to be characters in the current
 (or table or column) codepage.

Are you saying that DBD::ODBC doesn't pay attention to the SvUTF8
flag on the value?

 If you select them back binding them
 as SQL_CHAR, you get back what you put in - excellent. Except,
 DBD::ODBC now (and has for a long time) defaulted in a unicode build
 of DBD::ODBC to binding them as SQL_WCHAR. When you read them back
 you get:
 
 0xe2,0x201a,0xac (assuming windows-1252 is the codepage)
 
 (because 0x82 in windows-1252 is a funny comma which is really U+201a)

What did the server think the string was?

 and the column data is marked in perl as unicode.

So a unicode build of DBD::ODBC 

DBD::ODBC tricky unicode problem

2013-10-23 Thread Martin J. Evans
No one has reported this to me but while writing up common unicode 
problems using DBD::ODBC I've uncovered something which I think I broke 
a long time ago and although I know how to fix it, I'm unsure of the 
repercussions.


I may interchange bytes with chrs in a codepage in this. A codepage has 
255 characters and each one is a byte.


Sorry for the length of this but I could not think of a way of describing 
the problem more succinctly.


Some background

Before unicode support was added to DBD::ODBC, it used to bind all char 
and varchar columns as SQL_CHAR (and hence not unicode BUT could be a 
character in a windows code page which has a unique unicode codepoint). 
This meant you could store chrs in your current code page and it also 
meant you could just ignore code pages and store, say UTF-8 encoded data 
in a char/varchar (and some did and decoded it themselves when they read 
it back).


When I discovered if you bound char/varchar columns as SQL_WCHAR 
(unicode) (with a unicode build of DBD::ODBC) you got unicode back (in 
other words sql server converts chrs in the current codepage to unicode) 
it seemed like a good idea to change the default to bind as SQL_WCHAR. 
With hindsight that was a mistake. Because this would break people 
storing UTF-8 encoded data in char/varchar I issued a change warning, 
eventually changed to SQL_WCHAR and added an odbc_old_unicode attribute 
to return it to old behaviour. No one has complained so far.
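For anyone following along, the attribute is used like this (a sketch; I 
believe odbc_old_unicode can be set in the connect attributes like the 
other odbc_* options):

  # revert to the old behaviour: char/varchar columns bound as SQL_CHAR again
  my $dbh = DBI->connect('dbi:ODBC:DSN=asus2', $user, $pass,
                         { RaiseError => 1, odbc_old_unicode => 1 });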


As it turns out this was probably a bad idea and breaks things.

The problem I've created is horrible.

Let's say you want to store chrs in the current code page in a char or 
varchar or you want to store data already UTF-8 encoded in a 
char/varchar (i.e. bytes) and decode it when you read it back. Let's say 
the data you want to store is either:


codepage chrs: 0xe2, 0x82, 0xac
UTF-8 encoded euro symbol: 0xe2, 0x82, 0xac

When you insert it, DBD::ODBC calls SQLDescribeParam and the driver 
describes these parameters as SQL_CHAR, they are bound as SQL_CHAR, so 
SQL Server takes this sequence to be characters in the current (or table 
or column) codepage. If you select them back binding them as SQL_CHAR, 
you get back what you put in - excellent. Except, DBD::ODBC now (and has 
for a long time) defaulted in a unicode build of DBD::ODBC to binding 
them as SQL_WCHAR. When you read them back you get:


0xe2,0x201a,0xac (assuming windows-1252 is the codepage)

(because 0x82 in windows-1252 is a funny comma which is really U+201a)

and the column data is marked in perl as unicode.

argh.

I don't really care about the people trying to insert unicode into 
char/varchars, as it is just a daft thing to do (although possible) and 
my document explains why, but I'm amazed this has not caught someone out 
on Windows (where the default for DBD::ODBC is a unicode build).


So now I'm not sure what to do. It seems the odbc_old_unicode behaviour 
was right in the first place and, although it is easy to fix, how do I do 
it reasonably? Can I assume anyone who got broken when the original 
change was made switched to setting odbc_old_unicode, so reverting to the 
old behaviour won't affect them? But then, what about people not using 
odbc_old_unicode and relying on the current behaviour? You could say 
these people still get back what they inserted; it is just that they 
inserted data in a codepage and got the same string back, just in unicode 
now. They will be the ones affected because the data they would get back 
after reverting the change will be bytes/chrs in the codepage now and no 
longer marked as unicode in their perl.


Any good suggestions to get me out of this mess?

Martin
--
Martin J. Evans
Wetherby, UK