Re: DBD::ODBC tricky unicode problem
On Sun, Oct 27, 2013 at 05:45:50PM +, Martin J. Evans wrote:

> On 27/10/2013 14:49, Tim Bunce wrote:
>> So, the next question is what are the implications of fixing it for
>> existing applications? Do you need a deprecation cycle with warnings etc?

Might be worth nailing that down before moving on to output issues.

> Now we've got to this point we /can/ consider what happens when you read
> the data back. By default, in a unicode build of DBD::ODBC ALL string
> data is bound as SQL_WCHARs and I'm of the opinion that this is right
> and should continue (apart from anything else the only alternative is to
> bind as SQL_CHAR when the column is char/varchar and this doesn't help
> at all as the server will have to map server codepage chrs to client
> codepage chrs and this cannot always work).
>
> So let's consider what happens for each example:
>
> Inserting a unicode euro, utf8 flag on:

I wouldn't involve input issues when discussing output issues. Input and
output are separate issues best kept separate. Otherwise there are too
many variables and distractions.

I'd suggest constructing the string on the server using suitable SQL
statements and checking that that string is received by the app.

> input string: €
> data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
> ords of input string: 20ac,
> bytes of input string: e2,82,ac,
> database length: 3

"database length" is ambiguous; best to clarify it as database character
length (or database octet length).

> data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
> ords of output string: e2,201a,ac,
>
> So, here you didn't get back what you put in and the database didn't see
> the inserted data as a single euro chr but as 3 individual chrs in the
> client codepage which was mapped to the column codepage. When read back
> it mapped from the column code page to unicode as the select bound chrs
> as SQL_WCHAR. I'm struggling to see the value to anyone of this
> behaviour but I'd love to hear from you if you don't agree.

Forget the input, focus on what the server had. The key question for
output is: did the application get the same characters as the server had?
In this case the server had 3 chars and when they reached the app one of
the chars was corrupted. It's broken, but with this test you can't be
sure where the breakage is (input or output) because you've not reported
what the server thought the three characters were.

> database length: 3
> data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
> ords of output string: e2,201a,ac,
>
> This is just the same as the first example and as such I cannot see the
> usefulness of it.

It's the same server state (3 chars) so the same output state. Like I
said, forget the input mechanism when talking about the output mechanism.
[Sorry to keep banging the same drum!]

> As it turns out I was about to release a 1.45 official release at the
> end of the 1.44 development cycle. I plan to release this any day now
> before considering any of this.

Perhaps add a warning message to the README and docs.

> My suggestion is that at the start of the 1.46_xx dev cycle:
>
> o I apply the switch as described, advertise it widely, and hound
>   everyone I know to try it out. As I see the current behaviour as badly
>   broken, I just cannot see right now how anyone could have used it as
>   it is and be adversely affected when it is fixed.
>
> o add test cases for round tripping of unicode data to varchars (of
>   which there are none right now as it is all to nvarchar, which already
>   works as discussed).

Round tripping will just work _if_ input works and output works.
But testing _only_ round tripping won't prove that the server has the
correct interpretation of what's been sent. It's prone to false
positives. Also, failures won't tell you if the failure was due to the
input or the output. So I suggest focusing on testing input and output
separately.

> So, the only remaining issues are:
>
> 1) my belief that binding output data as unicode/wide_chars always in
>    the unicode build is the right one. If you see any problem in that it
>    could turn things up-side-down but it would be useful to know.

What about people wanting to fetch binary data, like images? Presumably
binding output data as unicode/wide_chars always depends on the data type
on the server: varchar/nvarchar/blob/clob.

> 2) if someone specifies a bind type on bind_param I follow that no
>    matter what?

They'll get what they ask for :)

> 3) SQLDescribeParam is not always supported (e.g. freeTDS) and even if
>    it is, it can fail (because rearranging the SQL into a select, which
>    most drivers do for SQLDescribeParam, can fail). In this case D:O has
>    a fallback binding which you can override. I suggest that unless it
>    is overridden, D:O looks at the input param and binds it as unicode
>    if it is unicode (SvUTF8), otherwise it binds it as SQL_CHAR. In
>    other words, the result from the SQLDescribeParam call becomes
>    irrelevant for char data as we look at the perl data instead.
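To make the output-only approach concrete, here is a minimal sketch of
the kind of test Tim describes, building the string on the server so the
input path is never involved. The DSN and credentials are placeholders,
and it assumes MS SQL Server, where NCHAR(8364) constructs U+20AC
server-side:

    use strict;
    use warnings;
    use DBI;

    # DSN and credentials are placeholders
    my $dbh = DBI->connect("dbi:ODBC:DSN=mydsn", "user", "pass",
                           { RaiseError => 1 });

    # build a euro on the server itself so no input handling is exercised
    my ($s) = $dbh->selectrow_array("select NCHAR(8364)");    # U+20AC

    # if output handling is correct we expect:
    #   UTF8 on, non-ASCII, 1 characters 3 bytes
    print DBI::data_string_desc($s), "\n";
    printf "ords: %s\n", join ",", map { sprintf "%x", ord $_ } split //, $s;

DBI::data_string_desc is the same DBI utility that produced the
"UTF8 on, non-ASCII, ..." diagnostics shown throughout this thread.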
Re: DBD::ODBC tricky unicode problem
I've snipped my original email and Tim's response and started again based
on Tim's comments in the email and on IRC. So:

o this concentrates on input to the database and ignores output.

o only looks so far at inserting into varchars (and not nvarchars, which
  I currently believe work, but we'll come back to that).

o includes examples of different inputs, lengths in perl and what you get
  back.

o test code and all discussion is Windows only for now as ODBC drivers on
  Unix can work in very different ways. Also, the test code was run
  against MS SQL Server, whose driver is in my experience one of the best
  ODBC drivers for Windows.

o you cannot correctly translate chr data from a client to a server using
  the SQL Server ODBC driver if the client code page differs from the
  server code page - see http://support.microsoft.com/kb/234748.

o MS quite clearly state "If your server must store data from multiple
  code pages, the supported solution is to store the data in Unicode
  columns (NCHAR/NVARCHAR/NTEXT)" - see
  http://support.microsoft.com/kb/234748 again.

o test code included at end of mail.

A few things you should know:

1. D:O does not know what codepage you are in and there is nothing in
   ODBC which allows it to find that out or even say it matters.

2. I believe the TDS protocol (what MS SQL Server uses) allows a client
   code page to be sent; however, I believe that is irrelevant now as the
   SQL Server driver converts your client code page chrs into unicode
   before sending them to SQL Server.

3. You cannot expect to get back chrs you inserted into char/varchar
   columns if you did not turn on AutoTranslate in the driver (which
   defaults to on).

4. D:O converts Perl unicode scalars to unicode on Windows using
   MultiByteToWideChar and setting CP_UTF8.

OK, so diving into the examples, here is hopefully fairly
self-explanatory output from the test code (code at end of email). I've
interleaved my comments starting with #.

Current active console code page: 1252
DBD::ODBC build for unicode:1
Output connstr: DSN=asus2;UID=sa;PWD=easysoft;APP=Strawberry Perl (64-bit);WSID=ASUS2;Network=DBMSSOCN;Address=ASUS2\SQLEXPRESS,1433

Inserting a unicode euro, utf8 flag on:
input string: €
data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
ords of input string: 20ac,
bytes of input string: e2,82,ac,
database length: 3
data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
ords of output string: e2,201a,ac,

# what happened here is SQLDescribeParam described the parameter as
# SQL_CHAR and that is what D:O bound it as. The driver converted 82 in
# code page 1252 to 201a (a magic quote) and the database sees this as
# 3 chrs
# the probable mistake here is that D:O should have looked at the perl
# data, seen it was unicode and bound it as SQL_WCHAR (not SQL_CHAR) even
# though SQLDescribeParam said it was SQL_CHAR

Inserting a UTF-8 encoded unicode euro, utf8 flag off:
\x{0082} does not map to cp1252 at test_code.pl line 36.
# the above resulted from a print to stdout in windows-1252
# you'd expect that since we were outputting bytes to the terminal

input string: â\x{0082}¬
data_string_desc of input string: UTF8 off, non-ASCII, 3 characters 3 bytes
ords of input string: e2,82,ac,
bytes of input string: e2,82,ac,
database length: 3
data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
ords of output string: e2,201a,ac,

# same thing as the first case happened here, except D:O couldn't have
# done anything else as Perl said the scalar was not unicode and SQL
# Server said the parameter was SQL_CHAR

Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
input string: €
data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
ords of input string: 20ac,
bytes of input string: e2,82,ac,
database length: 1
data_string_desc of output string: UTF8 on, non-ASCII, 1 characters 3 bytes
ords of output string: 20ac,

# the above is arguably what should have happened in the first test case,
# i.e., D:O should have bound as SQL_WVARCHAR because the perl data was
# unicode. I included this one to show you what happens if you do it
# right.

Inserting a unicode U+187 which is not in the current code page:
\x{0187} does not map to cp1252 at test_code.pl line 36.

# you'd expect this warning here on printing to stdout

input string: \x{0187}
data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 2 bytes
ords of input string: 187,
bytes of input string: c6,87,
database length: 2
data_string_desc of output string: UTF8 on, non-ASCII, 2 characters 5 bytes
ords of output string: c6,2021,

# here again, down to D:O binding as SQL_CHAR, SQL Server interpreted
# this as 2 chrs in the client code page, and converted them to chr c6
# and a double dagger

Inserting a unicode U+187 which is not in the current code page with forced binding:
\x{0187} does not map to cp1252 at test_code.pl line 36.
input string: \x{0187}
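For reference, the "forced SQL_WVARCHAR" cases above boil down to an
explicit bind type on bind_param, overriding whatever SQLDescribeParam
reports. A sketch (the table name and DSN are made up; Martin's actual
test code is at the end of his mail):

    use strict;
    use warnings;
    use DBI qw(:sql_types);

    my $dbh = DBI->connect("dbi:ODBC:DSN=mydsn", "user", "pass",
                           { RaiseError => 1 });

    my $sth = $dbh->prepare("insert into chartest (c) values (?)");
    # bind wide explicitly; an explicit type takes precedence over
    # whatever SQLDescribeParam reports for the parameter
    $sth->bind_param(1, "\x{20ac}", SQL_WVARCHAR);
    $sth->execute;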
Re: DBD::ODBC tricky unicode problem
On Sun, Oct 27, 2013 at 12:18:53PM +, Martin J. Evans wrote:

> Inserting a unicode euro, utf8 flag on:
> input string: €
> data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
> ords of input string: 20ac,
> bytes of input string: e2,82,ac,
> database length: 3
> data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
> ords of output string: e2,201a,ac,
>
> # what happened here is SQLDescribeParam described the parameter as
> # SQL_CHAR and that is what D:O bound it as. The driver converted 82 in
> # code page 1252 to 201a (a magic quote) and the database sees this as
> # 3 chrs
> # the probable mistake here is that D:O should have looked at the perl
> # data, seen it was unicode and bound it as SQL_WCHAR (not SQL_CHAR)
> # even though SQLDescribeParam said it was SQL_CHAR

Agreed.

> Inserting a UTF-8 encoded unicode euro, utf8 flag off:
> \x{0082} does not map to cp1252 at test_code.pl line 36.

I'd caution against using phrases like "UTF-8 encoded unicode euro, utf8
flag off". From the application's perspective it's not a euro, it's just
a sequence of bytes (that just happens to match what a euro unicode
codepoint would look like when UTF-8 encoded). To put it another way, if
the application has that string of bytes and thinks it's a euro then the
application is almost certainly broken.

> Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
> ...
> # the above is arguably what should have happened in the first test
> # case, i.e., D:O should have bound as SQL_WVARCHAR because the perl
> # data was unicode. I included this one to show you what happens if you
> # do it right.

Agreed.

> So, I'm thinking the problem above is D:O ignores the utf8 flag on
> parameters when they are bound and uses whatever SQLDescribeParam says
> instead (SQL_CHAR in the case of varchar columns). If it didn't ignore
> the utf8 flag here, it would have to rebind on every execute (which it
> may already do, I didn't check).

Agreed.

Great. Progress! :)

So, the next question is what are the implications of fixing it for
existing applications? Do you need a deprecation cycle with warnings etc?

Tim.
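The distinction Tim is drawing - a character string versus the octets of
its encoding - can be shown with core Encode (a sketch):

    use strict;
    use warnings;
    use Encode qw(encode);

    my $char  = "\x{20ac}";              # a euro: 1 character, UTF8 flag on
    my $bytes = encode("UTF-8", $char);  # its encoding: 3 octets, UTF8 flag off

    print length($char),  "\n";          # 1
    print length($bytes), "\n";          # 3
    # $bytes is just the octets e2,82,ac; an application that still
    # believes it holds "a euro" at this point is the broken party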
Re: DBD::ODBC tricky unicode problem
On 27/10/2013 14:49, Tim Bunce wrote:

Thanks to Tim for being (as usual) a good sounding board.

> On Sun, Oct 27, 2013 at 12:18:53PM +, Martin J. Evans wrote:
>
>> Inserting a unicode euro, utf8 flag on:
>> input string: €
>> data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
>> ords of input string: 20ac,
>> bytes of input string: e2,82,ac,
>> database length: 3
>> data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
>> ords of output string: e2,201a,ac,
>>
>> # what happened here is SQLDescribeParam described the parameter as
>> # SQL_CHAR and that is what D:O bound it as. The driver converted 82
>> # in code page 1252 to 201a (a magic quote) and the database sees this
>> # as 3 chrs
>> # the probable mistake here is that D:O should have looked at the perl
>> # data, seen it was unicode and bound it as SQL_WCHAR (not SQL_CHAR)
>> # even though SQLDescribeParam said it was SQL_CHAR
>
> Agreed.

Good. That is also easily done, assuming I check params are rebound in
case someone tries to insert code page chrs followed by unicode.

>> Inserting a UTF-8 encoded unicode euro, utf8 flag off:
>> \x{0082} does not map to cp1252 at test_code.pl line 36.
>
> I'd caution against using phrases like "UTF-8 encoded unicode euro,
> utf8 flag off". From the application's perspective it's not a euro,
> it's just a sequence of bytes (that just happens to match what a euro
> unicode codepoint would look like when UTF-8 encoded).

Yeah, I realised that. I only threw that one in because I've come across
people doing that, i.e., encoding unicode and inserting the encoded
octets, and I wanted to cover all bases.

> To put it another way, if the application has that string of bytes and
> thinks it's a euro then the application is almost certainly broken.

See above.

>> Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
>> ...
>> # the above is arguably what should have happened in the first test
>> # case, i.e., D:O should have bound as SQL_WVARCHAR because the perl
>> # data was unicode. I included this one to show you what happens if
>> # you do it right.
>
> Agreed.
>
>> So, I'm thinking the problem above is D:O ignores the utf8 flag on
>> parameters when they are bound and uses whatever SQLDescribeParam says
>> instead (SQL_CHAR in the case of varchar columns). If it didn't ignore
>> the utf8 flag here, it would have to rebind on every execute (which it
>> may already do, I didn't check).
>
> Agreed.
>
> Great. Progress! :)
>
> So, the next question is what are the implications of fixing it for
> existing applications? Do you need a deprecation cycle with warnings
> etc?
>
> Tim.

Now we've got to this point we /can/ consider what happens when you read
the data back. By default, in a unicode build of DBD::ODBC ALL string
data is bound as SQL_WCHARs and I'm of the opinion that this is right and
should continue (apart from anything else the only alternative is to bind
as SQL_CHAR when the column is char/varchar and this doesn't help at all
as the server will have to map server codepage chrs to client codepage
chrs and this cannot always work).

So let's consider what happens for each example:

Inserting a unicode euro, utf8 flag on:
input string: €
data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
ords of input string: 20ac,
bytes of input string: e2,82,ac,
database length: 3
data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
ords of output string: e2,201a,ac,

So, here you didn't get back what you put in and the database didn't see
the inserted data as a single euro chr but as 3 individual chrs in the
client codepage which was mapped to the column codepage.
When read back it mapped from the column code page to unicode as the
select bound chrs as SQL_WCHAR. I'm struggling to see the value to anyone
of this behaviour but I'd love to hear from you if you don't agree.

Inserting a UTF-8 encoded unicode euro, utf8 flag off:

In other words, a stream of octets.

\x{0082} does not map to cp1252 at test_code.pl line 36.
input string: â\x{0082}¬
data_string_desc of input string: UTF8 off, non-ASCII, 3 characters 3 bytes
ords of input string: e2,82,ac,
bytes of input string: e2,82,ac,
database length: 3
data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
ords of output string: e2,201a,ac,

This is just the same as the first example and as such I cannot see the
usefulness of it. However, because I know this I am at a loss as to why
no one has told me this before I discovered it for myself. I've spoken to
a few people actively using DBD::ODBC on Windows and mostly they are not
using unicode for inserts/updates. If they restrict their inserts to
ASCII there will be no change for them, as what they get back now is
already unicode (just the first 128 chrs of unicode) because DBD::ODBC
always binds as SQL_WCHAR. If you insert chrs in the upper part of your
codepage then what you get back is not what you inserted, i.e., it is not
chrs in your codepage, it is unicode and the scalar is marked as such in
perl.
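To make that last point concrete, a sketch of what an upper-codepage
round trip looks like with the current SQL_WCHAR default (a windows-1252
client is assumed; the table name and DSN are made up):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:ODBC:DSN=mydsn", "user", "pass",
                           { RaiseError => 1 });

    # 0xe9 is e-acute in windows-1252, inserted as a plain codepage byte
    $dbh->do("insert into chartest (c) values (?)", undef, "\xe9");

    my ($back) = $dbh->selectrow_array("select c from chartest");
    # the read is bound as SQL_WCHAR, so the driver maps the column
    # codepage to unicode and $back comes back as U+00E9 with the UTF8
    # flag set - the same character, but no longer the codepage byte
    print DBI::data_string_desc($back), "\n";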
Re: DBD::ODBC tricky unicode problem
On Thu, Oct 24, 2013 at 12:00:27AM +0100, Martin J. Evans wrote:

> I may interchange bytes with chrs in a codepage in this. A codepage has
> 256 characters and each one is a byte.

I don't understand that first sentence.

> Before unicode support was added to DBD::ODBC, it used to bind all char
> and varchar columns as SQL_CHAR (and hence not unicode, BUT could be a
> character in a windows code page which has a unique unicode codepoint).
> This meant you could store chrs in your current code page and it also
> meant you could just ignore code pages and store, say, UTF-8 encoded
> data in a char/varchar (and some did, and decoded it themselves when
> they read it back).

There's a whole bunch of stuff in there that I'm not clear on. By "your
current code page" I presume you mean the client's current code page, not
the server's idea of what the client code page is.

What I think you're saying here is that client code could store a
sequence of bytes on the server and get back the same bytes later, *but*
the server may interpret the sequence of bytes differently.

I find it really helpful to be very clear about how each part of a system
views the bytes that are passing through it:

- terminal
- application reading input
- application internal logic
- DBD
- DB client library
- DB server
---later---
- DB server
- DB client library
- DBD
- application internal logic
- etc. etc.

There are many actors here. If they have different views of what the byte
stream represents then you can have problems. But those problems might
only appear in certain cases (like surrogate characters). Also, different
views at multiple different levels can cancel each other out, making
detection of problems difficult. It's quite possible to have correct
unicode input at a terminal, that's then processed by some code and
output as correct unicode, while the code that did the processing doesn't
view the bytes as unicode.

The DBD::Oracle test suite goes to great lengths to check that the
server's view of what it's been sent matches what we want. Specifically,
that when a single unicode character which encodes to multiple bytes gets
to the server, the server agrees that it's only a single character.

Forgive me if all this is familiar to you. I repeat it here partly for
others and partly because it wasn't clear to me from how you'd expressed
the above.

> When I discovered that if you bound char/varchar columns as SQL_WCHAR
> (unicode) (with a unicode build of DBD::ODBC) you got unicode back (in
> other words sql server converts chrs in the current codepage to
> unicode) it seemed like a good idea to change the default to bind as
> SQL_WCHAR. With hindsight that was a mistake. Because this would break
> people storing UTF-8 encoded data in char/varchar I issued a change
> warning, eventually changed to SQL_WCHAR and added an odbc_old_unicode
> attribute to return it to the old behaviour. No one has complained so
> far.

I read this and I'm left wondering about input vs storage vs output. It's
not clear to me what the server's view is of what it's been sent. What
would selectrow_array("SELECT CHARACTER_LENGTH(?) ...", undef, $foo)
return? Or, to look at it another way, if the value is inserted into a
table and the table is sorted, is it sorted correctly? (That's a more
complex question though, because it involves collations and possibly
per-table or even per-column character sets - so even more actors in the
plot!)

> As it turns out this was probably a bad idea and breaks things. The
> problem I've created is horrible.
> Let's say you want to store chrs in the current code page in a char or
> varchar, or you want to store data already UTF-8 encoded in a
> char/varchar (i.e. bytes) and decode it when you read it back.

Argh. Too complicated! :)

> Let's say the data you want to store is either:
>
> codepage chrs: 0xe2, 0x82, 0xac
> UTF-8 encoded euro symbol: 0xe2, 0x82, 0xac

What does perl's length() function report for those? It's really
important to focus on the surface representation when trying to work out
what the right thing to do is. I presume the perl length() of the first
string is 3 and the second is 1. Right?

> When you insert it, DBD::ODBC calls SQLDescribeParam and the driver
> describes these parameters as SQL_CHAR, they are bound as SQL_CHAR, so
> SQL Server takes this sequence to be characters in the current (or
> table or column) codepage.

Are you saying that DBD::ODBC doesn't pay attention to the SvUTF8 flag on
the value?

> If you select them back binding them as SQL_CHAR, you get back what you
> put in - excellent. Except, DBD::ODBC now (and has for a long time)
> defaulted in a unicode build of DBD::ODBC to binding them as SQL_WCHAR.
> When you read them back you get:
>
> 0xe2,0x201a,0xac (assuming windows-1252 is the codepage)
> (because 0x82 in windows-1252 is a funny comma which is really U+201a)

What did the server think the string was?

> and the column data is marked in perl as unicode. So a unicode build of
> DBD::ODBC
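Tim's server's-eye-view question could be answered with something like
this sketch. SQL Server spells it LEN() for characters and DATALENGTH()
for octets rather than CHARACTER_LENGTH(); whether SQLDescribeParam copes
with such a statement is driver-dependent, and the DSN is a placeholder:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:ODBC:DSN=mydsn", "user", "pass",
                           { RaiseError => 1 });
    my $foo = "\x{20ac}";    # the value under test

    # how many characters and how many octets does the server think it got?
    my ($chars, $octets) = $dbh->selectrow_array(
        "select LEN(?), DATALENGTH(?)", undef, $foo, $foo);
    print "server sees $chars characters in $octets octets\n";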
DBD::ODBC tricky unicode problem
No one has reported this to me, but while writing up common unicode
problems using DBD::ODBC I've uncovered something which I think I broke a
long time ago, and although I know how to fix it, I'm unsure of the
repercussions.

I may interchange bytes with chrs in a codepage in this. A codepage has
256 characters and each one is a byte.

Sorry for the length of this, but I could not think of a way of
describing the problem more succinctly.

Some background:

Before unicode support was added to DBD::ODBC, it used to bind all char
and varchar columns as SQL_CHAR (and hence not unicode, BUT could be a
character in a windows code page which has a unique unicode codepoint).
This meant you could store chrs in your current code page and it also
meant you could just ignore code pages and store, say, UTF-8 encoded data
in a char/varchar (and some did, and decoded it themselves when they read
it back).

When I discovered that if you bound char/varchar columns as SQL_WCHAR
(unicode) (with a unicode build of DBD::ODBC) you got unicode back (in
other words sql server converts chrs in the current codepage to unicode)
it seemed like a good idea to change the default to bind as SQL_WCHAR.
With hindsight that was a mistake. Because this would break people
storing UTF-8 encoded data in char/varchar I issued a change warning,
eventually changed to SQL_WCHAR and added an odbc_old_unicode attribute
to return it to the old behaviour. No one has complained so far.

As it turns out this was probably a bad idea and breaks things. The
problem I've created is horrible.

Let's say you want to store chrs in the current code page in a char or
varchar, or you want to store data already UTF-8 encoded in a
char/varchar (i.e. bytes) and decode it when you read it back. Let's say
the data you want to store is either:

codepage chrs: 0xe2, 0x82, 0xac
UTF-8 encoded euro symbol: 0xe2, 0x82, 0xac

When you insert it, DBD::ODBC calls SQLDescribeParam and the driver
describes these parameters as SQL_CHAR, they are bound as SQL_CHAR, so
SQL Server takes this sequence to be characters in the current (or table
or column) codepage. If you select them back binding them as SQL_CHAR,
you get back what you put in - excellent.

Except, DBD::ODBC now (and has for a long time) defaulted in a unicode
build of DBD::ODBC to binding them as SQL_WCHAR. When you read them back
you get:

0xe2,0x201a,0xac (assuming windows-1252 is the codepage)
(because 0x82 in windows-1252 is a funny comma which is really U+201a)

and the column data is marked in perl as unicode. Argh.

I don't really care about the people trying to insert unicode into
char/varchars as it is just a daft thing to do (although possible) and my
document explains why, but I'm amazed this has not caught someone out on
Windows (where the default for DBD::ODBC is a unicode build).

So now I'm not sure what to do. It seems the odbc_old_unicode behaviour
was right in the first place, and although it is easy to fix, how do I do
it reasonably? Can I assume anyone who got broken when the original
change was made switched to setting odbc_old_unicode, so reverting to the
old behaviour won't affect them? But then, what about people not using
odbc_old_unicode and relying on the current behaviour? You could say
these people still get back what they inserted; it is just that they
inserted data in a codepage and got the same string back, now in unicode.
They will be the ones affected, because the data they would get back
after reverting the change will be bytes/chrs in the codepage and no
longer marked as unicode in their perl.
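For anyone following along, the compatibility switch mentioned above is a
connect-time attribute, along these lines (DSN and credentials are
placeholders):

    use DBI;

    # revert char/varchar reads to the old SQL_CHAR binding
    my $dbh = DBI->connect("dbi:ODBC:DSN=mydsn", "user", "pass",
                           { odbc_old_unicode => 1, RaiseError => 1 });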
Any good suggestions to get me out of this mess?

Martin
--
Martin J. Evans
Wetherby, UK