DBD::ODBC tricky unicode problem

Martin J. Evans Wed, 23 Oct 2013 16:03:38 -0700

No one has reported this to me but during writing unicode commonproblems using DBD::ODBC I've uncovered something which I think I brokea long time ago and although I know how to fix it, I'm unsure of therepercussions.

I may interchange bytes with chrs in a codepage in this. A codepage has255 characters and each one is a byte.

Sorry for length of this but I could not think of a way of describingthe problem more succinctly.


Some background

Before unicode support was added to DBD::ODBC, it used to bind all charand varchar columns as SQL_CHAR (and hence not unicode BUT could be acharacter in a windows code page which has a unique unicode codepoint).This meant you could store chrs in your current code page and it alsomeant you could just ignore code pages and store, say UTF-8 encoded datain a char/varchar (and some did and decoded it themselves when they readit back).

When I discovered if you bound char/varchar columns as SQL_WCHAR(unicode) (with a unicode build of DBD::ODBC) you got unicode back (inother words sql server converts chrs in the current codepage to unicode)it seemed like a good idea to change the default to bind as SQL_WCHAR.With hindsight that was a mistake. Because this would break peoplestoring UTF-8 encoded data in char/varchar I issued a change warning,eventually changed to SQL_WCHAR and added an odbc_old_unicode attributeto return it to old behaviour. No one has complained so far.


As it turns out this was probably a bad idea and breaks things.

The problem I've created is horrible.

Let's say you want to store chrs in the current code page in a char orvarchar or you want to store data already UTF-8 encoded in achar/varchar (i.e. bytes) and decode it when you read it back. Let's saythe data you want to store is either:


codepage chrs: 0xe2, 0x82, 0xc2
UTF-8 encoded euro symbol: 0xe2, 0x82, 0xc2

When you insert it, DBD::ODBC calls SQLDescribeParam and the driverdescribes these parameters as SQL_CHAR, they are bound as SQL_CHAR, soSQL Server takes this sequence to be characters in the current (or tableor column) codepage. If you select them back binding them as SQL_CHAR,you get back what you put in - excellent. Except, DBD::ODBC now (and hasfor a long time) defaulted in a unicode build of DBD::ODBC to bindingthem as SQL_WCHAR. When you read them back you get:


0xe2,0x201a,0xac (assuming windows-1252 is the codepage)

(because 0x82 in windows-1252 is a funny comma which is really U+201a)

and the column data is marked in perl as unicode.

argh.

I don't really care about the people trying to insert unicode intochar/varchars as it is just a daft thing to do (although possible) andmy document explains why but I'm amazed this has not caught someone outon Windows (where the default for DBD::ODBC is a unicode build).

So now I'm not sure what to do. It seems the odbc_old_unicode behaviourwas right in the first place and although it is easy to fix how do I doit reasonably? Can I assume anyone who got broken when the originalchange was made, switched to setting odbc_old_unicode so reverting toold behaviour won't affect them? But then, what about people not usingodbc_old_unicode and relying on it. You could say these people still getback what they inserted, it is just they inserted data in a codepage andgot the same string back, just in unicode now. They will be the onesaffected because the data they would get back after reverting the changewill be bytes/chrs in the codepage now and no longer marked as unicodein their perl.


Any good suggestions to get me out of this mess?

Martin
--
Martin J. Evans
Wetherby, UK

DBD::ODBC tricky unicode problem

Reply via email to