No one has reported this to me but during writing unicode common problems using DBD::ODBC I've uncovered something which I think I broke a long time ago and although I know how to fix it, I'm unsure of the repercussions.

I may interchange bytes with chrs in a codepage in this. A codepage has 255 characters and each one is a byte.

Sorry for length of this but I could not think of a way of describing the problem more succinctly.

Some background

Before unicode support was added to DBD::ODBC, it used to bind all char and varchar columns as SQL_CHAR (and hence not unicode BUT could be a character in a windows code page which has a unique unicode codepoint). This meant you could store chrs in your current code page and it also meant you could just ignore code pages and store, say UTF-8 encoded data in a char/varchar (and some did and decoded it themselves when they read it back).

When I discovered if you bound char/varchar columns as SQL_WCHAR (unicode) (with a unicode build of DBD::ODBC) you got unicode back (in other words sql server converts chrs in the current codepage to unicode) it seemed like a good idea to change the default to bind as SQL_WCHAR. With hindsight that was a mistake. Because this would break people storing UTF-8 encoded data in char/varchar I issued a change warning, eventually changed to SQL_WCHAR and added an odbc_old_unicode attribute to return it to old behaviour. No one has complained so far.

As it turns out this was probably a bad idea and breaks things.

The problem I've created is horrible.

Let's say you want to store chrs in the current code page in a char or varchar or you want to store data already UTF-8 encoded in a char/varchar (i.e. bytes) and decode it when you read it back. Let's say the data you want to store is either:

codepage chrs: 0xe2, 0x82, 0xc2
UTF-8 encoded euro symbol: 0xe2, 0x82, 0xc2

When you insert it, DBD::ODBC calls SQLDescribeParam and the driver describes these parameters as SQL_CHAR, they are bound as SQL_CHAR, so SQL Server takes this sequence to be characters in the current (or table or column) codepage. If you select them back binding them as SQL_CHAR, you get back what you put in - excellent. Except, DBD::ODBC now (and has for a long time) defaulted in a unicode build of DBD::ODBC to binding them as SQL_WCHAR. When you read them back you get:

0xe2,0x201a,0xac (assuming windows-1252 is the codepage)

(because 0x82 in windows-1252 is a funny comma which is really U+201a)

and the column data is marked in perl as unicode.

argh.

I don't really care about the people trying to insert unicode into char/varchars as it is just a daft thing to do (although possible) and my document explains why but I'm amazed this has not caught someone out on Windows (where the default for DBD::ODBC is a unicode build).

So now I'm not sure what to do. It seems the odbc_old_unicode behaviour was right in the first place and although it is easy to fix how do I do it reasonably? Can I assume anyone who got broken when the original change was made, switched to setting odbc_old_unicode so reverting to old behaviour won't affect them? But then, what about people not using odbc_old_unicode and relying on it. You could say these people still get back what they inserted, it is just they inserted data in a codepage and got the same string back, just in unicode now. They will be the ones affected because the data they would get back after reverting the change will be bytes/chrs in the codepage now and no longer marked as unicode in their perl.

Any good suggestions to get me out of this mess?

Martin
--
Martin J. Evans
Wetherby, UK

Reply via email to