No one has reported this to me but during writing unicode common
problems using DBD::ODBC I've uncovered something which I think I broke
a long time ago and although I know how to fix it, I'm unsure of the
repercussions.
I may interchange bytes with chrs in a codepage in this. A codepage has
255 characters and each one is a byte.
Sorry for length of this but I could not think of a way of describing
the problem more succinctly.
Some background
Before unicode support was added to DBD::ODBC, it used to bind all char
and varchar columns as SQL_CHAR (and hence not unicode BUT could be a
character in a windows code page which has a unique unicode codepoint).
This meant you could store chrs in your current code page and it also
meant you could just ignore code pages and store, say UTF-8 encoded data
in a char/varchar (and some did and decoded it themselves when they read
it back).
When I discovered if you bound char/varchar columns as SQL_WCHAR
(unicode) (with a unicode build of DBD::ODBC) you got unicode back (in
other words sql server converts chrs in the current codepage to unicode)
it seemed like a good idea to change the default to bind as SQL_WCHAR.
With hindsight that was a mistake. Because this would break people
storing UTF-8 encoded data in char/varchar I issued a change warning,
eventually changed to SQL_WCHAR and added an odbc_old_unicode attribute
to return it to old behaviour. No one has complained so far.
As it turns out this was probably a bad idea and breaks things.
The problem I've created is horrible.
Let's say you want to store chrs in the current code page in a char or
varchar or you want to store data already UTF-8 encoded in a
char/varchar (i.e. bytes) and decode it when you read it back. Let's say
the data you want to store is either:
codepage chrs: 0xe2, 0x82, 0xc2
UTF-8 encoded euro symbol: 0xe2, 0x82, 0xc2
When you insert it, DBD::ODBC calls SQLDescribeParam and the driver
describes these parameters as SQL_CHAR, they are bound as SQL_CHAR, so
SQL Server takes this sequence to be characters in the current (or table
or column) codepage. If you select them back binding them as SQL_CHAR,
you get back what you put in - excellent. Except, DBD::ODBC now (and has
for a long time) defaulted in a unicode build of DBD::ODBC to binding
them as SQL_WCHAR. When you read them back you get:
0xe2,0x201a,0xac (assuming windows-1252 is the codepage)
(because 0x82 in windows-1252 is a funny comma which is really U+201a)
and the column data is marked in perl as unicode.
argh.
I don't really care about the people trying to insert unicode into
char/varchars as it is just a daft thing to do (although possible) and
my document explains why but I'm amazed this has not caught someone out
on Windows (where the default for DBD::ODBC is a unicode build).
So now I'm not sure what to do. It seems the odbc_old_unicode behaviour
was right in the first place and although it is easy to fix how do I do
it reasonably? Can I assume anyone who got broken when the original
change was made, switched to setting odbc_old_unicode so reverting to
old behaviour won't affect them? But then, what about people not using
odbc_old_unicode and relying on it. You could say these people still get
back what they inserted, it is just they inserted data in a codepage and
got the same string back, just in unicode now. They will be the ones
affected because the data they would get back after reverting the change
will be bytes/chrs in the codepage now and no longer marked as unicode
in their perl.
Any good suggestions to get me out of this mess?
Martin
--
Martin J. Evans
Wetherby, UK