I've been trying to get to the bottom of unicode issues with MySQL and 
the various tests that exercise this functionality - test_blob.py and 
test_unicode.py.

I'm no expert on all of these issues, but I've gained a little 
understanding over the last few days.

I'm using release 0.7.1, at revision 1954, MySQLdb 1.2.1+, MySQL 5.0, 
Python 2.4.3.

The suggestion, when the test_blob.py and test_unicode.py failed, was to 
use the following

   charset=utf8&sqlobject_encoding=utf-8

in my connection URI, but I found that even this didn't work; it
generated essentially the same error. (The errors below are from
test_blob.py; test_unicode.py fails similarly.)

With no extra connection URI settings:

   UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 167: ordinal not in range(128)

With the extra connection URI settings:

   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 167: unexpected code byte

The problem is, I believe, that python strings (<type 'str'>) are not 
guaranteed to be valid utf-8 or valid ascii.

The python string s = chr(128) is not valid ascii _or_ utf-8, but 
SQLObject (and in particular its tests) effectively assume it is.

chr(128) is not valid ascii because it sets the 8th bit. chr(128) is not 
valid utf-8 because code point 128 must be represented with two bytes 
(0xC2 0x80), the first of which begins with the bits 110; a lone 0x80 
is a continuation byte and cannot stand by itself. (See 
http://en.wikipedia.org/wiki/UTF-8).
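
To make this concrete, here is a small sketch. It is written for a
current Python, where bytes plays the role of the Python 2 str discussed
in this message (that substitution is my assumption), and try_decode is
just a made-up helper name.

```python
# The single byte 0x80, i.e. what chr(128) produces on Python 2.
raw = b"\x80"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


ascii_result = try_decode(raw, "ascii")     # fails: the 8th bit is set
utf8_result = try_decode(raw, "utf-8")      # fails: lone continuation byte
latin1_result = try_decode(raw, "latin-1")  # succeeds: byte 128 -> code point 128

# Code point 128 encoded as UTF-8 takes two bytes; inspect the bits of the first.
utf8_bytes = latin1_result.encode("utf-8")
first_byte_bits = format(utf8_bytes[0], "08b")
```

Only the latin-1 decode gives a result here; the other two return None
because the codec raises UnicodeDecodeError.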

My solution (this is where my ignorance comes in) is to change the 
default sqlobject_encoding to 'latin-1', rather than 'ascii', in 
mysql/mysqlconnection.py. latin-1 is a 256-symbol, single-byte 
encoding, so every possible byte in a python string decodes without 
error.
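
A quick way to check the latin-1 claim, as a sketch only (again on a
current Python, with bytes standing in for the Python 2 str; this is not
SQLObject code):

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point N, so decoding cannot fail...
text = all_bytes.decode("latin-1")

# ...and encoding the result gives back exactly the original bytes.
round_tripped = text.encode("latin-1")

# By contrast, ascii cannot decode anything from 128 upward.
try:
    all_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

The round trip is lossless, which is what blob data needs. Whether
latin-1 is the right interpretation of the text is a separate question;
it just never raises.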

With this change, all of SQLObject's unicode and blob tests pass again, 
with no connection URI settings or special magic at the MySQL database end.

On the other hand, in the process I've discovered a similar problem with 
sqlite's unicode handling. In test_unicode.py, a UnicodeCol is made 
an alternateID, which implies unique. For MySQL, unique implies a key, 
and a key on this column requires a length. So,

col1 = UnicodeCol(alternateID=True)

becomes

col1 = UnicodeCol(alternateID=True, length=100)

and the test passes for MySQL, handling all the Unicode stuff correctly.

However, with the length argument, sqlite now fails this test.

With no length argument, sqlite uses a TEXT type, while with a length 
argument, it uses a VARCHAR type.

The TEXT type works (incorrectly!!! I believe) because it returns python 
strings rather than unicode strings. The VARCHAR type doesn't work 
(correctly!!! I believe) because it tries to coerce a python string into 
a unicode string (and a similar codec error is encountered).
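
The two behaviours can be reproduced in miniature with the
standard-library sqlite3 module on a current Python. To be clear, this
is not SQLObject's code path and not the TEXT/VARCHAR distinction
itself: here the switch is sqlite3's text_factory, and the table name
demo is made up. It only illustrates "hand back raw bytes" versus
"decode to unicode".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (col1 VARCHAR(100))")
# Store the byte 0x80 as a text value; the CAST leaves the raw byte as is.
conn.execute("INSERT INTO demo VALUES (CAST(? AS TEXT))", (b"\x80",))

# Behaviour 1: return raw byte strings. Nothing is decoded, so no error.
conn.text_factory = bytes
raw_value = conn.execute("SELECT col1 FROM demo").fetchone()[0]

# Behaviour 2: decode to unicode (sqlite3 assumes UTF-8), which fails on 0x80.
conn.text_factory = str
try:
    conn.execute("SELECT col1 FROM demo").fetchone()
    decode_failed = False
except (sqlite3.OperationalError, UnicodeDecodeError):
    decode_failed = True

conn.close()
```

The first read "works" only because no decoding is attempted; the second
fails for the same underlying reason as the MySQL errors above.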

So, I think the use of the 'ascii' encoding, where 'latin-1' is actually 
what is required, is a bug that should be fixed. The MySQL driver is the 
only place this is done explicitly, but the problem with sqlite's VARCHAR 
makes me think that this bug is present implicitly in a variety of other 
places.

Cheers!

nathan

-- 
Nathan Edwards, Ph.D.
Center for Bioinformatics and Computational Biology
3119 Biomolecular Sciences Bldg. #296
University of Maryland, College Park, MD 20742
Phone: +1 301-405-9901
Email: [EMAIL PROTECTED]
WWWeb: http://www.umiacs.umd.edu/~nedwards
