On Thu, 2007-10-04 at 14:47 +0200, Markus Gritsch wrote:
> On 10/4/07, Oleg Broytmann <[EMAIL PROTECTED]> wrote:
> > On Thu, Oct 04, 2007 at 01:50:52PM +0200, Markus Gritsch wrote:
> > > In MySQLdb queries are *allowed* to be unicode.
> >
> >    I meant - there were a period when SQLObject forces queries to be
> > unicode for MySQLdb 1.2.1+. Is that over now? Do we allow unicode but not
> > enforce it regardless of MySQLdb version?
>
> Ah, I understand.  Well, I have not really an idea :(  Maybe it would
> be a good idea to decode the string using self.encoding it was the
> case before removing it... David?

I spent nearly all day working on this. I have to admit that I don't really understand all of the layers fully.

Here's what I thought this morning, not because I understand what anything does, but because that's what makes the tests pass: all queries should be plain strings, encoded in various arbitrary and sometimes conflicting ways. But this is totally not true. The truth is much more complicated.

Imagine you have a table with one latin-1 column and one utf-8 column, and you want to update both columns. How are you going to send that to MySQL? You might think to send it like this (assuming you want both to hold the character 0xf1, n with tilde):

    "update morx set fleem = '\xf1', baz = '\xc3\xb1';"

where fleem is the latin-1 column and baz is the utf-8 column. This doesn't actually work, because MySQL expects you to talk to it in one fixed charset, which is set per-connection.

And this leads into the bug that I found: when a column is encoded in UTF-16, everything goes horribly wrong. I was thinking this would be pretty easy to fix, but it's not, since there are so many places where encoding and decoding happen: into sqlobject, on the column, on the query (formerly) in sqlobject, on the query in mysqldb, maybe inside the mysql client library, on the mysql server.

There's this additional connection charset parameter, which can differ from the column charset. The connection charset is what determines how data is sent into MySQL.

In the test, UTF-8 happens to work because treating UTF-8 as Latin-1 (the connection charset) doesn't break anything on the round trip. UTF-16 doesn't work, because while we can insert UTF-16 into a latin-1 string, it doesn't make the whole string UTF-16. So when we try to convert back, we get latin-1 treated as UTF-16. Not good.

So you might think that we should just make our connection charset always utf-8, and then have our columns convert their utf-16 to utf-8.
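
The mixed-charset update and the "UTF-8 survives a Latin-1 round trip" observation above can be reproduced without a database. This is only a Python 3 sketch working on raw bytes; it reuses the table and column names from the example query and does not model what MySQLdb or the MySQL server actually do internally.

```python
# Python 3 sketch: byte-level view of the example query, no database involved.
char = "\u00f1"  # n with tilde, the "0xf1" character from the example

latin1_bytes = char.encode("latin-1")  # b'\xf1'
utf8_bytes = char.encode("utf-8")      # b'\xc3\xb1'

# The update statement with one latin-1 value and one utf-8 value.
query = (b"update morx set fleem = '" + latin1_bytes +
         b"', baz = '" + utf8_bytes + b"';")

# Read as latin-1, the statement always decodes: fleem comes out right,
# but baz turns into two unrelated characters.
as_latin1 = query.decode("latin-1")

# Read as utf-8, the lone 0xf1 byte for fleem is invalid.
try:
    query.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Why the utf-8 column still passes on a latin-1 connection: every byte
# is a valid latin-1 character, so the round trip is lossless.
roundtrip = utf8_bytes.decode("latin-1").encode("latin-1").decode("utf-8")

print(utf8_ok)            # False
print(roundtrip == char)  # True
```

No single interpretation of the whole statement gets both columns right, which is the per-connection charset problem in miniature.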

But the way that MySQLdb sets the charset is via "SET NAMES", which also sets the collation to the default collation of that charset, whether or not that collation works for the charset of your columns! Using SET CHARACTER SET instead fixes that, but breaks further on: non-ascii characters (which are in latin-1) are inserted into latin-1 columns as '?'. This is probably because when we send them across the connection, we don't encode them in any way, so they end up treated as invalid utf-8.

You might ask why we don't go back to encoding the whole query. That's because then we would be double-encoding the stuff that really is utf-8, and we still don't know what we're converting the query *from*.

I believe I am going slightly crazy here. I will think about this some more, but possibly not immediately. In the meantime, if anyone has any thoughts for how to fix this encoding madness, they would be appreciated.

_______________________________________________
sqlobject-discuss mailing list
sqlobject-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sqlobject-discuss
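
The two symptoms described in the message (double-encoding of values that are already utf-8, and latin-1 characters showing up as '?') have close analogues in plain Python. This is a Python 3 sketch of the byte-level effect only; it assumes nothing about MySQL's real conversion code, and the '?' here comes from Python's "replace" error handler, used as a stand-in.

```python
# Python 3 sketch: an analogy for the symptoms, not MySQL's actual code path.
char = "\u00f1"  # n with tilde

utf8_bytes = char.encode("utf-8")      # b'\xc3\xb1'
latin1_bytes = char.encode("latin-1")  # b'\xf1'

# Double encoding: bytes that are already utf-8 are mistaken for latin-1
# text and encoded to utf-8 a second time.
double_encoded = utf8_bytes.decode("latin-1").encode("utf-8")
print(double_encoded)  # b'\xc3\x83\xc2\xb1'

# Raw latin-1 sent where utf-8 is expected: the lone 0xf1 byte is not
# valid utf-8, so it becomes the replacement character U+FFFD ...
as_text = latin1_bytes.decode("utf-8", errors="replace")

# ... which has no latin-1 representation and is stored as '?'.
stored = as_text.encode("latin-1", errors="replace")
print(stored)  # b'?'
```

Either way the original byte is unrecoverable once written, which is why guessing the source encoding of a whole query is risky.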