From: "Peter Kirk" <[EMAIL PROTECTED]> > Agreed. But to be fair to MySQL, they do mention as a potential problem > that three bytes have to be allocated in strings for each UTF-8 > character. For full UTF-8 support they would need four bytes per > character which would, from their perspective, be a greater problem. > Also I suspect that Unicode data is actually being stored in 16-bit > entities, and that the major issue is the extra complication of handling > surrogate pairs within that representation (rather than the trivial one > of converting such pairs to and from valid UTF-8).
Modern database engines now offer multiple encoding strategies for character storage. In SQL engines, the key issue is performance (notably storage I/O and network I/O), but that is completely orthogonal to the logical correctness of SQL functions and selections, which should operate internally on Unicode characters, independently of their actual storage encoding (UTF-8, CESU-8, UTF-16BE/LE, UTF-32, GB18030, or any other legacy charset). So I do think that it is quite easy to implement UTF-8 and be fully compliant with it, both for I/O in the query language and its results, and for storage, where it is certainly better than CESU-8. The hard part is not in these interfaces (MySQL, for example, is unique in supporting several alternative storage formats for its tables), but in the core engine itself, when it performs identity selection, sorting, range selections and substring extraction.

The other part of the problem is interoperability with MySQL clients. As long as these clients are not prepared to receive character data outside the BMP, they should connect with a CESU-8 encoding profile; if they are prepared for it, they would do better to use UTF-8. But is the MySQL client protocol flexible enough to support explicit tagging of the encoding used for strings? That is the right question. This may require an update to the protocol, and that may not be the first priority for MySQL, which wants first to prepare its core engine and connect it to external data sources and storage such as Oracle, Sybase, MS-SQL, UTF-8 text files, Access MDB files, XML data files, and possibly the more recent extensions of the Berkeley DB table format that now supports characters outside the BMP.

Tracking the required re-encoding between the components connected to the core engine may be tricky to develop, unless the interfaces between these components and the engine support explicit labelling of the charsets and encodings actually usable for interoperation, together with some negotiation protocol in these interfaces (something like the "Accept-*" headers in HTTP), notably when there are transcoding issues (which may affect very serious database integrity constraints, particularly uniqueness and existence, but also triggers).
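To make the CESU-8/UTF-8 distinction concrete: the conversion Peter calls trivial really is just surrogate-pair arithmetic. Below is a minimal sketch in Python (which has no built-in CESU-8 codec; the helper names are my own) showing that a supplementary character such as U+1D11E occupies four bytes in UTF-8 but six bytes (two 3-byte surrogate encodings) in CESU-8, and how the two forms round-trip.

    # A minimal sketch of the surrogate-pair arithmetic behind the
    # CESU-8 <-> UTF-8 conversion. Helper names are illustrative only.

    def to_cesu8(text: str) -> bytes:
        """Encode text as CESU-8: BMP characters as in UTF-8, supplementary
        characters as a UTF-16 surrogate pair, each surrogate on 3 bytes."""
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if cp < 0x10000:
                out += ch.encode("utf-8")            # 1..3 bytes, same as UTF-8
            else:
                cp -= 0x10000
                for s in (0xD800 + (cp >> 10),       # high (lead) surrogate
                          0xDC00 + (cp & 0x3FF)):    # low (trail) surrogate
                    out += bytes((0xE0 | (s >> 12),
                                  0x80 | ((s >> 6) & 0x3F),
                                  0x80 | (s & 0x3F)))
        return bytes(out)

    def cesu8_to_utf8(data: bytes) -> bytes:
        """Convert well-formed CESU-8 to standard UTF-8 by recombining each
        encoded surrogate pair into a single 4-byte UTF-8 sequence."""
        out = bytearray()
        i = 0
        while i < len(data):
            # ED A0..AF xx is the 3-byte encoding of a high surrogate.
            if data[i] == 0xED and i + 5 < len(data) and 0xA0 <= data[i + 1] <= 0xAF:
                hi = 0xD000 | ((data[i + 1] & 0x3F) << 6) | (data[i + 2] & 0x3F)
                lo = 0xD000 | ((data[i + 4] & 0x3F) << 6) | (data[i + 5] & 0x3F)
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                out += chr(cp).encode("utf-8")       # one 4-byte UTF-8 sequence
                i += 6
            else:
                out.append(data[i])
                i += 1
        return bytes(out)

    if __name__ == "__main__":
        s = "G\U0001D11E"                  # 'G' + MUSICAL SYMBOL G CLEF
        print(s.encode("utf-8").hex())     # 47f09d849e      (1 + 4 bytes)
        print(to_cesu8(s).hex())           # 47eda0b4edb49e  (1 + 6 bytes)
        assert cesu8_to_utf8(to_cesu8(s)) == s.encode("utf-8")

The transcoding itself is cheap; the cost Peter points to is that strings stored in 16-bit units force every string operation in the engine to be surrogate-aware, whereas the boundary conversion shown above is a purely local rewrite of byte sequences.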
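As for the "Accept-*"-style negotiation mentioned above, here is a hypothetical sketch of what such a step between a client and an engine component could look like; it does not reflect the real MySQL client protocol, and the function name and weighting scheme are assumptions for illustration only.

    # Hypothetical charset negotiation, in the spirit of HTTP "Accept-Charset":
    # the client advertises the encodings it accepts with preference weights,
    # and the server-side component picks the best one it can actually label.

    def negotiate_encoding(client_accepts: dict, server_supports: set) -> str:
        """Return the mutually supported encoding with the highest client
        preference; if none exists, a transcoding step (with its integrity
        risks) would be unavoidable."""
        candidates = [(q, enc) for enc, q in client_accepts.items()
                      if enc in server_supports and q > 0]
        if not candidates:
            raise ValueError("no common encoding; transcoding required")
        return max(candidates)[1]

    # Example: a client that prefers UTF-8 but still accepts CESU-8 for
    # compatibility, talking to an engine able to label both.
    print(negotiate_encoding({"utf-8": 1.0, "cesu-8": 0.5},
                             {"utf-8", "cesu-8", "utf-16le"}))   # -> "utf-8"

The point of such explicit labelling is that a component never has to guess the encoding of the strings it receives, so transcoding (and the integrity risks it brings for uniqueness constraints and triggers) only happens when the negotiation shows it is genuinely needed.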

