At 5:23 PM -0500 6/21/04, Tim wrote:
It sounds like SQLite 3.0 only supports one text representation per database ("Internally and in the disk file, the same text representation is used everywhere."). Is there a particular reason for this limitation? What if I want to store one column as UTF8 text and another column as UTF16? Other databases offer a separate 16-bit "ntext" column type for this purpose.

There is no reason to support both UTF-8 and UTF-16 in the same file at the same time. They both have identical character repertoires, and identical code points for each character; they are completely interchangeable. They both represent every character that exists in any language.


The only reasons for picking one over the other relates to memory/disk usage and portability. UTF-8 takes up much less disk space when you are dealing mainly in Roman characters, while UTF-16 tends to use less disk space when dealing mainly with non-Roman characters. UTF-8 also has an identical byte order on any platform, making it "network safe", and trivially portable. UTF-16 has separate versions for big-endian and little-endian architectures, plus some UTF-16 files have a byte-order-mark ("BOM") to say what we have, while others don't, making for a potential compatability problem in the latter case; UTF-16's main advantage is that it is sometimes faster to use. In any event, both UTF-8 and UTF-16 are variable-width encodings, though UTF-8 tends to vary more; their names say the *minimum* number of bits used by each character.

Speaking practically, unless you can justify using UTF-16 for anything, then always use UTF-8 by default as doing so gives a lot more advantages. (But whichever of those you choose, you can convert later fairly easily.)

The reason for making the entire file one encoding or the other is that it simplifies the core SQLite code, as well as other code which uses it, since the code doesn't have to constantly test for one type or the other.

The situation you mention about other databases is a different matter. Their normal text columns are fixed-width 8-bit characters, which are not Unicode, and can only represent 255 unique characters. The "ntext" (short for "national varchar") column type is for when you need more than that many characters. And even a fixed-width 16-bit column won't fit all the characters that Unicode can represent. Unicode has a 21-bit code point range, plus a fixed-width UTF-32 encoding (the remaining 11 bits aren't used). The 8-bit and 16-bit non-Unicode encodings used by other databases are not fully interchangeable with each other.

The fact that un-necessary options aren't provided by SQLite is part of what makes it "lite". I say don't change a thing.

I could store anything with BLOBs, but then
I'd lose the ability to use SORT.

Simple. Don't use BLOBs for character data; that is what Unicode text is for.

Storing only one type of text data also seems to go against the idea of SQLite's "manifest typing" where "datatype is associated with the data itself, not with its container."

That is still true. But the "manifest typing" is kept relatively simple by the single-encoding description. The choices are simply: character, integer, decimal, binary, null. No reason to break down 'character' further.


Here the datatype of particular piece of text is associated with not just a column type, but a fixed datatype defined for the entire database. SQLite should store knowledge about the text encoding of each value. I guess there would also need to be a way to differentiate between 8-bit and 16-bit string literals since SQLite doesn't have column types, and efficiently (without conversion) insert or query for 8-bit and 16-bit values in a single statement.

What probably would work best is to pick the single UTF format that matches what the rest of your program uses internally; if your program uses multiple types, then either change it to standardize on one, or put conversion routines in the few spots that need it next to the database access code. If it were up to me (and for my stuff, it is), I would pick a single encoding, UTF-8, and use just it through my entire program.


-- Darren Duncan

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to