On Sun, May 10, 2009 at 4:46 AM, John Machin <sjmac...@lexicon.net> wrote: > On 10/05/2009 3:29 PM, kalyan....@aol.in wrote: >> Hi, >> >> I am a newbie to sqlite. >> I want to create a sqlite database in Linux OS. >> >> The data that I wish to put in the table contains both ASCII and UTF-16 >> encoded strings. >> For eg. File and directory names are UTF-16 where the URL, date time >> information about the files are in ASCII. >> >> How do I create a table that holds both ASCII and UTF-16 strings. >> >> The documentation says that TEXT datatype can hold both the encodings, >> if i do a sqlite3_open -> TEXT is only ASCII, UTF-8 >> if i do a sqlite3_open16 -> TEXT is only UTF-16 > > AFAICT ... and I am by no means omniscient :-) ... the story is > something like this: > > There are THREE possibilities for how the data is declared to be stored. > They are: UTF-8, UTF-16LE, UTF-16BE. You must choose ONE of the three. > > The choice is made when a database is created and is applied to the > whole database. There is no way of making a choice per table. > > When you subsequently feed data into SQLite, you have two [1] choices: > (a) you say that you are supplying bytes that are encoded in UTF-8 > (b) you say that you are supplying wide chars that are encoded in the > UTF-16xE that is appropriate to the endianness of the machine that you > are running on. > > If the choice differs from that recorded in the database, SQLite will > attempt to convert your data for you. > > If the declaration(s) that you make implicitly is/are untrue e.g. your > data is encoded in ISO 8859-xx but you say that it is encoded in UTF-8, > you may not get any error message. SQLite3 does little checking. > > Which choice you make for database storage would depend on two things: > > (1) the time required for any transcoding (1.1) by SQLite automatically > as described above, and/or (1.2) by you to convert raw input into the > encoding that you will use to feed your data to SQLite > > (2) the resultant size of the database: > [following assumes all of your data is in the range U+0000 to U+FFFF] > > If stored in UTF-16xE, each character will occupy 2 bytes. > > If stored in UTF-8, the bytes per char is variable: > > U+0000 to U+007F (ASCII) -> 1 byte per char > U+0080 to U+07FF (covers e.g. accented-and-otherwise-adorned Latin > letters, Greek, Cyrillic, Hebrew, Arabic) -> 2 bytes per char > U+0800 to U+FFFF (covers e.g. Devanagari and other Indian languages, > Chinese, Japanese, Korean) -> 3 bytes per char > > So it's impossible to say whether UTF-8 would be better/same/worse > without examining your actual data. You could write a script that loaded > a typical data mix into a database, and run it with the database > encoding specified as either UTF-8 or UTF-16xE and compare the resultant > sizes (and time difference). > > [1]: If you are using a wrapper like e.g the Python sqlite3 module, your > choices may be different. Possibilities include: you supply a unicode > object or you supply a str object (8-bit characters) plus an encoding > (which may default to UTF-8) and the wrapper will do whatever is necessary.
This is a really nice explanation, worthy of being enshrined in the wiki. > > HTH, > John > _______________________________________________ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > -- Puneet Kishor http://www.punkish.org/ Carbon Model http://carbonmodel.org/ Charter Member, Open Source Geospatial Foundation http://www.osgeo.org/ Science Commons Fellow, Geospatial Data http://sciencecommons.org Nelson Institute, UW-Madison http://www.nelson.wisc.edu/ ----------------------------------------------------------------------- collaborate, communicate, compete ======================================================================= _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users