On Sun, May 10, 2009 at 4:46 AM, John Machin <sjmac...@lexicon.net> wrote:
> On 10/05/2009 3:29 PM, kalyan....@aol.in wrote:
>> Hi,
>>
>> I am a newbie to sqlite.
>> I want to create a sqlite database in Linux OS.
>>
>> The data that I wish to put in the table contains both ASCII and UTF-16 
>> encoded strings.
>> For eg. File and directory names are UTF-16 where the URL, date time 
>> information about the files are in ASCII.
>>
>> How do I create a table that holds both ASCII and UTF-16 strings.
>>
>> The documentation says that TEXT datatype can hold both the encodings,
>> if i do a sqlite3_open -> TEXT is only ASCII, UTF-8
>> if i do a sqlite3_open16 -> TEXT is only UTF-16
>
> AFAICT ... and I am by no means omniscient :-) ... the story is
> something like this:
>
> There are THREE possibilities for how the data is declared to be stored.
> They are: UTF-8, UTF-16LE, UTF-16BE. You must choose ONE of the three.
>
> The choice is made when a database is created and is applied to the
> whole database. There is no way of making a choice per table.
>
> When you subsequently feed data into SQLite, you have two [1] choices:
> (a) you say that you are supplying bytes that are encoded in UTF-8
> (b) you say that you are supplying wide chars that are encoded in the
> UTF-16xE that is appropriate to the endianness of the machine that you
> are running on.
>
> If the choice differs from that recorded in the database, SQLite will
> attempt to convert your data for you.
>
> If the declaration(s) that you make implicitly is/are untrue e.g. your
> data is encoded in ISO 8859-xx but you say that it is encoded in UTF-8,
> you may not get any error message. SQLite3 does little checking.
>
> Which choice you make for database storage would depend on two things:
>
> (1) the time required for any transcoding (1.1) by SQLite automatically
> as described above, and/or (1.2) by you to convert raw input into the
> encoding that you will use to feed your data to SQLite
>
> (2) the resultant size of the database:
> [following assumes all of your data is in the range U+0000 to U+FFFF]
>
> If stored in UTF-16xE, each character will occupy 2 bytes.
>
> If stored in UTF-8, the bytes per char is variable:
>
> U+0000 to U+007F (ASCII) -> 1 byte per char
> U+0080 to U+07FF (covers e.g. accented-and-otherwise-adorned Latin
> letters, Greek, Cyrillic, Hebrew, Arabic) -> 2 bytes per char
> U+0800 to U+FFFF (covers e.g. Devanagari and other Indian languages,
> Chinese, Japanese, Korean) -> 3 bytes per char
>
> So it's impossible to say whether UTF-8 would be better/same/worse
> without examining your actual data. You could write a script that loaded
> a typical data mix into a database, and run it with the database
> encoding specified as either UTF-8 or UTF-16xE and compare the resultant
> sizes (and time difference).
>
> [1]: If you are using a wrapper like e.g the Python sqlite3 module, your
> choices may be different. Possibilities include: you supply a unicode
> object or you supply a str object (8-bit characters) plus an encoding
> (which may default to UTF-8) and the wrapper will do whatever is necessary.

This is a really nice explanation, worthy of being enshrined in the wiki.


>
> HTH,
> John
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
Puneet Kishor http://www.punkish.org/
Carbon Model http://carbonmodel.org/
Charter Member, Open Source Geospatial Foundation http://www.osgeo.org/
Science Commons Fellow, Geospatial Data http://sciencecommons.org
Nelson Institute, UW-Madison http://www.nelson.wisc.edu/
-----------------------------------------------------------------------
collaborate, communicate, compete
=======================================================================
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to