-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Zbigniew Baniewski wrote:
> How one should handle this? SQLite has UTF-8 by default. 

You seem to doubt being all Unicode is a good thing :-)  Read this
http://www.joelonsoftware.com/articles/Unicode.html

> What C-function (Linux) could be considered as most convenient? Perhaps
> there's a doc with explanation (in the context of SQLite-usage)?

SQLite does not include conversion between arbitrary non-Unicode
encodings and Unicode.  (It does include conversion between the UTF-8
and UTF-16 encodings of Unicode.)

If you just want the same bytes out that you put in, then use blobs in
SQLite.  If you think your bytes are actually strings, then reread the
link above :-)

To do the conversion within your code you should use iconv
http://en.wikipedia.org/wiki/Iconv
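
As a rough sketch of what that looks like in C on Linux: the helper
below is hypothetical (the name to_utf8 is mine, it is not part of iconv
or SQLite) and assumes you already know the name of the source encoding.
It only uses the standard iconv_open/iconv/iconv_close calls and grows
the output buffer whenever iconv reports E2BIG.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert `len` bytes of `in`, which are in the
 * encoding named `from`, into a NUL-terminated UTF-8 string.
 * Returns a malloc()ed buffer the caller must free(), or NULL on error
 * (unknown encoding name, invalid or truncated input, out of memory). */
char *to_utf8(const char *from, const char *in, size_t len)
{
    iconv_t cd = iconv_open("UTF-8", from);   /* (to, from) order */
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = len * 2 + 16;                /* initial guess, grown on demand */
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *)in;                   /* iconv() takes char **, input is not modified */
    size_t inleft = len;
    size_t used = 0;

    while (inleft > 0) {
        char *outp = out + used;
        size_t outleft = cap - used - 1;      /* keep one byte for the NUL */
        size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
        used = (size_t)(outp - out);
        if (rc != (size_t)-1)
            break;                            /* all input consumed */
        if (errno != E2BIG) {                 /* EILSEQ or EINVAL: bad input */
            free(out);
            iconv_close(cd);
            return NULL;
        }
        cap *= 2;                             /* output buffer full: grow and retry */
        char *bigger = realloc(out, cap);
        if (bigger == NULL) {
            free(out);
            iconv_close(cd);
            return NULL;
        }
        out = bigger;
    }

    out[used] = '\0';
    iconv_close(cd);
    return out;
}
```

For example, to_utf8("ISO-8859-1", buf, n) gives you a string you can
hand to sqlite3_bind_text().  Which encoding names are accepted depends
on the iconv implementation installed; "iconv -l" lists them.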

If you want to manipulate the text (once it is in Unicode), such as
upper/lower casing or sorting, then you need to know about locales.
This is because the exact same sequence of characters sorts, upper
cases, lower cases etc. differently depending on where you are.  As an
example, Turkic languages have both a dotted and a dotless letter i,
German has ß, which sorts like ss and upper cases to SS, and various
accents sort differently in different European countries.
Fortunately there is a library you can ask to do the right
locale-specific thing
http://en.wikipedia.org/wiki/International_Components_for_Unicode
A default SQLite compilation only handles case for the 26 letters of
the ASCII alphabet.  If you enable ICU with SQLite then you get good
stuff
http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt  (*)

Your Linux distribution almost certainly has the iconv binary and
libraries already installed.  ICU should be installed already or easily
installable via your package manager.

(*) Viewing that page is a good example of how messy this gets.  The
actual README.txt is encoded in UTF-8.  However, the cvstrac web server
tells the browser that it is encoded as ANSI_X3.4-1968 (a fancy name for
ASCII).  If you scroll to just before section 1.2 you can see the
Turkish lower case dotless i being mangled.  I like to test using the
front page of http://Wikipedia.org as it contains the names of a wide
variety of languages in those languages and hence uses a wide sampling
of Unicode characters.

In summary, never confuse bytes with strings (which C sadly treats as
the same thing).  Either always use bytes (and SQLite blobs) for
everything or use strings (and SQLite strings) for everything.  If you
take the latter approach and have to deal with external input/output
then you must know what encodings are being used, and it is best to
convert to Unicode as early as possible on input and as late as
possible on output.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFI9qkcmOOfHg372QQRAkO+AJ9rXxdLkyjgZGYUS+W3RMmOJel0ZgCg44e2
7FpA+U2cn0DusHMSR0ZEl8Q=
=a9T7
-----END PGP SIGNATURE-----
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
