[10-04-2005 11:36, NÃdaski DÃvid escreveu]
I've been trying to get sqlite to work with characters independent to my
codepage, and I faced the following problem in my application: when using
insert, update or select to manage strings in the database, from the app. side
everything went fine - I've inserted locale-specific characters, retrieved
them, and they've displayed correctly.
However, when I've tried to compare two strings, where one was from the app, the other
from the database, I realized that it was impossible to do so, because in the db I had
all sorts of weird characters in place of characters with flying accent (Ã, Ã,
etc.).
I tried using setlocale, but that didn't seem to work.
It's a common problem. SQLite works with UTF-8 (or UCS-2, the Unicode
function variants) strings, so you need to convert any given SQL input
to UTF-8 before executing it (or use the Unicode equivalents, off
course).
In UTF-8, any "character" with the 7th bit set (128-255) is considered
the prefix of a multi-byte sequence, so if you pass one of these, like
the â (Euro) symbol, or any accented character, it will be parsed as the
beginning of a multi-byte sequence (that in theory could go to 6-byte
sequences, iirc).
Weird thing is, when using the sqlite3.exe, everything seems to be working fine!
It's only your impression, in reality even the sqlite3 program doesn't
do it right (it passes any string given without doing the conversion to
UTF-8).
A simple test to prove the "bug" (v3.2.1, tested in windows and linux
on a non UTF-8 console), using the Euro symbol:
sqlite> SELECT substr("The.â.Symbol.(Euro)",3,5);
e.â.Sy
[It should have returned 5 chars, but returns 6]
sqlite> SELECT substr("The.â.Symbol.(Euro)",4,5);
.â.Sym
[The same as above]
sqlite> SELECT substr("The.â.Symbol.(Euro)",5,5);
.Symb
[Returns the correct length, but wrong string]
sqlite> SELECT substr("The.â.Symbol.(Euro)",6,5);
Symbo
[The same as above]
Another (maybe) more common string, using the 'copyright' (c) symbol
(#169 on ISO-8859-1):
sqlite> SELECT substr("Â.Nuno.Lucas",1,5);
.Nuno
['Â.' is not a valid UTF-8 sequence, so it ignores it]
sqlite> SELECT length("Â.Nuno.Lucas");
11
[It should have returned 12]
Note that if you just do a SELECT "Â.Nuno.Lucas" it all seems ok,
because it just returns the string given, without processing it,
as substr() is forced to do. But it affects other functions, like
length().
I understand the reasons for not handling this as it should in SQLite:
too much internationalization (and probably not easily portable) code
would be needed. One thing not much desired in SQLite.
Hope you understand the issue now. It can be easily overcome by always
using the Unicode versions of the SQLite API (at the cost of a bigger
database in some cases, less performance in others, depending on how you
created the database).
Under windows you can use MultiByteToWideChar and WideCharToMultiByte to
do the conversions for you (just use CP_ACP - CP_OEM in DOS - to unicode
and then from unicode to CP_UTF8). I'm not sure if this works for those
systems not using ANSI, like korean or so (I know this doesn't work for
Windows CE, as it can't convert to UTF-8, at least the 3.0 version).
For other systems (and also including a windows port), i know there is
the iconv library, but never used it.
Regards,
~Nuno Lucas