Re: [sqlite] Using ICU case folding support

Alex Loukissas Wed, 19 Mar 2014 08:37:20 -0700

On Wed, Mar 19, 2014 at 8:32 AM, Dan Kennedy <danielk1...@gmail.com> wrote:


> On 03/19/2014 09:44 PM, Aleksey Tulinov wrote:
>
>>
>> I've created test database:
>>
>>
>> sqlite> CREATE TABLE test (x COLLATE NOCASE);
>> sqlite> INSERT INTO test VALUES ('s');
>> sqlite> INSERT INTO test VALUES ('S');
>> sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic
>> sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic
>>
>> Then created index in ICU-disabled SQLite version:
>>
>> sqlite> SELECT 'ё' LIKE 'Ё';
>> 0
>> sqlite> .schema
>>
>> CREATE TABLE test (x COLLATE NOCASE);
>> sqlite> CREATE INDEX idx_x ON test (x);
>>
>> Then tried it in ICU-enabled SQLite version:
>>
>
> ICU-enabled or nunicode-enabled?
>
> ICU does not modify the behaviour of existing collation sequences. So
> there is no problem there (apart from the original problem - that the ICU
> extension does not provide anything that can be used to create a
> case-independent collation sequence).
>
> An index is a sorted list. And queries like this:
>
>
>  sqlite> SELECT * FROM test WHERE x = 'ё';
>>
>
> do a binary search of that list to find keys equal to 'ё'. But to do a
> binary search of an ordered list, you need to be using a comparison
> function compatible with that used to sort the list in the first place. Say
> I have the following list, sorted using a unicode aware NOCASE collation:
>
>   (Ä, ä, Ë, ë, f)
>
> Also assume that all characters in the list have umlauts adorning them.
>
> Then I open the db using regular SQLite and try searching for "ä".
> Obviously the binary search fails - the first comparison compares the seek
> key "ä" with "Ë", incorrectly concludes that the key "ä" is larger than "Ë"
> and goes on to search the right-hand side of the index. The search fails.
>
> Then say this search is part of a delete operation to remove a row from
> the database. The table row itself might be removed correctly, but the
> corresponding index key is not - because a search fails to find it. At that
> point you have an inconsistent table and index. A corrupt database.
>
> In the future, we might have a similar problem in FTS. FTS offers a
> home-grown tokenizer named "unicode61" that folds case in the same
> unicode-aware way as nunicode. If the unicode standard changes to define
> more pairs of case equivalent characters, we will not be able simply
> upgrade "unicode61". For the same reasons - modifying the comparison
> function creates an incompatible system. Instead, we would name it
> "unicode62" or similar, to be sure that databases created using the old
> version continue to use it.
>
>
> Dan.
>
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>

Thanks everyone for your comments. IIUC, the correct way of going about
what I want to do is to use BINARY collation on the column I'm interested
in and when I want to do unicode-aware case-insensitive lookups, they
should look something like SELECT * FROM table WHERE LOWER(col_name) =
LOWER(key), correct? It seems like with ICU support, LOWER( ) will call
u_foldCase under the covers, which is what I want.

Alex


-- 
Alex Loukissas

www.maginatics.com
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Using ICU case folding support

Reply via email to