On Wed, Mar 19, 2014 at 8:32 AM, Dan Kennedy <danielk1...@gmail.com> wrote:
> On 03/19/2014 09:44 PM, Aleksey Tulinov wrote: > >> >> I've created test database: >> >> >> sqlite> CREATE TABLE test (x COLLATE NOCASE); >> sqlite> INSERT INTO test VALUES ('s'); >> sqlite> INSERT INTO test VALUES ('S'); >> sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic >> sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic >> >> Then created index in ICU-disabled SQLite version: >> >> sqlite> SELECT 'ё' LIKE 'Ё'; >> 0 >> sqlite> .schema >> >> CREATE TABLE test (x COLLATE NOCASE); >> sqlite> CREATE INDEX idx_x ON test (x); >> >> Then tried it in ICU-enabled SQLite version: >> > > ICU-enabled or nunicode-enabled? > > ICU does not modify the behaviour of existing collation sequences. So > there is no problem there (apart from the original problem - that the ICU > extension does not provide anything that can be used to create a > case-independent collation sequence). > > An index is a sorted list. And queries like this: > > > sqlite> SELECT * FROM test WHERE x = 'ё'; >> > > do a binary search of that list to find keys equal to 'ё'. But to do a > binary search of an ordered list, you need to be using a comparison > function compatible with that used to sort the list in the first place. Say > I have the following list, sorted using a unicode aware NOCASE collation: > > (Ä, ä, Ë, ë, f) > > Also assume that all characters in the list have umlauts adorning them. > > Then I open the db using regular SQLite and try searching for "ä". > Obviously the binary search fails - the first comparison compares the seek > key "ä" with "Ë", incorrectly concludes that the key "ä" is larger than "Ë" > and goes on to search the right-hand side of the index. The search fails. > > Then say this search is part of a delete operation to remove a row from > the database. The table row itself might be removed correctly, but the > corresponding index key is not - because a search fails to find it. At that > point you have an inconsistent table and index. A corrupt database. > > In the future, we might have a similar problem in FTS. FTS offers a > home-grown tokenizer named "unicode61" that folds case in the same > unicode-aware way as nunicode. If the unicode standard changes to define > more pairs of case equivalent characters, we will not be able simply > upgrade "unicode61". For the same reasons - modifying the comparison > function creates an incompatible system. Instead, we would name it > "unicode62" or similar, to be sure that databases created using the old > version continue to use it. > > > Dan. > > _______________________________________________ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > Thanks everyone for your comments. IIUC, the correct way of going about what I want to do is to use BINARY collation on the column I'm interested in and when I want to do unicode-aware case-insensitive lookups, they should look something like SELECT * FROM table WHERE LOWER(col_name) = LOWER(key), correct? It seems like with ICU support, LOWER( ) will call u_foldCase under the covers, which is what I want. Alex -- Alex Loukissas www.maginatics.com _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users