Re: [sqlite] Using ICU case folding support
On Wed, Mar 19, 2014 at 8:36 AM, Alex Loukissas wrote: > On Wed, Mar 19, 2014 at 8:32 AM, Dan Kennedy wrote: > >> On 03/19/2014 09:44 PM, Aleksey Tulinov wrote: >> >>> >>> I've created test database: >>> >>> >>> sqlite> CREATE TABLE test (x COLLATE NOCASE); >>> sqlite> INSERT INTO test VALUES ('s'); >>> sqlite> INSERT INTO test VALUES ('S'); >>> sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic >>> sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic >>> >>> Then created index in ICU-disabled SQLite version: >>> >>> sqlite> SELECT 'ё' LIKE 'Ё'; >>> 0 >>> sqlite> .schema >>> >>> CREATE TABLE test (x COLLATE NOCASE); >>> sqlite> CREATE INDEX idx_x ON test (x); >>> >>> Then tried it in ICU-enabled SQLite version: >>> >> >> ICU-enabled or nunicode-enabled? >> >> ICU does not modify the behaviour of existing collation sequences. So >> there is no problem there (apart from the original problem - that the ICU >> extension does not provide anything that can be used to create a >> case-independent collation sequence). >> >> An index is a sorted list. And queries like this: >> >> >> sqlite> SELECT * FROM test WHERE x = 'ё'; >>> >> >> do a binary search of that list to find keys equal to 'ё'. But to do a >> binary search of an ordered list, you need to be using a comparison >> function compatible with that used to sort the list in the first place. Say >> I have the following list, sorted using a unicode aware NOCASE collation: >> >> (Ä, ä, Ë, ë, f) >> >> Also assume that all characters in the list have umlauts adorning them. >> >> Then I open the db using regular SQLite and try searching for "ä". >> Obviously the binary search fails - the first comparison compares the seek >> key "ä" with "Ë", incorrectly concludes that the key "ä" is larger than "Ë" >> and goes on to search the right-hand side of the index. The search fails. >> >> Then say this search is part of a delete operation to remove a row from >> the database. The table row itself might be removed correctly, but the >> corresponding index key is not - because a search fails to find it. At that >> point you have an inconsistent table and index. A corrupt database. >> >> In the future, we might have a similar problem in FTS. FTS offers a >> home-grown tokenizer named "unicode61" that folds case in the same >> unicode-aware way as nunicode. If the unicode standard changes to define >> more pairs of case equivalent characters, we will not be able simply >> upgrade "unicode61". For the same reasons - modifying the comparison >> function creates an incompatible system. Instead, we would name it >> "unicode62" or similar, to be sure that databases created using the old >> version continue to use it. >> >> >> Dan. >> >> ___ >> sqlite-users mailing list >> sqlite-users@sqlite.org >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users >> > > Thanks everyone for your comments. IIUC, the correct way of going about > what I want to do is to use BINARY collation on the column I'm interested > in and when I want to do unicode-aware case-insensitive lookups, they > should look something like SELECT * FROM table WHERE LOWER(col_name) = > LOWER(key), correct? It seems like with ICU support, LOWER( ) will call > u_foldCase under the covers, which is what I want. > > Alex > > Actually, it looks like the ICU extension doesn't provide what I want here and the preferred way forward is to define my own collation sequence, as described in http://www.sqlite.org/datatype3.html and use direct calls to ICU caseCompare (which does case folding underneath). I would suggest that a new collation sequence is added to sqlite with ICU extension enabled (e.g. ICU_NOCASE) - will be very useful IMHO. Is there a proper way of filing such feature requests? Thanks! Alex -- Alex Loukissas www.maginatics.com ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Using ICU case folding support
On 03/19/2014 05:32 PM, Dan Kennedy wrote: home-grown tokenizer named "unicode61" that folds case in the same unicode-aware way as nunicode. If the unicode standard changes to define more pairs of case equivalent characters, we will not be able simply upgrade "unicode61". For the same reasons - modifying the comparison function creates an incompatible system. Instead, we would name it "unicode62" or similar, to be sure that databases created using the old version continue to use it. Thank you for the detailed answer. Is assume working this around with custom collation as in "SELECT ... x = 'ё' COLLATE NOCASE_U61" will imply performance penalty regarding index search, but custom case-insensitive collation assigned to table column, "CREATE TABLE test (x COLLATE NOCASE_U61)", will work as expected. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Using ICU case folding support
On 19 Mar 2014, at 3:36pm, Alex Loukissas wrote: > Thanks everyone for your comments. IIUC, the correct way of going about > what I want to do is to use BINARY collation on the column I'm interested > in and when I want to do unicode-aware case-insensitive lookups, they > should look something like SELECT * FROM table WHERE LOWER(col_name) = > LOWER(key), correct? It seems like with ICU support, LOWER( ) will call > u_foldCase under the covers, which is what I want. This solution suggests that a good compromise for handling Unicode is a hashing function. It would be equivalent to NOCASE, but for Unicode characters, and instead of just removing case it would also remove accents and various other 'hints'. How it would handle the various unicode characters which have no equivalent in the any alphabet, I have no idea. It would be reasonable for all 'Right Arrow' characters to have the same hash, but how much data about Unicode would it take to do that ? Maybe it should just leave all non-alphabetic characters as they are. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Using ICU case folding support
On Wed, Mar 19, 2014 at 4:36 PM, Alex Loukissas wrote: > On Wed, Mar 19, 2014 at 8:32 AM, Dan Kennedy wrote: > Thanks everyone for your comments. IIUC, the correct way of going about > what I want to do is to use BINARY collation on the column I'm interested > in and when I want to do unicode-aware case-insensitive lookups, they > should look something like SELECT * FROM table WHERE LOWER(col_name) = > LOWER(key), correct? It seems like with ICU support, LOWER( ) will call > u_foldCase under the covers, which is what I want. But if you use one collation for the column, and another for the query, don't you implicitly disable indexes via incompatible collations? Is SQLite even aware of the mismatch? --DD ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Using ICU case folding support
On Wed, Mar 19, 2014 at 8:32 AM, Dan Kennedy wrote: > On 03/19/2014 09:44 PM, Aleksey Tulinov wrote: > >> >> I've created test database: >> >> >> sqlite> CREATE TABLE test (x COLLATE NOCASE); >> sqlite> INSERT INTO test VALUES ('s'); >> sqlite> INSERT INTO test VALUES ('S'); >> sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic >> sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic >> >> Then created index in ICU-disabled SQLite version: >> >> sqlite> SELECT 'ё' LIKE 'Ё'; >> 0 >> sqlite> .schema >> >> CREATE TABLE test (x COLLATE NOCASE); >> sqlite> CREATE INDEX idx_x ON test (x); >> >> Then tried it in ICU-enabled SQLite version: >> > > ICU-enabled or nunicode-enabled? > > ICU does not modify the behaviour of existing collation sequences. So > there is no problem there (apart from the original problem - that the ICU > extension does not provide anything that can be used to create a > case-independent collation sequence). > > An index is a sorted list. And queries like this: > > > sqlite> SELECT * FROM test WHERE x = 'ё'; >> > > do a binary search of that list to find keys equal to 'ё'. But to do a > binary search of an ordered list, you need to be using a comparison > function compatible with that used to sort the list in the first place. Say > I have the following list, sorted using a unicode aware NOCASE collation: > > (Ä, ä, Ë, ë, f) > > Also assume that all characters in the list have umlauts adorning them. > > Then I open the db using regular SQLite and try searching for "ä". > Obviously the binary search fails - the first comparison compares the seek > key "ä" with "Ë", incorrectly concludes that the key "ä" is larger than "Ë" > and goes on to search the right-hand side of the index. The search fails. > > Then say this search is part of a delete operation to remove a row from > the database. The table row itself might be removed correctly, but the > corresponding index key is not - because a search fails to find it. At that > point you have an inconsistent table and index. A corrupt database. > > In the future, we might have a similar problem in FTS. FTS offers a > home-grown tokenizer named "unicode61" that folds case in the same > unicode-aware way as nunicode. If the unicode standard changes to define > more pairs of case equivalent characters, we will not be able simply > upgrade "unicode61". For the same reasons - modifying the comparison > function creates an incompatible system. Instead, we would name it > "unicode62" or similar, to be sure that databases created using the old > version continue to use it. > > > Dan. > > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > Thanks everyone for your comments. IIUC, the correct way of going about what I want to do is to use BINARY collation on the column I'm interested in and when I want to do unicode-aware case-insensitive lookups, they should look something like SELECT * FROM table WHERE LOWER(col_name) = LOWER(key), correct? It seems like with ICU support, LOWER( ) will call u_foldCase under the covers, which is what I want. Alex -- Alex Loukissas www.maginatics.com ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Using ICU case folding support
On 03/19/2014 09:44 PM, Aleksey Tulinov wrote: I've created test database: sqlite> CREATE TABLE test (x COLLATE NOCASE); sqlite> INSERT INTO test VALUES ('s'); sqlite> INSERT INTO test VALUES ('S'); sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic Then created index in ICU-disabled SQLite version: sqlite> SELECT 'ё' LIKE 'Ё'; 0 sqlite> .schema CREATE TABLE test (x COLLATE NOCASE); sqlite> CREATE INDEX idx_x ON test (x); Then tried it in ICU-enabled SQLite version: ICU-enabled or nunicode-enabled? ICU does not modify the behaviour of existing collation sequences. So there is no problem there (apart from the original problem - that the ICU extension does not provide anything that can be used to create a case-independent collation sequence). An index is a sorted list. And queries like this: sqlite> SELECT * FROM test WHERE x = 'ё'; do a binary search of that list to find keys equal to 'ё'. But to do a binary search of an ordered list, you need to be using a comparison function compatible with that used to sort the list in the first place. Say I have the following list, sorted using a unicode aware NOCASE collation: (Ä, ä, Ë, ë, f) Also assume that all characters in the list have umlauts adorning them. Then I open the db using regular SQLite and try searching for "ä". Obviously the binary search fails - the first comparison compares the seek key "ä" with "Ë", incorrectly concludes that the key "ä" is larger than "Ë" and goes on to search the right-hand side of the index. The search fails. Then say this search is part of a delete operation to remove a row from the database. The table row itself might be removed correctly, but the corresponding index key is not - because a search fails to find it. At that point you have an inconsistent table and index. A corrupt database. In the future, we might have a similar problem in FTS. FTS offers a home-grown tokenizer named "unicode61" that folds case in the same unicode-aware way as nunicode. If the unicode standard changes to define more pairs of case equivalent characters, we will not be able simply upgrade "unicode61". For the same reasons - modifying the comparison function creates an incompatible system. Instead, we would name it "unicode62" or similar, to be sure that databases created using the old version continue to use it. Dan. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Using ICU case folding support
On 03/19/2014 07:55 PM, Aleksey Tulinov wrote: On 03/18/2014 10:02 PM, Alex Loukissas wrote: Alex, I suppose I can declare the column as BINARY and use LOWER( ) in my select statements. Browsing through the code though, I do see uses of u_foldCase in certain places, which leads me to believe that what I want may be already there. I'll try to unit test this. Correct me if i'm wrong, but i think i reproduced bug you've described in SQLite with ICU extension compiled in: sqlite> CREATE TABLE test (x COLLATE NOCASE); sqlite> INSERT INTO test VALUES ('s'); sqlite> INSERT INTO test VALUES ('S'); sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic sqlite> SELECT * FROM test WHERE x = 'ё'; ё sqlite> SELECT * FROM test WHERE x = 's'; s S This might be a point of interest for SQLite ICU extension developers. On the other hand, SQLite nunicode extension, which i developed, do that: sqlite> SELECT * FROM test WHERE x = 'ё'; ё sqlite> .load ./libnusqlite3.so sqlite> SELECT * FROM test WHERE x = 'ё'; ё Ё One problem with this sort of thing (overriding built-in collation sequences) is that you need to be careful never to accidentally use the wrong version. If you were to create an index using collation sequence "NOCASE" and populate the db using a stock SQLite, then query using a nunicode enhanced version, the results would be unpredictable. And running "PRAGMA integrity_check" would report corruption. If you were to write to the db using both a regular SQLite and a nunicode version the database would become corrupt in the sense that no version would be able to query it reliably. For that reason, the SQLite ICU extension does not automatically override the "NOCASE" collation sequence the same way it overrides the built-in upper() and lower() functions. If we were to add case-folding to ICU collations, we would have to use a different name. It does look like nunicode would solve the OP's problem nicely though. Dan. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Using ICU case folding support
On 03/18/2014 10:02 PM, Alex Loukissas wrote: Alex, I suppose I can declare the column as BINARY and use LOWER( ) in my select statements. Browsing through the code though, I do see uses of u_foldCase in certain places, which leads me to believe that what I want may be already there. I'll try to unit test this. Correct me if i'm wrong, but i think i reproduced bug you've described in SQLite with ICU extension compiled in: sqlite> CREATE TABLE test (x COLLATE NOCASE); sqlite> INSERT INTO test VALUES ('s'); sqlite> INSERT INTO test VALUES ('S'); sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic sqlite> SELECT * FROM test WHERE x = 'ё'; ё sqlite> SELECT * FROM test WHERE x = 's'; s S This might be a point of interest for SQLite ICU extension developers. On the other hand, SQLite nunicode extension, which i developed, do that: sqlite> SELECT * FROM test WHERE x = 'ё'; ё sqlite> .load ./libnusqlite3.so sqlite> SELECT * FROM test WHERE x = 'ё'; ё Ё You can try nunicode by following this link: https://bitbucket.org/alekseyt/nunicode/overview#markdown-header-sqlite3-extension Hope this helps. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Using ICU case folding support
On Tue, Mar 18, 2014 at 12:39 PM, Dan Kennedy wrote: > On 03/19/2014 01:22 AM, Alex Loukissas wrote: > >> Hello, >> >> I'm trying to use the ICU extension and it looks like what I want to do is >> not very clear from the documentation. I would like to switch from my >> current usage, where my column declaration is as follows: >> >> CREATE TABLE demo("name text not null COLLATE NOCASE"); >> >> As noted in the documentation, this is not Unicode-aware, which is why I >> turned to the ICU extension. What should be the equivalent usage where >> sqlite does case folding using ICU? >> > > I don't think there is a way to do that with the current SQLite ICU > extension. > > Dan. > I suppose I can declare the column as BINARY and use LOWER( ) in my select statements. Browsing through the code though, I do see uses of u_foldCase in certain places, which leads me to believe that what I want may be already there. I'll try to unit test this. > > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > -- Alex Loukissas www.maginatics.com ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Using ICU case folding support
On 03/19/2014 01:22 AM, Alex Loukissas wrote: Hello, I'm trying to use the ICU extension and it looks like what I want to do is not very clear from the documentation. I would like to switch from my current usage, where my column declaration is as follows: CREATE TABLE demo("name text not null COLLATE NOCASE"); As noted in the documentation, this is not Unicode-aware, which is why I turned to the ICU extension. What should be the equivalent usage where sqlite does case folding using ICU? I don't think there is a way to do that with the current SQLite ICU extension. Dan. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users