subject:"Re\: \[sqlite\] Using ICU case folding support"

Re: [sqlite] Using ICU case folding support

2014-03-19 Thread Alex Loukissas

On Wed, Mar 19, 2014 at 8:36 AM, Alex Loukissas  wrote:

> On Wed, Mar 19, 2014 at 8:32 AM, Dan Kennedy wrote:
>
>> On 03/19/2014 09:44 PM, Aleksey Tulinov wrote:
>>
>>>
>>> I've created test database:
>>>
>>>
>>> sqlite> CREATE TABLE test (x COLLATE NOCASE);
>>> sqlite> INSERT INTO test VALUES ('s');
>>> sqlite> INSERT INTO test VALUES ('S');
>>> sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic
>>> sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic
>>>
>>> Then created index in ICU-disabled SQLite version:
>>>
>>> sqlite> SELECT 'ё' LIKE 'Ё';
>>> 0
>>> sqlite> .schema
>>>
>>> CREATE TABLE test (x COLLATE NOCASE);
>>> sqlite> CREATE INDEX idx_x ON test (x);
>>>
>>> Then tried it in ICU-enabled SQLite version:
>>>
>>
>> ICU-enabled or nunicode-enabled?
>>
>> ICU does not modify the behaviour of existing collation sequences. So
>> there is no problem there (apart from the original problem - that the ICU
>> extension does not provide anything that can be used to create a
>> case-independent collation sequence).
>>
>> An index is a sorted list. And queries like this:
>>
>>
>>  sqlite> SELECT * FROM test WHERE x = 'ё';
>>>
>>
>> do a binary search of that list to find keys equal to 'ё'. But to do a
>> binary search of an ordered list, you need to be using a comparison
>> function compatible with that used to sort the list in the first place. Say
>> I have the following list, sorted using a unicode aware NOCASE collation:
>>
>>   (Ä, ä, Ë, ë, f)
>>
>> Also assume that all characters in the list have umlauts adorning them.
>>
>> Then I open the db using regular SQLite and try searching for "ä".
>> Obviously the binary search fails - the first comparison compares the seek
>> key "ä" with "Ë", incorrectly concludes that the key "ä" is larger than "Ë"
>> and goes on to search the right-hand side of the index. The search fails.
>>
>> Then say this search is part of a delete operation to remove a row from
>> the database. The table row itself might be removed correctly, but the
>> corresponding index key is not - because a search fails to find it. At that
>> point you have an inconsistent table and index. A corrupt database.
>>
>> In the future, we might have a similar problem in FTS. FTS offers a
>> home-grown tokenizer named "unicode61" that folds case in the same
>> unicode-aware way as nunicode. If the unicode standard changes to define
>> more pairs of case equivalent characters, we will not be able simply
>> upgrade "unicode61". For the same reasons - modifying the comparison
>> function creates an incompatible system. Instead, we would name it
>> "unicode62" or similar, to be sure that databases created using the old
>> version continue to use it.
>>
>>
>> Dan.
>>
>> ___
>> sqlite-users mailing list
>> sqlite-users@sqlite.org
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>
>
> Thanks everyone for your comments. IIUC, the correct way of going about
> what I want to do is to use BINARY collation on the column I'm interested
> in and when I want to do unicode-aware case-insensitive lookups, they
> should look something like SELECT * FROM table WHERE LOWER(col_name) =
> LOWER(key), correct? It seems like with ICU support, LOWER( ) will call
> u_foldCase under the covers, which is what I want.
>
> Alex
>
>
Actually, it looks like the ICU extension doesn't provide what I want here
and the preferred way forward is to define my own collation sequence, as
described in http://www.sqlite.org/datatype3.html and use direct calls to
ICU caseCompare (which does case folding underneath).

I would suggest that a new collation sequence is added to sqlite with ICU
extension enabled (e.g. ICU_NOCASE) - will be very useful IMHO. Is there a
proper way of filing such feature requests?

Thanks!
Alex

-- 
Alex Loukissas

www.maginatics.com
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Using ICU case folding support

2014-03-19 Thread Aleksey Tulinov


On 03/19/2014 05:32 PM, Dan Kennedy wrote:


home-grown tokenizer named "unicode61" that folds case in the same
unicode-aware way as nunicode. If the unicode standard changes to define
more pairs of case equivalent characters, we will not be able simply
upgrade "unicode61". For the same reasons - modifying the comparison
function creates an incompatible system. Instead, we would name it
"unicode62" or similar, to be sure that databases created using the old
version continue to use it.



Thank you for the detailed answer. Is assume working this around with 
custom collation as in "SELECT ... x = 'ё' COLLATE NOCASE_U61" will 
imply performance penalty regarding index search, but custom 
case-insensitive collation assigned to table column, "CREATE TABLE test 
(x COLLATE NOCASE_U61)", will work as expected.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Using ICU case folding support

2014-03-19 Thread Simon Slavin

On 19 Mar 2014, at 3:36pm, Alex Loukissas  wrote:

> Thanks everyone for your comments. IIUC, the correct way of going about
> what I want to do is to use BINARY collation on the column I'm interested
> in and when I want to do unicode-aware case-insensitive lookups, they
> should look something like SELECT * FROM table WHERE LOWER(col_name) =
> LOWER(key), correct? It seems like with ICU support, LOWER( ) will call
> u_foldCase under the covers, which is what I want.

This solution suggests that a good compromise for handling Unicode is a hashing 
function.  It would be equivalent to NOCASE, but for Unicode characters, and 
instead of just removing case it would also remove accents and various other 
'hints'.

How it would handle the various unicode characters which have no equivalent in 
the any alphabet, I have no idea.  It would be reasonable for all 'Right Arrow' 
characters to have the same hash, but how much data about Unicode would it take 
to do that ?  Maybe it should just leave all non-alphabetic characters as they 
are.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Using ICU case folding support

2014-03-19 Thread Dominique Devienne

On Wed, Mar 19, 2014 at 4:36 PM, Alex Loukissas  wrote:
> On Wed, Mar 19, 2014 at 8:32 AM, Dan Kennedy  wrote:
> Thanks everyone for your comments. IIUC, the correct way of going about
> what I want to do is to use BINARY collation on the column I'm interested
> in and when I want to do unicode-aware case-insensitive lookups, they
> should look something like SELECT * FROM table WHERE LOWER(col_name) =
> LOWER(key), correct? It seems like with ICU support, LOWER( ) will call
> u_foldCase under the covers, which is what I want.

But if you use one collation for the column, and another for the
query, don't you implicitly disable indexes via incompatible
collations? Is SQLite even aware of the mismatch? --DD
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Using ICU case folding support

2014-03-19 Thread Alex Loukissas

On Wed, Mar 19, 2014 at 8:32 AM, Dan Kennedy  wrote:

> On 03/19/2014 09:44 PM, Aleksey Tulinov wrote:
>
>>
>> I've created test database:
>>
>>
>> sqlite> CREATE TABLE test (x COLLATE NOCASE);
>> sqlite> INSERT INTO test VALUES ('s');
>> sqlite> INSERT INTO test VALUES ('S');
>> sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic
>> sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic
>>
>> Then created index in ICU-disabled SQLite version:
>>
>> sqlite> SELECT 'ё' LIKE 'Ё';
>> 0
>> sqlite> .schema
>>
>> CREATE TABLE test (x COLLATE NOCASE);
>> sqlite> CREATE INDEX idx_x ON test (x);
>>
>> Then tried it in ICU-enabled SQLite version:
>>
>
> ICU-enabled or nunicode-enabled?
>
> ICU does not modify the behaviour of existing collation sequences. So
> there is no problem there (apart from the original problem - that the ICU
> extension does not provide anything that can be used to create a
> case-independent collation sequence).
>
> An index is a sorted list. And queries like this:
>
>
>  sqlite> SELECT * FROM test WHERE x = 'ё';
>>
>
> do a binary search of that list to find keys equal to 'ё'. But to do a
> binary search of an ordered list, you need to be using a comparison
> function compatible with that used to sort the list in the first place. Say
> I have the following list, sorted using a unicode aware NOCASE collation:
>
>   (Ä, ä, Ë, ë, f)
>
> Also assume that all characters in the list have umlauts adorning them.
>
> Then I open the db using regular SQLite and try searching for "ä".
> Obviously the binary search fails - the first comparison compares the seek
> key "ä" with "Ë", incorrectly concludes that the key "ä" is larger than "Ë"
> and goes on to search the right-hand side of the index. The search fails.
>
> Then say this search is part of a delete operation to remove a row from
> the database. The table row itself might be removed correctly, but the
> corresponding index key is not - because a search fails to find it. At that
> point you have an inconsistent table and index. A corrupt database.
>
> In the future, we might have a similar problem in FTS. FTS offers a
> home-grown tokenizer named "unicode61" that folds case in the same
> unicode-aware way as nunicode. If the unicode standard changes to define
> more pairs of case equivalent characters, we will not be able simply
> upgrade "unicode61". For the same reasons - modifying the comparison
> function creates an incompatible system. Instead, we would name it
> "unicode62" or similar, to be sure that databases created using the old
> version continue to use it.
>
>
> Dan.
>
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>

Thanks everyone for your comments. IIUC, the correct way of going about
what I want to do is to use BINARY collation on the column I'm interested
in and when I want to do unicode-aware case-insensitive lookups, they
should look something like SELECT * FROM table WHERE LOWER(col_name) =
LOWER(key), correct? It seems like with ICU support, LOWER( ) will call
u_foldCase under the covers, which is what I want.

Alex


-- 
Alex Loukissas

www.maginatics.com
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Using ICU case folding support

2014-03-19 Thread Dan Kennedy


On 03/19/2014 09:44 PM, Aleksey Tulinov wrote:


I've created test database:

sqlite> CREATE TABLE test (x COLLATE NOCASE);
sqlite> INSERT INTO test VALUES ('s');
sqlite> INSERT INTO test VALUES ('S');
sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic
sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic

Then created index in ICU-disabled SQLite version:

sqlite> SELECT 'ё' LIKE 'Ё';
0
sqlite> .schema
CREATE TABLE test (x COLLATE NOCASE);
sqlite> CREATE INDEX idx_x ON test (x);

Then tried it in ICU-enabled SQLite version:


ICU-enabled or nunicode-enabled?

ICU does not modify the behaviour of existing collation sequences. So 
there is no problem there (apart from the original problem - that the 
ICU extension does not provide anything that can be used to create a 
case-independent collation sequence).


An index is a sorted list. And queries like this:


sqlite> SELECT * FROM test WHERE x = 'ё';


do a binary search of that list to find keys equal to 'ё'. But to do a 
binary search of an ordered list, you need to be using a comparison 
function compatible with that used to sort the list in the first place. 
Say I have the following list, sorted using a unicode aware NOCASE 
collation:


  (Ä, ä, Ë, ë, f)

Also assume that all characters in the list have umlauts adorning them.

Then I open the db using regular SQLite and try searching for "ä". 
Obviously the binary search fails - the first comparison compares the 
seek key "ä" with "Ë", incorrectly concludes that the key "ä" is larger 
than "Ë" and goes on to search the right-hand side of the index. The 
search fails.


Then say this search is part of a delete operation to remove a row from 
the database. The table row itself might be removed correctly, but the 
corresponding index key is not - because a search fails to find it. At 
that point you have an inconsistent table and index. A corrupt database.


In the future, we might have a similar problem in FTS. FTS offers a 
home-grown tokenizer named "unicode61" that folds case in the same 
unicode-aware way as nunicode. If the unicode standard changes to define 
more pairs of case equivalent characters, we will not be able simply 
upgrade "unicode61". For the same reasons - modifying the comparison 
function creates an incompatible system. Instead, we would name it 
"unicode62" or similar, to be sure that databases created using the old 
version continue to use it.


Dan.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Using ICU case folding support

2014-03-19 Thread Dan Kennedy


On 03/19/2014 07:55 PM, Aleksey Tulinov wrote:

On 03/18/2014 10:02 PM, Alex Loukissas wrote:

Alex,

I suppose I can declare the column as BINARY and use LOWER( ) in my 
select
statements. Browsing through the code though, I do see uses of 
u_foldCase

in certain places, which leads me to believe that what I want may be
already there. I'll try to unit test this.



Correct me if i'm wrong, but i think i reproduced bug you've described 
in SQLite with ICU extension compiled in:


sqlite> CREATE TABLE test (x COLLATE NOCASE);
sqlite> INSERT INTO test VALUES ('s');
sqlite> INSERT INTO test VALUES ('S');
sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic
sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic
sqlite> SELECT * FROM test WHERE x = 'ё';
ё
sqlite> SELECT * FROM test WHERE x = 's';
s
S

This might be a point of interest for SQLite ICU extension developers.

On the other hand, SQLite nunicode extension, which i developed, do that:

sqlite> SELECT * FROM test WHERE x = 'ё';
ё
sqlite> .load ./libnusqlite3.so
sqlite> SELECT * FROM test WHERE x = 'ё';
ё
Ё


One problem with this sort of thing (overriding built-in collation 
sequences) is that you need to be careful never to accidentally use the 
wrong version.


If you were to create an index using collation sequence "NOCASE" and 
populate the db using a stock SQLite, then query using a nunicode 
enhanced version, the results would be unpredictable. And running 
"PRAGMA integrity_check" would report corruption. If you were to write 
to the db using both a regular SQLite and a nunicode version the 
database would become corrupt in the sense that no version would be able 
to query it reliably.


For that reason, the SQLite ICU extension does not automatically 
override the "NOCASE" collation sequence the same way it overrides the 
built-in upper() and lower() functions. If we were to add case-folding 
to ICU collations, we would have to use a different name.


It does look like nunicode would solve the OP's problem nicely though.

Dan.


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Using ICU case folding support

2014-03-19 Thread Aleksey Tulinov


On 03/18/2014 10:02 PM, Alex Loukissas wrote:

Alex,


I suppose I can declare the column as BINARY and use LOWER( ) in my select
statements. Browsing through the code though, I do see uses of u_foldCase
in certain places, which leads me to believe that what I want may be
already there. I'll try to unit test this.



Correct me if i'm wrong, but i think i reproduced bug you've described 
in SQLite with ICU extension compiled in:


sqlite> CREATE TABLE test (x COLLATE NOCASE);
sqlite> INSERT INTO test VALUES ('s');
sqlite> INSERT INTO test VALUES ('S');
sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic
sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic
sqlite> SELECT * FROM test WHERE x = 'ё';
ё
sqlite> SELECT * FROM test WHERE x = 's';
s
S

This might be a point of interest for SQLite ICU extension developers.

On the other hand, SQLite nunicode extension, which i developed, do that:

sqlite> SELECT * FROM test WHERE x = 'ё';
ё
sqlite> .load ./libnusqlite3.so
sqlite> SELECT * FROM test WHERE x = 'ё';
ё
Ё

You can try nunicode by following this link: 
https://bitbucket.org/alekseyt/nunicode/overview#markdown-header-sqlite3-extension


Hope this helps.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Using ICU case folding support

2014-03-18 Thread Alex Loukissas

On Tue, Mar 18, 2014 at 12:39 PM, Dan Kennedy  wrote:

> On 03/19/2014 01:22 AM, Alex Loukissas wrote:
>
>> Hello,
>>
>> I'm trying to use the ICU extension and it looks like what I want to do is
>> not very clear from the documentation. I would like to switch from my
>> current usage, where my column declaration is as follows:
>>
>> CREATE TABLE demo("name text not null COLLATE NOCASE");
>>
>> As noted in the documentation, this is not Unicode-aware, which is why I
>> turned to the ICU extension. What should be the equivalent usage where
>> sqlite does case folding using ICU?
>>
>
> I don't think there is a way to do that with the current SQLite ICU
> extension.
>
> Dan.
>

I suppose I can declare the column as BINARY and use LOWER( ) in my select
statements. Browsing through the code though, I do see uses of u_foldCase
in certain places, which leads me to believe that what I want may be
already there. I'll try to unit test this.


>
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
Alex Loukissas

www.maginatics.com
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Using ICU case folding support

2014-03-18 Thread Dan Kennedy


On 03/19/2014 01:22 AM, Alex Loukissas wrote:

Hello,

I'm trying to use the ICU extension and it looks like what I want to do is
not very clear from the documentation. I would like to switch from my
current usage, where my column declaration is as follows:

CREATE TABLE demo("name text not null COLLATE NOCASE");

As noted in the documentation, this is not Unicode-aware, which is why I
turned to the ICU extension. What should be the equivalent usage where
sqlite does case folding using ICU?


I don't think there is a way to do that with the current SQLite ICU 
extension.


Dan.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Using ICU case folding support

Re: [sqlite] Using ICU case folding support

Re: [sqlite] Using ICU case folding support

Re: [sqlite] Using ICU case folding support

Re: [sqlite] Using ICU case folding support

Re: [sqlite] Using ICU case folding support

Re: [sqlite] Using ICU case folding support

Re: [sqlite] Using ICU case folding support

Re: [sqlite] Using ICU case folding support

Re: [sqlite] Using ICU case folding support

10 matches

Site Navigation

Mail list logo

Footer information