Re: [sqlite] fts5 giving results for substring searches for Hindi content.

2018-02-05 Thread Dan Kennedy

On 02/04/2018 11:39 AM, raj Singla wrote:

Hi,

-- create fts4 and fts5 tables
create virtual table idx4 using "fts4" (content);
create virtual table idx5 using "fts5" (content);

-- insert 1 sample row into each
insert into idx4 (content) values ('नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?');
insert into idx5 (content) values ('नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?');

-- query index using complete and partial strings
select * from idx4 where idx4 match 'पाकिस्तान';
-- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?
select * from idx4 where idx4 match 'पाकि';
-- no results returned
select * from idx5 where idx5 match 'पाकिस्तान';
-- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?
select * from idx5 where idx5 match 'पाकि';
-- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?


FTS5 is returning results for substring searches on Hindi content.
Is this expected behavior?
Could you please provide more insight into this? Maybe this is just an
experimental feature.


By default, FTS5 uses a unicode tokenizer based on data extracted from 
reference file "UnicodeData.txt":


http://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt

This file divides the characters into categories:

  http://www.fileformat.info/info/unicode/category/index.htm

FTS5 considers categories "Co", "L*" and "N*" to be token characters and 
all others to be separator characters (handled in the same way as spaces).


The string "पाकिस्तान" contains 9 characters, 4 of which are from the 
"Mn" and "Mc" categories, specifically 0x93E, 0x93F, 0x94D and 0x93E. 
According to UnicodeData.txt, these characters are:


  093E;DEVANAGARI VOWEL SIGN AA;Mc;
  093F;DEVANAGARI VOWEL SIGN I;Mc;
  094D;DEVANAGARI SIGN VIRAMA;Mn;

And so the string is being split into several (actually 5 - as there are 
two instances of 0x93E) different words. Given your report, I'm guessing 
that is not what people expect. Can you, or any other Hindi speaker, 
confirm that "पाकिस्तान" should be treated as a single word by FTS5? And 
not broken into several different words?
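
The splitting described above can be reproduced outside SQLite with a rough Python sketch. This is only a simulation of the category rule (Co, L* and N* are token characters, everything else is a separator), not the actual FTS5 tokenizer code. It uses Python's own Unicode tables rather than the 6.1.0 data, and the helper names (is_token_char, tokenize, contains_phrase) are made up for illustration. It also assumes that a query string which tokenizes into several tokens is matched as a phrase of those tokens, which would be consistent with the result reported.

```python
import unicodedata

def is_token_char(ch):
    """Category rule from the text: Co, L* and N* are token characters."""
    cat = unicodedata.category(ch)
    return cat == "Co" or cat[0] in ("L", "N")

def tokenize(text):
    """Split text into runs of token characters; all else is a separator."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

def contains_phrase(doc_tokens, query_tokens):
    """True if query_tokens occur consecutively inside doc_tokens."""
    n = len(query_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

# The word from the report, written as escapes so the code points are explicit.
word = "\u092A\u093E\u0915\u093F\u0938\u094D\u0924\u093E\u0928"
query = word[:4]  # the partial string used in the failing query

# Show each code point with its general category.
for ch in word:
    print("U+%04X %s" % (ord(ch), unicodedata.category(ch)))

word_tokens = tokenize(word)
query_tokens = tokenize(query)
print(len(word_tokens), "tokens in the full word")
print(len(query_tokens), "tokens in the partial query")
print("partial query matches as a phrase:",
      contains_phrase(word_tokens, query_tokens))
```

Under these assumptions the partial string becomes a short run of single-consonant tokens that also appears at the start of the full word's token sequence, which is why it would look like a substring match.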


Dan.

Thank You,
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users




Re: [sqlite] fts5 giving results for substring searches for Hindi content.

2018-02-04 Thread Clemens Ladisch
raj Singla wrote:
> create virtual table idx4 using "fts4" (content);
> create virtual table idx5 using "fts5" (content);
> ...
> select * from idx4 where idx4 match 'पाकि';-- no results returned
> select * from idx5 where idx5 match 'पाकि';-- returns नीरजा भनोट के

FTS4 and FTS5 have different default tokenizers ("simple" for FTS4,
"unicode61" for FTS5):
http://www.sqlite.org/fts3.html#tokenizer
http://www.sqlite.org/fts5.html#tokenizers
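
To illustrate why the FTS4 table does not match the partial string, here is a small Python sketch of the rule the FTS3/FTS4 documentation gives for the default "simple" tokenizer: a term is a run of ASCII alphanumeric characters and of any characters with code points of 128 or above, with ASCII uppercase folded to lowercase. This is a simulation written for this message, not SQLite code, and the function name simple_tokenize is made up.

```python
def simple_tokenize(text):
    """Approximate the FTS3/FTS4 "simple" tokenizer rule (illustrative only)."""
    tokens, current = [], []
    for ch in text:
        code = ord(ch)
        if code >= 128:
            # Every non-ASCII character is eligible, combining marks included.
            current.append(ch)
        elif ch.isalnum():
            # ASCII letters and digits are eligible; fold letters to lowercase.
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

# The word from the report, written as escapes so the code points are explicit.
word = "\u092A\u093E\u0915\u093F\u0938\u094D\u0924\u093E\u0928"
partial = word[:4]

tokens = simple_tokenize("abc " + word + ", XYZ")
print(len(tokens), "tokens")
print("full word is a single token:", word in tokens)
print("partial string is a token:", partial in tokens)
```

Because the whole Hindi word stays together as one term under this rule, a query for only its first few characters has no term to match unless a prefix query is used.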


Regards,
Clemens