On 02/04/2018 11:39 AM, raj Singla wrote:
Hi,

-- create fts4 and fts5 tables
create virtual table idx4 using "fts4" (content);
create virtual table idx5 using "fts5" (content);

-- insert 1 sample row into each
insert into idx4 (content) values ('नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?');
insert into idx5 (content) values ('नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?');

-- query index using complete and partial strings
select * from idx4 where idx4 match 'पाकिस्तान';
-- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?
select * from idx4 where idx4 match 'पाकि';
-- no results returned
select * from idx5 where idx5 match 'पाकिस्तान';
-- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?
select * from idx5 where idx5 match 'पाकि';
-- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?


FTS5 is returning results for substring searches on Hindi content.
Is this expected behavior?
Could you please provide more insight on this? Maybe this is just an
experimental feature.

By default, FTS5 uses a unicode tokenizer based on data extracted from reference file "UnicodeData.txt":

http://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt

Which divides the characters into categories:

  http://www.fileformat.info/info/unicode/category/index.htm

FTS5 considers categories "Co", "L*" and "N*" to be token characters and all others to be separator characters (handled in the same way as spaces).

The string "पाकिस्तान" contains 9 characters, 4 of which are from the "Mn" and "Mc" categories, specifically 0x93E, 0x93F, 0x94D and 0x93E (3 distinct code points, since 0x93E appears twice). According to UnicodeData.txt, these characters are:

  093E;DEVANAGARI VOWEL SIGN AA;Mc;
  093F;DEVANAGARI VOWEL SIGN I;Mc;
  094D;DEVANAGARI SIGN VIRAMA;Mn;
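
The categories above can be checked without opening UnicodeData.txt. The following is a minimal sketch using Python's standard unicodedata module; the names WORD and describe are made up for illustration, and Python ships a newer Unicode database than 6.1.0, although the categories of these particular characters are the same:

```python
import unicodedata

# The word "पाकिस्तान" written as explicit code points (illustrative constant).
WORD = "\u092a\u093e\u0915\u093f\u0938\u094d\u0924\u093e\u0928"

def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (f"{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]

# Print one line per character, e.g. "093E Mc DEVANAGARI VOWEL SIGN AA".
for code_point, category, name in describe(WORD):
    print(code_point, category, name)
```

The consonants are reported as "Lo" (a letter category), while the vowel signs and the virama are reported as "Mc" and "Mn".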

And so the string is being split into several (actually 5 - as there are two instances of 0x93E) different words. Given your report, I'm guessing that is not what people expect. Can you, or any other Hindi speaker, confirm that "पाकिस्तान" should be treated as a single word by FTS5? And not broken into several different words?
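
To make the split concrete, here is a rough Python simulation of the rule described above (categories "Co", "L*" and "N*" are token characters, everything else is a separator). It is only a sketch of that rule, not the actual FTS5 tokenizer code: the function names are invented, case folding and other tokenizer options are ignored, and Python's Unicode tables are newer than 6.1.0.

```python
import unicodedata

def is_token_char(ch):
    """Apply the rule from this message: Co, L* and N* are token characters."""
    category = unicodedata.category(ch)
    return category == "Co" or category[0] in ("L", "N")

def split_tokens(text):
    """Split text into runs of token characters; all other characters separate."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

word = "\u092a\u093e\u0915\u093f\u0938\u094d\u0924\u093e\u0928"  # पाकिस्तान
partial = "\u092a\u093e\u0915\u093f"                              # पाकि

print(split_tokens(word))     # five single-consonant tokens
print(split_tokens(partial))  # the first two of those tokens
```

Under this rule the partial string "पाकि" becomes the same two leading tokens as the full word, which is presumably why the FTS5 query for it matches the row.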

Dan.

Thank You,
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

