Re: [Maria-developers] e3f45b2f9ea: MDEV-10267 Add ngram fulltext parser plugin
Hi, Rinat!

On Nov 02, Rinat Ibragimov wrote:
> > > I can't decide. From my point of view, the current approach is
> > > fine. Please pick a variant, and I'll try to implement that.
> >
> > No, I cannot guess which approach will produce more relevant
> > searches. Implement something and then we test what works better.
>
> Variable-length n-grams approach is too innovative, and hard to reason
> about. I've never heard about such an approach, and it doesn't look
> good to me. So I'll stick with a simple slicer.

If you mean the variant where it splits "n-grams approach" to "n-gr",
"gra", "ram", "ams", "ms a", "s ap", "app", ... then it's just "n letters
in every chunk", very easy to explain. But ok, let's start simple and
benchmark.

> > Of course, it can. Note that fts_get_word() doesn't generate n-grams
> > either, it gets the whole word and the n-gram plugin later splits it
> > into n-grams. Similarly param->mysql_parse() will extract words for
> > you and you'll split them into n-grams.
>
> Changed to use param->mysql_parse().
>
> Turns out that in Aria, MyISAM, and InnoDB, param->mysql_parser() does
> call back param->mysql_add_word(). Is it part of the plugin API?

Yes, it is. E.g. slide 12 from my old presentation
http://conferences.oreillynet.com/presentations/mysql06/golubchik_sergei.pdf
shows that there are three points where a plugin can add functionality:

* It can extract the text and then call param->mysql_parser(); this
  allows it to parse, say, gzip-ed texts or EXIF comments in images.
* It can replace param->mysql_parser(), to use different rules for
  splitting the text into words. This is what the n-gram plugin
  normally does.
* It can replace param->mysql_add_word() to post-process every word
  after the built-in parser did the splitting. For example, a stemming
  or soundex plugin can do that.

> Comments in include/mysql/plugin_ftparser.h do not mention that at
> all. That's why I initially thought that param->mysql_parse() will
> parse the string like the default parser does, without any way to
> interact with the process.

I've edited the comment to mention this possibility.

Regards,
Sergei
VP of MariaDB Server Engineering
and secur...@mariadb.org

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help : https://help.launchpad.net/ListHelp
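The three hook points above can be modeled outside the server. Below is a minimal, self-contained C++ sketch of the third hook, in the spirit of the stemming example: a stand-in for the built-in parser splits text into words and drives an add_word callback, and a replacement callback post-processes each word. This is a toy model, not the real plugin_ftparser.h API; the names builtin_parse and index_with_stemmer are invented for illustration, and the "stemming" is a deliberately crude lowercase-and-strip-plural rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// A word sink, standing in for param->mysql_add_word().
using AddWord = std::function<void(std::string)>;

// Stand-in for the built-in parser behind param->mysql_parse():
// it splits the text into words and invokes add_word for each one.
void builtin_parse(const std::string& text, const AddWord& add_word)
{
  std::istringstream in(text);
  std::string w;
  while (in >> w)
    add_word(w);  // the built-in parser drives the add_word callback
}

// A plugin-style add_word replacement: post-process every word
// (here: lowercase it and strip a trailing plural 's').
std::vector<std::string> index_with_stemmer(const std::string& text)
{
  std::vector<std::string> entries;
  builtin_parse(text, [&](std::string w) {
    std::transform(w.begin(), w.end(), w.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    if (w.size() > 3 && w.back() == 's')
      w.pop_back();  // crude "stemming", illustration only
    entries.push_back(w);
  });
  return entries;
}
```

The point is structural: the word-splitting and the per-word post-processing are independent layers, so a plugin can replace either one without reimplementing the other.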
Re: [Maria-developers] e3f45b2f9ea: MDEV-10267 Add ngram fulltext parser plugin
Hi, Rinat!

On Oct 21, Rinat Ibragimov wrote:
> > > +
> > > +// Avoid creating partial n-grams by stopping at word boundaries.
> > > +// Underscore is treated as a word character too, since identifiers in
> > > +// programming languages often contain them.
> > > +if (!is_alnum(char_type) && !is_underscore(cs, next, end)) {
> > > +  start= next + char_len;
> > > +  next= start;
> > > +  n_chars= 0;
> > > +  continue;
> >
> > I don't think it's a good idea. On the opposite, I would rather create
> > ngrams over word boundaries, so that, say, I'd split "n-gram plugin" to
> >
> > "n-g", "-gr", "gra", "ram", "am ", "m p", " pl", "plu", "lug", "ugi", "gin".
>
> I see how this can increase index size: more different strings are
> included. Implementation will become a tiny bit simpler. But can it
> make searching better somehow?

Yes, the idea is that it'll rank "One can use n-gram plugin in MariaDB
fulltext search" higher than "Plugins are useful for extending
functionality of a program", because some ngrams encode that the words
"n-gram" and "plugin" must be together as a phrase.

> I can't decide. From my point of view, the current approach is fine.
> Please pick a variant, and I'll try to implement that.

No, I cannot guess which approach will produce more relevant searches.
Implement something and then we test what works better.

> > > +}
> > > +
> > > +next += char_len;
> > > +n_chars++;
> > > +
> > > +if (n_chars == ngram_token_size) {
> > > +  bool_info->position= static_cast(start - doc);
> >
> > nope, we don't have MYSQL_FTPARSER_FULL_BOOLEAN_INFO::position and this
> > plugin is definitely not the reason to break the API and add it.
>
> There is a patch in the patchset that adds the "position" field.
> Without that field, it's possible to trigger an assertion in InnoDB's
> FTS code by executing:
>
> INSTALL PLUGIN simple_parser SONAME 'mypluglib';
> CREATE TABLE articles(
>   a TEXT DEFAULT NULL,
>   b TEXT DEFAULT NULL,
>   FULLTEXT (a, b) WITH PARSER simple_parser
> ) ENGINE=InnoDB;
> INSERT INTO articles VALUES ('111', '1234 1234 1234');
> DROP TABLE articles;
> UNINSTALL PLUGIN simple_parser;
>
> The same assertion may be triggered by any fts plugin. Are you sure you
> want that patch to be removed?

I don't think an InnoDB assert is a reason to extend the fulltext plugin
API. What about this fix instead:

diff --git a/storage/innobase/fts/fts0fts.cc b/storage/innobase/fts/fts0fts.cc
--- a/storage/innobase/fts/fts0fts.cc
+++ b/storage/innobase/fts/fts0fts.cc
@@ -4669,7 +4669,7 @@ fts_tokenize_add_word_for_parser(
 	ut_ad(boolean_info->position >= 0);
 	position = boolean_info->position + fts_param->add_pos;
 	*/
-	position = fts_param->add_pos;
+	position = fts_param->add_pos++;

 	fts_add_token(result_doc, str, position);

This fixed the crash for me.

> > > +  n_chars= ngram_token_size - 1;
> > > +  is_first= false;
> > > +}
> > > + }
> > > +
> > > + // If nothing was emitted yet, emit the remainder.
> >
> > what reminder, what is it for?
>
> It's "remainder", with "a" before "i". Remainder of the processed
> string. That way strings that are too short to have even a single
> n-gram will still generate some index entries and will be
> discoverable by exact queries.

Perhaps the comment should say that? That if the string is too short,
it'll be emitted here?

> > > +case MYSQL_FTPARSER_FULL_BOOLEAN_INFO:
> > > +  // Reusing InnoDB's boolean mode parser to handle all special characters.
> > > +  while (fts_get_word(cs, , end, , _info)) {
> >
> > nope. InnoDB might not even be compiled in. Use param->mysql_parse().
>
> Copied the fts_get_word() implementation into the plugin code.
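The "ngrams over word boundaries" splitting suggested above is simply a fixed-size window slid over the raw string, spaces included. A minimal sketch, ASCII-only for brevity (the real plugin must count characters according to the column's charset, and window_ngrams is an invented name):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Slide a window of n characters over the whole string, crossing word
// boundaries, so some n-grams also record which words are adjacent.
std::vector<std::string> window_ngrams(const std::string& s, size_t n)
{
  std::vector<std::string> out;
  for (size_t i = 0; i + n <= s.size(); ++i)
    out.push_back(s.substr(i, n));  // every n-character window, spaces included
  return out;
}
```

For "n-gram plugin" with n = 3 this reproduces exactly the token list quoted in the review: "n-g", "-gr", "gra", "ram", "am ", "m p", " pl", "plu", "lug", "ugi", "gin". The tokens "am ", "m p", " pl" are the ones that exist only when the two words are adjacent, which is what lets the phrase rank higher.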
> > As far as I understand, param->mysql_parse() cannot be used here, as it
> > splits a string into words like the default fulltext search does. This
> > plugin however must generate n-grams.

Of course it can. Note that fts_get_word() doesn't generate n-grams
either: it gets the whole word, and the n-gram plugin later splits it
into n-grams. Similarly, param->mysql_parse() will extract words for
you and you'll split them into n-grams.

Regards,
Sergei
VP of MariaDB Server Engineering
and secur...@mariadb.org
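The division of labour described here, where a word extractor yields whole words and the plugin then slices each word, can be sketched as follows. This is an illustration with invented names (words_to_ngrams stands in for the combination of param->mysql_parse() and the plugin's slicer), ASCII-only; words shorter than n are emitted whole, matching the "remainder" behaviour discussed earlier in the thread:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Extract alphanumeric words from doc, then emit every n-character
// substring of each word. Words shorter than n are emitted as-is so
// they remain findable by exact queries.
std::vector<std::string> words_to_ngrams(const std::string& doc, size_t n)
{
  std::vector<std::string> out;
  size_t i = 0;
  while (i < doc.size()) {
    while (i < doc.size() && !std::isalnum((unsigned char)doc[i]))
      ++i;                                   // skip separators
    size_t start = i;
    while (i < doc.size() && std::isalnum((unsigned char)doc[i]))
      ++i;                                   // consume one word
    if (i == start)
      continue;                              // trailing separators only
    std::string word = doc.substr(start, i - start);
    if (word.size() < n) {
      out.push_back(word);                   // the "remainder" case
      continue;
    }
    for (size_t j = 0; j + n <= word.size(); ++j)
      out.push_back(word.substr(j, n));
  }
  return out;
}
```

Note that, unlike the cross-boundary variant, no n-gram here spans two words: word extraction and n-gram slicing are fully separate steps.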
Re: [Maria-developers] e3f45b2f9ea: MDEV-10267 Add ngram fulltext parser plugin
Hi, Rinat!

On Oct 19, Rinat Ibragimov wrote:
> revision-id: e3f45b2f9ea (mariadb-10.2.31-368-ge3f45b2f9ea)
> parent(s): 44c48cefdbe
> author: Rinat Ibragimov
> committer: Rinat Ibragimov
> timestamp: 2020-08-25 01:59:06 +0300
> message:
>
> MDEV-10267 Add ngram fulltext parser plugin
>
> Ngram is a fulltext parser plugin that splits words into overlapping
> segments of fixed length. For example, if 3-grams are in use, the string
> "abcdef" is split into "abc", "bcd", "cde", "def", thus allowing an
> efficient substring search for CJK texts where splitting by words is not
> feasible. Only word characters go into n-grams. Spaces and
> punctuation characters are treated as separators.
>
> diff --git a/plugin/fulltext/CMakeLists.txt b/plugin/fulltext/CMakeLists.txt
> index 3c37666cb3e..9704733a9aa 100644
> --- a/plugin/fulltext/CMakeLists.txt
> +++ b/plugin/fulltext/CMakeLists.txt
> @@ -18,3 +18,7 @@ set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -DNDEBUG")
>
>  MYSQL_ADD_PLUGIN(ftexample plugin_example.c
>    MODULE_ONLY MODULE_OUTPUT_NAME "mypluglib" COMPONENT Test)
> +
> +# n-gram parser
> +MYSQL_ADD_PLUGIN(ftngram ngram_parser/plugin_ngram.cc
> +  MODULE_ONLY MODULE_OUTPUT_NAME "fulltext_ngram")

Why put the plugin under plugin/fulltext/ngram_parser/? It's only one
file, just put it in plugin/fulltext/.

> diff --git a/plugin/fulltext/mysql-test/ngram/disabled.def b/plugin/fulltext/mysql-test/ngram/disabled.def
> new file mode 100644
> index 000..0ca9dd33ef7
> --- /dev/null
> +++ b/plugin/fulltext/mysql-test/ngram/disabled.def
> @@ -0,0 +1 @@
> +ngram_gb18030 : GB18030 character set is not supported yet.

I don't think there's any need to add tests for unsupported charsets.
The plugin is trivial, no need to test it for every possible charset
that we have and don't have.
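Since the plugin targets CJK text, the fixed-length slicing in the commit message has to count characters, not bytes. A minimal sketch assuming valid UTF-8 input (ngrams_utf8 and utf8_char_len are invented names; the real plugin would go through the server's charset routines rather than hand-decoding UTF-8):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Byte length of one UTF-8 character, from its lead byte.
static size_t utf8_char_len(unsigned char lead)
{
  if (lead < 0x80) return 1;          // 0xxxxxxx: ASCII
  if ((lead >> 5) == 0x6) return 2;   // 110xxxxx
  if ((lead >> 4) == 0xE) return 3;   // 1110xxxx: most CJK
  return 4;                           // 11110xxx
}

// All overlapping n-grams of s, counted in characters, not bytes.
std::vector<std::string> ngrams_utf8(const std::string& s, size_t n)
{
  // Record the byte offset of every character boundary.
  std::vector<size_t> starts;
  for (size_t i = 0; i < s.size(); i += utf8_char_len((unsigned char)s[i]))
    starts.push_back(i);
  starts.push_back(s.size());

  std::vector<std::string> out;
  for (size_t i = 0; i + n < starts.size(); ++i)
    out.push_back(s.substr(starts[i], starts[i + n] - starts[i]));
  return out;
}
```

For ASCII this gives the commit message's example ("abcdef" into "abc", "bcd", "cde", "def"); for a three-character CJK string and n = 2 it yields two 2-grams of three bytes each.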
> diff --git a/plugin/fulltext/mysql-test/ngram/ngram_basic.inc b/plugin/fulltext/mysql-test/ngram/ngram_basic.inc
> --- /dev/null
> +++ b/plugin/fulltext/mysql-test/ngram/ngram_basic.inc
> @@ -0,0 +1,25 @@
> +--echo
> +--eval CREATE TABLE t (lang TEXT, str TEXT, FULLTEXT(str) WITH PARSER ngram) ENGINE=$engine CHARSET $charset
> +
> +# Some characters are not possible to represent in some encodings. Ignore them.
> +--disable_warnings
> +--disable_query_log
> +# http://www.gerolf.org/doc/polyglot/polyglot.htm
> +INSERT IGNORE INTO t VALUES
> +  ('Chinese', '他们为什么不说中文?'),
> +  ('English', 'Why can''t they just speak English?'),
> +  ('Japanese', 'なぜ、みんな日本語を話してくれないのか?'),
> +  ('Korean', '세계의 모든 사람들이 한국어를 이해한다면 얼마나 좋을까?');
> +--enable_query_log
> +--enable_warnings
> +
> +SELECT * FROM t ORDER BY lang;
> +
> +SELECT * FROM t WHERE MATCH(str) AGAINST('之一' IN BOOLEAN MODE) ORDER BY lang;
> +SELECT * FROM t WHERE MATCH(str) AGAINST('数据' IN BOOLEAN MODE) ORDER BY lang;
> +SELECT * FROM t WHERE MATCH(str) AGAINST('zzz' IN BOOLEAN MODE) ORDER BY lang;
> +SELECT * FROM t WHERE MATCH(str) AGAINST('gli' IN BOOLEAN MODE) ORDER BY lang;
> +SELECT * FROM t WHERE MATCH(str) AGAINST('ない' IN BOOLEAN MODE) ORDER BY lang;
> +SELECT * FROM t WHERE MATCH(str) AGAINST('람들' IN BOOLEAN MODE) ORDER BY lang;
> +
> +DROP TABLE t;
> diff --git a/plugin/fulltext/mysql-test/ngram/ngram_basic.test b/plugin/fulltext/mysql-test/ngram/ngram_basic.test
> new file mode 100644
> index 000..eaa8829c3ee
> --- /dev/null
> +++ b/plugin/fulltext/mysql-test/ngram/ngram_basic.test
> @@ -0,0 +1,25 @@
> +# Ngram fulltext parser basic functionality.
> +
> +SET NAMES utf8mb4;
> +SELECT @@ngram_token_size;
> +
> +let $engine_index= 3;
> +while ($engine_index) {
> +  if ($engine_index == 3) { let $engine= InnoDB; }
> +  if ($engine_index == 2) { let $engine= MyISAM; }
> +  if ($engine_index == 1) { let $engine= Aria; }
> +
> +  let $charset_index= 6;
> +  while ($charset_index) {
> +    if ($charset_index == 6) { let $charset= utf8mb4; }
> +    if ($charset_index == 5) { let $charset= big5; }
> +    if ($charset_index == 4) { let $charset= gbk; }
> +    if ($charset_index == 3) { let $charset= gb2312; }
> +    if ($charset_index == 2) { let $charset= sjis; }
> +    if ($charset_index == 1) { let $charset= ujis; }
> +
> +    source ngram_basic.inc;
> +    dec $charset_index;
> +  }

Eh. Unroll the loop please, write

  let $engine= InnoDB;
  let $charset= utf8mb4;
  source ngram_basic.inc;
  let $charset= big5;
  source ngram_basic.inc;
  ...

Or, better:

  CREATE TABLE t (lang TEXT, str TEXT, FULLTEXT(str) WITH PARSER ngram)
    ENGINE=InnoDB CHARSET utf8mb4;
  source ngram_basic.inc;
  CREATE TABLE t (lang TEXT, str TEXT, FULLTEXT(str) WITH PARSER ngram)
    ENGINE=InnoDB CHARSET big5;
  source ngram_basic.inc;
  ...

> +  dec $engine_index;
> +}
> diff --git a/plugin/fulltext/mysql-test/ngram/ngram_missing_plugin.result b/plugin/fulltext/mysql-test/ngram/ngram_missing_plugin.result
> new file mode 100644
> index 000..aa17b1d6053
> --- /dev/null
> +++