Re: [Maria-developers] e3f45b2f9ea: MDEV-10267 Add ngram fulltext parser plugin

2020-11-11 Thread Sergei Golubchik
Hi, Rinat!

On Nov 02, Rinat Ibragimov wrote:
> > > I can't decide. From my point of view, the current approach is
> > > fine. Please pick a variant, and I'll try to implement that.
> >
> > No, I cannot guess which approach will produce more relevant
> > searches. Implement something and then we'll test what works better.
> 
> The variable-length n-grams approach is too innovative and hard to reason
> about. I've never heard of such an approach, and it doesn't look
> good to me. So I'll stick with a simple slicer.

If you mean that variant where it splits

  "n-grams approach" to "n-gr", "gra", "ram", "ams", "ms a", "s ap", "app", ... 

then it's just "n letters in every chunk", which is very easy to explain.

But ok, let's start simple and benchmark.
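For reference, the simple fixed-size slicer can be sketched in a few lines. This is a hypothetical ASCII-only model (the real plugin walks multi-byte characters through CHARSET_INFO); it slides an n-character window over each run of word characters and resets at separators.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical ASCII-only sketch of the "simple slicer": slide an
// n-character window over each run of word characters (letters, digits,
// underscore) and reset at separators, so no n-gram crosses a word boundary.
static std::vector<std::string> word_ngrams(const std::string &text, size_t n)
{
  std::vector<std::string> out;
  size_t start= 0;
  while (start < text.size())
  {
    unsigned char c= (unsigned char) text[start];
    if (!isalnum(c) && c != '_')
    {
      start++;                            // skip separators
      continue;
    }
    size_t end= start;                    // find the end of the current word
    while (end < text.size() &&
           (isalnum((unsigned char) text[end]) || text[end] == '_'))
      end++;
    for (size_t i= start; i + n <= end; i++)
      out.push_back(text.substr(i, n));   // emit each n-character window
    start= end;
  }
  return out;
}
```

With n=3, "abcdef" yields "abc", "bcd", "cde", "def", while in "n-gram plugin" the one-letter word "n" is too short to produce any 3-gram at all, which is where the "emit the remainder" fallback discussed elsewhere in this thread comes in.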

> > Of course, it can. Note that fts_get_word() doesn't generate n-grams
> > either, it gets the whole word and the n-gram plugin later splits it
> > into n-grams. Similarly param->mysql_parse() will extract words for
> > you and you'll split them into n-grams.
> 
> Changed to use param->mysql_parse().
> 
> Turns out that in Aria, MyISAM, and InnoDB, param->mysql_parser() does
> call back param->mysql_add_word(). Is it part of the plugin API?

Yes, it is. E.g. slide 12 from my old presentation:
http://conferences.oreillynet.com/presentations/mysql06/golubchik_sergei.pdf
shows that there are three points where a plugin can add functionality.

* It can extract the text and then call param->mysql_parser();
  this allows parsing, say, gzip-ed texts or EXIF comments in images.

* It can replace param->mysql_parser(), to use different rules for
  splitting the text into words. This is what the n-gram plugin normally
  does.

* It can replace param->mysql_add_word() to post-process every word
  after the built-in parser did the splitting. For example, stemming or
  soundex plugin can do that.

> Comments in include/mysql/plugin_ftparser.h do not mention that at
> all. That's why I initially thought that param->mysql_parse() will
> parse the string like the default parser does, without any way to
> interact with the process.

I've edited the comment to mention this possibility.
 
Regards,
Sergei
VP of MariaDB Server Engineering
and secur...@mariadb.org

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] e3f45b2f9ea: MDEV-10267 Add ngram fulltext parser plugin

2020-10-21 Thread Sergei Golubchik
Hi, Rinat!

On Oct 21, Rinat Ibragimov wrote:
> > > +
> > > +// Avoid creating partial n-grams by stopping at word boundaries.
> > > +// Underscore is treated as a word character too, since identifiers in
> > > +// programming languages often contain them.
> > > +if (!is_alnum(char_type) && !is_underscore(cs, next, end)) {
> > > +  start= next + char_len;
> > > +  next= start;
> > > +  n_chars= 0;
> > > +  continue;
> >
> > I don't think it's a good idea. On the opposite, I would rather create
> > ngrams over word boundaries, so that, say, I'd split "n-gram plugin" to
> >
> > "n-g", "-gr", "gra", "ram", "am ", "m p", " pl", "plu", "lug", "ugi", "gin".
> 
> I see how this can increase index size: more different strings are
> included. Implementation will become a tiny bit simpler. But can it
> make searching better somehow?

Yes, the idea is that it'll rank

  "One can use n-gram plugin in MariaDB fulltext search"

higher than

  "Plugins are useful for extending functionality of a program"

because some n-grams encode the fact that the words "n-gram" and "plugin"
occur together as a phrase.
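To make the difference concrete, here is a hypothetical sketch of the cross-boundary variant: the window slides over the raw text, separators included, so some n-grams record that two words are adjacent. (ASCII-only illustration, not the plugin code.)

```cpp
#include <cassert>
#include <string>
#include <vector>

// Cross-boundary variant: every n-character window of the raw text,
// separators included, becomes an n-gram. Windows such as "m p" exist
// only in documents where "n-gram" is immediately followed by "plugin",
// so they encode phrase adjacency in the index.
static std::vector<std::string> raw_ngrams(const std::string &text, size_t n)
{
  std::vector<std::string> out;
  for (size_t i= 0; i + n <= text.size(); i++)
    out.push_back(text.substr(i, n));
  return out;
}
```

raw_ngrams("n-gram plugin", 3) produces exactly the split quoted above, including the boundary-spanning "am ", "m p", " pl" trigrams.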

> I can't decide. From my point of view, the current approach is fine.
> Please pick a variant, and I'll try to implement that.

No, I cannot guess which approach will produce more relevant searches.
Implement something and then we'll test what works better.

> > > +}
> > > +
> > > +next += char_len;
> > > +n_chars++;
> > > +
> > > +if (n_chars == ngram_token_size) {
> > > +  bool_info->position= static_cast(start - doc);
> >
> > nope, we don't have MYSQL_FTPARSER_FULL_BOOLEAN_INFO::position and this
> > plugin is definitely not the reason to break the API and add it.
> 
> There is a patch in the patchset that adds the "position" field. Without
> that field,
> it's possible to trigger an assertion in InnoDB's FTS code by executing:
> 
> INSTALL PLUGIN simple_parser SONAME 'mypluglib';
> CREATE TABLE articles(
> a TEXT DEFAULT NULL,
> b TEXT DEFAULT NULL,
> FULLTEXT (a, b) WITH PARSER simple_parser
> ) ENGINE=InnoDB;
> INSERT INTO articles VALUES ('111', '1234 1234 1234');
> DROP TABLE articles;
> UNINSTALL PLUGIN simple_parser;
> 
> The same assertion may be triggered by any fts plugin. Are you sure you
> want that patch to be removed?

I don't think an InnoDB assert is a reason to extend the fulltext plugin
API. What about this fix instead:

diff --git a/storage/innobase/fts/fts0fts.cc b/storage/innobase/fts/fts0fts.cc
--- a/storage/innobase/fts/fts0fts.cc
+++ b/storage/innobase/fts/fts0fts.cc
@@ -4669,7 +4669,7 @@ fts_tokenize_add_word_for_parser(
ut_ad(boolean_info->position >= 0);
position = boolean_info->position + fts_param->add_pos;
*/
-   position = fts_param->add_pos;
+   position = fts_param->add_pos++;

fts_add_token(result_doc, str, position);

this fixed the crash for me.
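The effect of the one-character change can be modeled without InnoDB at all. The struct and function below are simplified hypothetical stand-ins (not the real InnoDB types) showing that `add_pos++` hands every token a strictly increasing position, which is what the downstream ordering assertion expects.

```cpp
#include <cassert>

// Simplified stand-in for the InnoDB tokenizer state: add_pos is the
// base position the parser advances as tokens are added.
struct fts_param_sim { int add_pos; };

static int token_position(fts_param_sim *p)
{
  // Before the fix every token got the same position (add_pos was never
  // advanced here); after the fix positions increase monotonically.
  return p->add_pos++;
}
```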

> > > +  n_chars= ngram_token_size - 1;
> > > +  is_first= false;
> > > +}
> > > +  }
> > > +
> > > +  // If nothing was emitted yet, emit the remainder.
> >
> > what reminder, what is it for?
> 
> It's "remainder", with "a" before "i". Remainder of the processed
> string. That way strings that are too short to have even a single
> n-gram will still generate some index entries and will be
> discoverable by exact queries.

Perhaps the comment should say that? That if the string is too short, it'll
be emitted here?

> > > +case MYSQL_FTPARSER_FULL_BOOLEAN_INFO:
> > > +  // Reusing InnoDB's boolean mode parser to handle all special characters.
> > > +  while (fts_get_word(cs, , end, , _info)) {
> >
> > nope. InnoDB might not even be compiled in. Use param->mysql_parse()
> 
> Copied fts_get_word() implementation into the plugin code.
> 
> As far as I understand, param->mysql_parse() cannot be used here as it
> splits a string into words like default fulltext search does. This
> plugin however must generate n-grams.

Of course, it can. Note that fts_get_word() doesn't generate n-grams
either, it gets the whole word and the n-gram plugin later splits it
into n-grams. Similarly param->mysql_parse() will extract words for you
and you'll split them into n-grams.
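The division of labor described here can be sketched as follows. default_parse() stands in for the server-side param->mysql_parse() (which finds whole words and invokes the registered add-word callback), and the lambda plays the role of the plugin's mysql_add_word(), re-splitting each word into n-grams. All names are simplified stand-ins, not the real MYSQL_FTPARSER_PARAM API.

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

using add_word_fn= std::function<void(const std::string &)>;

// Stand-in for param->mysql_parse(): the server splits the document into
// whole words and hands each one to the plugin's add-word callback.
static void default_parse(const std::string &doc, const add_word_fn &add_word)
{
  std::istringstream in(doc);
  std::string word;
  while (in >> word)
    add_word(word);
}

// The n-gram plugin lets the default parser find the words, then its
// callback splits every word it receives into n-grams.
static std::vector<std::string> ngram_parse(const std::string &doc, size_t n)
{
  std::vector<std::string> out;
  default_parse(doc, [&](const std::string &w) {
    for (size_t i= 0; i + n <= w.size(); i++)
      out.push_back(w.substr(i, n));
  });
  return out;
}
```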

Regards,
Sergei
VP of MariaDB Server Engineering
and secur...@mariadb.org



Re: [Maria-developers] e3f45b2f9ea: MDEV-10267 Add ngram fulltext parser plugin

2020-10-19 Thread Sergei Golubchik
Hi, Rinat!

On Oct 19, Rinat Ibragimov wrote:
> revision-id: e3f45b2f9ea (mariadb-10.2.31-368-ge3f45b2f9ea)
> parent(s): 44c48cefdbe
> author: Rinat Ibragimov 
> committer: Rinat Ibragimov 
> timestamp: 2020-08-25 01:59:06 +0300
> message:
> 
> MDEV-10267 Add ngram fulltext parser plugin
> 
> Ngram is a fulltext parser plugin that splits words into overlapping
> segments of fixed lengths. For example, if 3-grams are in use, string
> "abcdef" is split into "abc", "bcd", "cde", "def", thus allowing an
> efficient substring search for CJK texts where splitting by words is not
> feasible. Only word characters go into n-grams. Spaces and
> punctuation characters are treated as separators.
> 
> diff --git a/plugin/fulltext/CMakeLists.txt b/plugin/fulltext/CMakeLists.txt
> index 3c37666cb3e..9704733a9aa 100644
> --- a/plugin/fulltext/CMakeLists.txt
> +++ b/plugin/fulltext/CMakeLists.txt
> @@ -18,3 +18,7 @@ set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -DNDEBUG")
>  
>  MYSQL_ADD_PLUGIN(ftexample plugin_example.c
>MODULE_ONLY MODULE_OUTPUT_NAME "mypluglib" COMPONENT Test)
> +
> +# n-gram parser
> +MYSQL_ADD_PLUGIN(ftngram ngram_parser/plugin_ngram.cc
> +  MODULE_ONLY MODULE_OUTPUT_NAME "fulltext_ngram")

why put the plugin under plugin/fulltext/ngram_parser/ ? it's only
one file, just put it in plugin/fulltext/

> diff --git a/plugin/fulltext/mysql-test/ngram/disabled.def b/plugin/fulltext/mysql-test/ngram/disabled.def
> new file mode 100644
> index 000..0ca9dd33ef7
> --- /dev/null
> +++ b/plugin/fulltext/mysql-test/ngram/disabled.def
> @@ -0,0 +1 @@
> +ngram_gb18030 : GB18030 character set is not supported yet.

I don't think there's any need to add tests for unsupported charsets.
The plugin is trivial, no need to test it for every possible
charset that we have and don't have.

> diff --git a/plugin/fulltext/mysql-test/ngram/ngram_basic.inc b/plugin/fulltext/mysql-test/ngram/ngram_basic.inc
> --- /dev/null
> +++ b/plugin/fulltext/mysql-test/ngram/ngram_basic.inc
> @@ -0,0 +1,25 @@
> +--echo
> +--eval CREATE TABLE t (lang TEXT, str TEXT, FULLTEXT(str) WITH PARSER ngram) ENGINE=$engine CHARSET $charset
> +
> +# Some characters are not possible to represent in some encodings. Ignore them.
> +--disable_warnings
> +--disable_query_log
> +# http://www.gerolf.org/doc/polyglot/polyglot.htm.
> +INSERT IGNORE INTO t VALUES
> +  ('Chinese', '他们为什么不说中文?'),
> +  ('English', 'Why can''t they just speak English?'),
> +  ('Japanese', 'なぜ、みんな日本語を話してくれないのか?'),
> +  ('Korean', '세계의 모든 사람들이 한국어를 이해한다면 얼마나 좋을까?');
> +--enable_query_log
> +--enable_warnings
> +
> +SELECT * FROM t ORDER BY lang;
> +
> +SELECT * FROM t WHERE MATCH(str) AGAINST('之一' IN BOOLEAN MODE) ORDER BY lang;
> +SELECT * FROM t WHERE MATCH(str) AGAINST('数据' IN BOOLEAN MODE) ORDER BY lang;
> +SELECT * FROM t WHERE MATCH(str) AGAINST('zzz' IN BOOLEAN MODE) ORDER BY lang;
> +SELECT * FROM t WHERE MATCH(str) AGAINST('gli' IN BOOLEAN MODE) ORDER BY lang;
> +SELECT * FROM t WHERE MATCH(str) AGAINST('ない' IN BOOLEAN MODE) ORDER BY lang;
> +SELECT * FROM t WHERE MATCH(str) AGAINST('람들' IN BOOLEAN MODE) ORDER BY lang;
> +
> +DROP TABLE t;
> diff --git a/plugin/fulltext/mysql-test/ngram/ngram_basic.test b/plugin/fulltext/mysql-test/ngram/ngram_basic.test
> new file mode 100644
> index 000..eaa8829c3ee
> --- /dev/null
> +++ b/plugin/fulltext/mysql-test/ngram/ngram_basic.test
> @@ -0,0 +1,25 @@
> +# Ngram fulltext parser basic functionality.
> +
> +SET NAMES utf8mb4;
> +SELECT @@ngram_token_size;
> +
> +let $engine_index= 3;
> +while ($engine_index) {
> +  if ($engine_index == 3) { let $engine= InnoDB; }
> +  if ($engine_index == 2) { let $engine= MyISAM; }
> +  if ($engine_index == 1) { let $engine= Aria; }
> +
> +  let $charset_index= 6;
> +  while ($charset_index) {
> +if ($charset_index == 6) { let $charset= utf8mb4; }
> +if ($charset_index == 5) { let $charset= big5; }
> +if ($charset_index == 4) { let $charset= gbk; }
> +if ($charset_index == 3) { let $charset= gb2312; }
> +if ($charset_index == 2) { let $charset= sjis; }
> +if ($charset_index == 1) { let $charset= ujis; }
> +
> +source ngram_basic.inc;
> +dec $charset_index;
> +  }

eh. unroll the loop please, write

  let $engine= InnoDB;
  let $charset= utf8mb4;
  source ngram_basic.inc;

  let $charset= big5;
  source ngram_basic.inc;
  ...

Or, better:

  CREATE TABLE t (lang TEXT, str TEXT, FULLTEXT(str) WITH PARSER ngram) ENGINE=InnoDB CHARSET utf8mb4;
  source ngram_basic.inc;

  CREATE TABLE t (lang TEXT, str TEXT, FULLTEXT(str) WITH PARSER ngram) ENGINE=InnoDB CHARSET big5;
  source ngram_basic.inc;
  ...

> +  dec $engine_index;
> +}
> diff --git a/plugin/fulltext/mysql-test/ngram/ngram_missing_plugin.result b/plugin/fulltext/mysql-test/ngram/ngram_missing_plugin.result
> new file mode 100644
> index 000..aa17b1d6053
> --- /dev/null
> +++