On 9/19/07, Ralf Junker <[EMAIL PROTECTED]> wrote:
> >Regarding per-row versus per-column tokenizers, your suggestion to
> >have something like 'content_kr TOKENIZER icu_kr' for each variant is
> >reasonable, but I'm not certain what gain it would have over simply
> >having separate tables for each variant.  Since data about records is
> >pushed into the schema itself, you potentially have to generate fairly
> >intricate queries, probably dynamically.
>
> Queries like this
>
>   select * from recipe where recipe match 'pie';
>
> would not need dynamic creation.

But wouldn't this usage exactly _force_ you to tokenize the query for
every column-defined tokenizer, and match against each column
separately?  I had thought you meant something like:

   select * from recipe where recipe_kr match 'pie';

to force the tokenizer selection.

> >This implementation also suffers from the query-tokenization problem you 
> >mention.
>
> I suggested column-tokenizers to work around the query-tokenization problem.
> The column-tokenizer definition would enable FTS to parse the query using
> the appropriate tokenizer for each row. Users would be guaranteed that each
> column would be searched with its matching query tokenizer. This would of
> course mean that, for multiple columns of different tokenizers, FTS would
> have to search each column individually.

My original reading of your concern was that you'd have to re-tokenize
the query for every possible tokenizer, which I understood was
considered expensive (it would be!).  But now it sounds like you
desire that for correctness reasons?

My proposal had two areas of tokenization, which were disconnected.
There was tokenization of documents before building a unified index,
and tokenization of queries before querying that unified index.  While
you might assert that a query is in English, you might get hits from
documents written in Korean.  This is certainly a mixed bag, but it
does loosely match the way such things are done in large-scale systems
such as web-search (which is a strong input from my end, for obvious
reasons!).

> >To be clear on the type of problem my proposal was targeted at, say
> >you have something like Google Reader, where there are a bunch of
> >articles that you want to search over.  Each article generally doesn't
> >have multiple translated variants; instead, the language is a piece of
> >data about the article.  For this kind of system, pre-defining columns
> >for every language the system may encounter might result in dozens of
> >columns, with exactly one column used for any particular row.
> >
> >NOTE: I think that your idea of per-column default tokenizers may be a
> >good idea for fts to have.  I'm just questioning whether it is
> >targeted at the same problem my proposal is.
>
> Both approaches certainly have somewhat different targets from a user's
> perspective. From a storage perspective, the column approach is reasonable
> if most columns contain non-null values or if storage space for a null value
> is negligible.

One concern I have is that the use-cases I've seen are likely to
result in a fair number of columns, possibly with combinatoric impacts
(60 language-custom versions multiplied by 5 desired columns).  I
don't have a strong enough formal grounding in database systems to put
up a quantifiable argument, but I know that anytime I've seen a schema
along these lines, it turned out to be wrong :-).

> >Regarding query tokenization, yes, the query must be parsed using a
> >tokenizer appropriate to the query.  That was where the suggested
> >change to the query syntax came from:
> >
> >   SELECT rowid FROM t WHERE t MATCH 'tokenizer:<x> ...';
>
> If the tokenizer must be explicitly stated, wouldn't this require
> constructing queries dynamically?

The fts query is in a place where it's easy to construct dynamically
by adjusting the string passed into sqlite3_bind_*().  The distinction
I'm making is whether you're constructing the SQL query itself
dynamically, which, in my experience, almost always means you're
expressing something inappropriately.
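
To make that concrete, a rough sketch of the two approaches (the
tokenizer: prefix is just the syntax proposed earlier in this thread,
and t_en/t_kr are hypothetical per-language tables, shown only for
contrast):

   -- Fixed SQL; only the bound string changes per query:
   SELECT rowid FROM t WHERE t MATCH ?;
   --   bind 'tokenizer:icu_en pie' for an English query
   --   bind 'tokenizer:icu_kr pie' for a Korean query

   -- Versus rewriting the SQL itself for each language:
   SELECT rowid FROM t_en WHERE t_en MATCH 'pie';
   SELECT rowid FROM t_kr WHERE t_kr MATCH 'pie';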

> In addition to that, must not the user take care that the tokenizer properly
> matches the row? My column-tokenizer suggestion (as I imagine it) would
> release users from this responsibility.

Again, I'm asserting that, in general, the query is in a specific
language, in which case the content of the rows themselves is not
material.  You can certainly construct queries which contain elements
from multiple languages, but solving that is a very hard problem
(matching an AND query whose elements require multiple tokenizers
would require the ability to change tokenizers within an fts column!).

> >Tokenizing the query using every possible tokenizer may result in more
> >results, but is unlikely to result in higher quality of results.
>
> Besides this, I am concerned that results might be missed because the
> query tokenizer does not match the text tokenizer.

This is a bigger problem than tokenizer selection.  Even if the text
and query are using the same tokenizer, tokenization for CJK is
generally driven by statistical rules applied to context.  Putting it
in terms of basic symbols, the text "ABCD" may tokenize as "A BC D",
while the query "BCD" may tokenize as "B CD".

> A final thought which occurred to me last night: How to handle tables with
> unregistered tokenizers? The column-tokenizer approach has all the
> information to throw an error at sqlite3_prepare(). For tokenizers defined
> at the row level, the error will only be available at sqlite3_step(). In other
> words: A prepared statement is not guaranteed to run successfully. AFAIK,
> this behaviour is as yet nowhere present in SQLite.

I'm not entirely certain that the final assertion is true given the
presence of things like fts which call back into SQLite.  I _believe_
that fts3 currently replicates the SQLite semantics correctly, but
there are definitely subtle edge cases, and it is pretty challenging
to wrap your head around them.

Furthermore, fts has no input at the prepare level.  It can refuse to
create/connect the table if any of the tokenizers used is missing,
but it cannot do anything more subtle.  [Sorry if this is what you
meant!]  One implementation tweak to my proposal would be to have fts
keep an internal table of the tokenizers used, to make things more
efficient.  That table could then be used to refuse to create/connect
the table unless every tokenizer it lists is registered.
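
Purely as a sketch of what such an internal table might look like
(nothing like this exists in fts today, and all of the names are made
up):

   -- Hypothetical bookkeeping table kept alongside the existing fts
   -- backing tables for virtual table t:
   CREATE TABLE t_tokenizers(
     id   INTEGER PRIMARY KEY,  -- small integer referenced per row
     name TEXT UNIQUE           -- e.g. 'icu_en', 'icu_kr'
   );
   -- At connect time, fts could check that every name listed here is
   -- registered, and refuse the connect if one is missing.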

A cute solution which just occurred to me would be to have the
complete set of tokenizers specified in the create statement.  Then
fts doesn't have to keep a mapping table, and the per-row cost is an
integer column, which is about as minimal as you're going to get.
Hmm.  Something like:

CREATE VIRTUAL TABLE t USING fts3(
  content,
  TOKENIZER DEFAULT icu_en (icu_en, icu_fr, icu_de, ...)
);

That feels clunky, though.
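
To spell out the storage claim, though, an imagined internal layout
(again, not anything fts actually does; the names are invented):

   -- The declared list fixes the mapping at CREATE time:
   --   0 = icu_en (the default), 1 = icu_fr, 2 = icu_de, ...
   -- so the backing content table only needs one extra small integer
   -- per row:
   CREATE TABLE t_content(
     docid     INTEGER PRIMARY KEY,
     content   TEXT,              -- the user's column
     tokenizer INTEGER DEFAULT 0  -- index into the declared list
   );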

Also, I'm starting to think that just arbitrarily defining the
row-specific tokenizer to only apply to columns without explicit
tokenizers might be a reasonable way to go.  So something like:

CREATE VIRTUAL TABLE t USING fts3(
  content,
  title_en TOKENIZER icu_en,
  title_fr TOKENIZER icu_fr,
  TOKENIZER DEFAULT icu_en (icu_en, icu_fr, icu_de, ...)
);

would mean that content would tokenize using icu_en, unless overridden
by the tokenizer column, while title_en would _always_ tokenize with
icu_en, and title_fr would _always_ tokenize with icu_fr.

Unfortunately, this leaves query handling a bit up in the air, and I'm
also still wondering which cases this makes possible that you couldn't
handle just as well with per-tokenizer virtual tables.

-scott
