Hello Scott Hess,

>I think that if you do not need the ability to customize tokenizer on
>a per-row basis, there should be no storage cost compared to the
>current implementation.

Glad to read this!

>----
>
>Regarding per-row versus per-column tokenizers, your suggesting to
>have something like 'content_kr TOKENIZER icu_kr' for each variant is
>reasonable, but I'm not certain what gain it would have over simply
>having separate tables for each variant.  Since data about records is
>pushed into the schema itself, you potentially have to generate fairly
>intricate queries, probably dynamically.

Queries like this

  select * from recipe where recipe match 'pie';

would not need dynamic creation.
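
To make this concrete, here is how I picture the schema; the TOKENIZER
clause and the icu_en/icu_kr names are of course hypothetical syntax, not
anything FTS accepts today:

  CREATE VIRTUAL TABLE recipe USING fts3(
    content_en TOKENIZER icu_en,  -- English variant
    content_kr TOKENIZER icu_kr   -- Korean variant
  );

  -- one static query covers all variants:
  SELECT * FROM recipe WHERE recipe MATCH 'pie';

With one table per variant, the application would instead have to pick the
table at runtime, i.e. generate 'SELECT * FROM recipe_en ...' or
'SELECT * FROM recipe_kr ...' depending on the language.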

>This implementation also suffers from the query-tokenization problem you 
>mention.

I suggested column-tokenizers to work around the query-tokenization problem. 
The column-tokenizer definition would enable FTS to parse the query using the 
appropriate tokenizer for each row. Users would be guaranteed that each column 
would be searched with its matching query tokenizer. This would of course mean 
that, for multiple columns of different tokenizers, FTS would have to search 
each column individually.
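
For example, a table-wide match could conceptually be expanded by FTS into
one per-column match per tokenizer (a sketch of the internal behaviour, not
SQL a user could run today):

  -- the user writes:
  SELECT * FROM recipe WHERE recipe MATCH 'pie';

  -- FTS would internally evaluate the equivalent of:
  SELECT * FROM recipe
   WHERE content_en MATCH 'pie'  -- query parsed with icu_en
      OR content_kr MATCH 'pie'; -- query parsed with icu_kr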

>To be clear on the type of problem my proposal was targetted at, say
>you have something like Google Reader, where there are a bunch of
>articles that you want to search over.  Each article generally doesn't
>have multiple translated variants, instead the language is a piece of
>data about the article.  For this kind of system, pre-defining columns
>for every language the system may encounter might result in dozens of
>columns, with exactly one column used for any particular row.
>
>NOTE: I think that your idea of per-column default tokenizers may be a
>good idea for fts to have.  I'm just questioning whether it is
>targetted at the same problem my proposal is.

Both approaches certainly target somewhat different problems from a user's 
perspective. From a storage perspective, the column approach is reasonable if 
most columns contain non-null values or if the storage cost of a null value is 
negligible.
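
To illustrate the storage question with a sketch (the column set is invented
for the example):

  CREATE VIRTUAL TABLE articles USING fts3(
    content_en, content_de, content_fr, content_kr  -- ...dozens more
  );

In the Google Reader scenario every row would fill exactly one of these
columns and leave the rest NULL, so the approach stands or falls with what a
NULL column actually costs in FTS's storage format.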

>----
>
>Regarding query tokenization, yes, the query must be parsed using a
>tokenizer appropriate to the query.  That was where the suggested
>change to the query syntax came from:
>
>   SELECT rowid FROM t WHERE t MATCH 'tokenizer:<x> ...';

If the tokenizer must be stated explicitly, wouldn't this require constructing 
queries dynamically? In addition, wouldn't the user have to take care that the 
tokenizer properly matches the row? My column-tokenizer suggestion (as I 
imagine it) would relieve users of this responsibility.
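
With the explicit syntax, I picture the application splicing the tokenizer
name in at runtime, roughly like this (icu_en/icu_kr are again made-up
names):

  -- for a query known to be English:
  SELECT rowid FROM t WHERE t MATCH 'tokenizer:icu_en pie';
  -- for a query known to be Korean:
  SELECT rowid FROM t WHERE t MATCH 'tokenizer:icu_kr 파이';

and it is up to the application to pick the right name for the rows it wants
to hit.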

>Tokenizing the query using every possible tokenizer may result in more
>results, but is unlikely to result in higher quality of results.

Besides this, I am concerned that results might be missed because the query 
tokenizer does not match the text tokenizer.
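
A sketch with the existing porter and simple tokenizers shows the kind of
mismatch I mean (FTS of course uses a single tokenizer per table today; the
mismatch below is the hypothetical part):

  -- text indexed with the porter stemmer: 'running' is stored as 'run'
  CREATE VIRTUAL TABLE t USING fts3(content, tokenize=porter);
  INSERT INTO t(content) VALUES('running');

  -- if the query side were parsed with the simple tokenizer instead,
  -- 'running' would stay 'running' and never match the indexed 'run'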

------

A final thought which occurred to me last night: how should tables with 
unregistered tokenizers be handled? The column-tokenizer approach has all the 
information needed to throw an error at sqlite3_prepare(). For tokenizers 
defined at the row level, the error only becomes detectable at sqlite3_step(). 
In other words: a prepared statement is not guaranteed to run successfully. 
AFAIK, this behaviour does not yet exist anywhere in SQLite. FTS could of 
course just return NULL for rows with unavailable tokenizers, but this would 
silently exclude potential matches and is, IMO, undesirable.
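
To spell out the row-level case (the per-row tokenizer column is my
invention; the proposal did not fix how the tokenizer would be stored):

  -- a row whose tokenizer is not registered in this process:
  INSERT INTO t(tokenizer, content) VALUES('icu_xx', 'some text');

  -- SELECT rowid FROM t WHERE t MATCH 'pie';
  -- sqlite3_prepare() succeeds: the statement itself names no tokenizer.
  -- sqlite3_step() can only fail once it reaches the row that needs the
  -- unregistered icu_xx tokenizer.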

Well, enough said. Thanks for your feedback; I just hope mine makes some sense, 
too ;-)

Ralf 

