[Input == great!]

----

Regarding space usage, my current prototype stores an additional
column on %_content, defined like 'tokenizer TEXT DEFAULT NULL'.  If
NULL, the default tokenizer is used and the per-row cost is
negligible.  Otherwise, it holds the string naming the tokenizer,
generally 5-8 bytes, which I think is pretty reasonable.
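
For concreteness, here's roughly what the prototype's content table
looks like (everything other than the tokenizer column is
illustrative):

   CREATE TABLE t_content(
     docid INTEGER PRIMARY KEY,
     c0content TEXT,              -- the user's content, as now
     tokenizer TEXT DEFAULT NULL  -- NULL means "use the table's default"
   );

SQLite stores a NULL as a single type byte in the record header, which
is where the "negligible" above comes from.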

I think that if you do not need the ability to customize the
tokenizer on a per-row basis, there should be no storage cost compared
to the current implementation.  I'm not quite confident in making that
an explicit goal, but I generally do think fts should minimize the
cost of a feature for anyone who's not using that feature, so long as
doing so doesn't cause excessive complexity (I don't think keeping
this feature low-cost will).

There are a couple of options which could be used if this is deemed
excessive.  An internal mapping table could be used to represent the
tokenizers as varints, which would effectively make the per-row cost
fixed and tiny, at the cost of a page for the mapping table.  An
alternative would be to have the mapping table store the actual
rowid->tokenizer mapping, using various compression tricks (like the
fts index uses) to keep things compact.  In either case, there's a
somewhat higher fixed cost involved.
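
A rough sketch of the first option (table and column names
hypothetical):

   -- Hypothetical internal table mapping tokenizer names to small ids.
   CREATE TABLE t_tokenizer_map(
     id INTEGER PRIMARY KEY,   -- stored per-row as a varint, typically
                               -- one byte for small ids
     name TEXT UNIQUE          -- e.g. 'icu_kr'
   );

The per-row column would then hold the integer id rather than the
tokenizer's name.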

----

Regarding per-row versus per-column tokenizers, your suggestion to
have something like 'content_kr TOKENIZER icu_kr' for each variant is
reasonable, but I'm not certain what it would gain over simply having
separate tables for each variant.  Since data about records is pushed
into the schema itself, you potentially have to generate fairly
intricate queries, probably dynamically.  This approach also suffers
from the query-tokenization problem you mention, except insofar as you
MATCH only against specific columns.
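
To illustrate the intricacy (using your TOKENIZER syntax, which
doesn't exist yet, so this is hypothetical):

   -- Hypothetical per-column tokenizer declarations.
   CREATE VIRTUAL TABLE t USING fts(
     content_en TOKENIZER icu_en,
     content_kr TOKENIZER icu_kr
   );

   -- The application has to know which column holds which language,
   -- and direct each MATCH accordingly:
   SELECT rowid FROM t WHERE content_kr MATCH '...';

Every additional language adds a column, and the query-generation code
has to pick the right one (or UNION over several).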

To be clear on the type of problem my proposal was targeted at, say
you have something like Google Reader, where there are a bunch of
articles that you want to search over.  Each article generally doesn't
have multiple translated variants; instead, the language is a piece of
data about the article.  For this kind of system, pre-defining columns
for every language the system may encounter might result in dozens of
columns, with exactly one column used for any particular row.
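
Under my proposal that case looks something like this (the implicit
TOKENIZER column is the proposal itself; the rest is illustrative):

   CREATE VIRTUAL TABLE articles USING fts(content);

   -- Most rows use the table's default tokenizer...
   INSERT INTO articles (content) VALUES ('an English article');

   -- ...and the occasional row overrides it.
   INSERT INTO articles (content, tokenizer)
     VALUES ('...Korean text...', 'icu_kr');

One table, one content column, and only rows which override the
default pay any per-row cost.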

NOTE: I think that your idea of per-column default tokenizers may be
a good one for fts to have.  I'm just questioning whether it is
targeted at the same problem my proposal is.  I'm going to think about
whether the proposals can be sensibly merged (they're not quite
orthogonal, because using both on the same table effectively requires
per-field tokenizer specification).

----

Regarding query tokenization, yes, the query must be parsed using a
tokenizer appropriate to the query.  That was where the suggested
change to the query syntax came from:

   SELECT rowid FROM t WHERE t MATCH 'tokenizer:<x> ...';

Tokenizing the query using every possible tokenizer may result in
more results, but is unlikely to result in higher-quality results.
fts could do some sort of index-directed tokenization (if you tokenize
into a bunch of terms which aren't in the index, you won't hit
anything!), but that seems like a project for some future quarter.  I
think doing anything automagic with query tokenization is a challenge,
because queries tend much more towards sentence fragments than
document text does.

-scott


On 9/18/07, Ralf Junker <[EMAIL PROTECTED]> wrote:
> Hello Scott Hess,
>
> >In the interests of not committing something that people won't like, my 
> >current proposal would be to add an implicit TOKENIZER column, which will 
> >override the table's default tokenizer for that row.
>
> There are a few things I am worried about with this approach:
>
> 1. FTS storage size
>
> Will the TOKENIZER column not add to the overall size of the FTS storage,
> even if the default tokenizer is used? As FTS requires storing all text,
> its storage requirements are quite high already and have put people off
> SQLite as their full-text search implementation.
>
> 2. Potential incompatibility with query parser tokenizer
>
> The table's text tokenizer is used to tokenize the query string as well.
> AFAIK, both must be identical. I cannot see how this single query tokenizer
> can then cooperate with a potentially unlimited number of incompatible row
> tokenizers. Reparsing the query for each row is, I guess, out of the
> question for performance reasons.
>
> * Alternative suggestion
>
> Offer a per COLUMN tokenizer option instead of a per ROW one. This would get 
> rid of problem 1 because the tokenizer can be stored with the column 
> definition.
>
> The COLUMN tokenizer option would also help with problem 2: The engine can
> then parse the query according to each column's tokenizer setting. Not all
> queries might make sense with all columns, but at least the engine would
> guarantee that both are using the identical tokenizer. It would be up to
> the application to direct a query in a particular language at the
> appropriate columns only.
>
> I also find the per-column tokenizer override easier to grasp (for
> translations, for example), because one can have different language
> columns with different tokenizers: Content_EN, Content_KR, and so on.
> This of course assumes that the number of supported languages is limited.
> New languages can be added with ALTER TABLE, but an application supporting
> an infinite number of languages would probably opt for the
> one-table-per-language option.
>
> Ralf
