Re: [sqlite] Ranking in fts.

2007-06-10 Thread Scott Hess

On 6/10/07, Ralf Junker <[EMAIL PROTECTED]> wrote:

Hello Scott Hess,

>I've lined up some time to work on fts, again, which means fts3.  One
>thing I'd like to include would be to order doclists by some baked-in
>ranking.  The idea is to sort to most important items to the front of
>the list, and then you can do queries which limit the number of hits
>and can thus be significantly faster for popular terms.  [Note that
>"limit the number of hits" cannot currently be done at the fts layer,
>but I'm thinking on that problem, too.]

Maybe I am missing something here, but I can already rank and
limit a FTS2 search with the help of a simple join:


Just to clarify a bit, the ranking I'm talking about here isn't to
control the output from a select statement, it's to help optimize
queries.  Let's say you wanted to run a select which logically looks
something like:

 SELECT rowid FROM fts LEFT JOIN aux USING (rowid)
  WHERE fts MATCH 't*' ORDER BY score(fts) DESCENDING LIMIT 100;

Without the score() function, this would just return the first 100
matches, which is probably unsatisfactory.  With score(), it could
return the 100 best matches.

Unfortunately, since virtual tables do not get insight into the
overall query, this will return ALL of the hits to 't*', then join
against the aux table, then order by score, then limit 100.  This is
faster than doing it all manually in the client app, but still is very
slow if 't*' matches, say, every row in an fts database containing a
million rows.

I don't want to have to enhance fts to fully comprehend queries like
this - that would be complicated and error-prone.  Instead, I'd like
to add a baked-in rank, with some sort of fts query addition to limit
the result set to a more reasonable scale.  For instance, in the above
query, you might limit the fts results to 1000 items, then use more
general SQL to further limit to 100 (or 10) items, or even do that
work in the client itself.  To a certain extent you could emulate this
using two tables (one with only the highest-rank items), but that
doesn't scale up to handle complicated fts queries.

-scott

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



[sqlite] Ranking in fts.

2007-06-08 Thread Scott Hess

I've lined up some time to work on fts, again, which means fts3.  One
thing I'd like to include would be to order doclists by some baked-in
ranking.  The idea is to sort to most important items to the front of
the list, and then you can do queries which limit the number of hits
and can thus be significantly faster for popular terms.  [Note that
"limit the number of hits" cannot currently be done at the fts layer,
but I'm thinking on that problem, too.]

Here's a proposal for how this might look:

CREATE VIRTUAL TABLE MyTable USING fts3(
 column_name,
 other_column,
 RANK rank_column_name DESCENDING,
 TOKENIZER simple
);

column_name and other_column would be just like in fts2, and would
manifest as TEXT columns.  Likewise the TOKENIZER clause.  RANK would
introduce a new column, in this case called rank_column_name, which
would be INTEGER.  The developer could then insert values into that
column which would be used by fts3 to order document lists.

ASCENDING or DESCENDING are listed after the column name to describe
whether higher values are better or lower values.  I think this is
necessary to handle open-ended ranges reasonably.  If you assigned
incrementing values (such as from an autoincrement rowid) to a
DESCENDING rank, then earlier documents will sort to the beginning
(like how fts2 currently works, where things are ordered by docids).
If you assigned a timestamp to an ASCENDING rank, then later documents
will sort to the beginning.  You can somewhat map one to the other by
setting the rank to "very big constant - actual number", but I think
the syntactic sugar is nice.  I'd expect the default to be ASCENDING.
[I'm open to the notion that I have ASCENDING and DESCENDING exactly
backwards.  There's the logical sense of how importance is handled,
and the physical sense of how it is stored.]

If no RANK is given at all, then things work just like they do now
(essentially ranked by rowid, DESCENDING).

Anyone out there have any thoughts which might be important to consider?

Thanks,
scott

-
To unsubscribe, send email to [EMAIL PROTECTED]
-