Re: [sqlite] SELECT from fts3 tokens, is it possible?

Max Vlasov Thu, 24 Dec 2009 00:30:50 -0800

Thanks, Scott,

I see problems with regular tables, but to discuss in details I have to make
some tests comparing the approaches.


I read once again the info about fts3 internals and would like to share new
thoughts.
Let's forget about "hit count". Although I mentioned google suggestions, it
was mainly for recognition of the principle of popup hint list, my primary
interest is just in getting tokens list that starts with entered string.

So if fts3 data stored in b-trees, where a token is the key, there is a
similarity in how sqlite searches for range of indexed varchars and how fts3
searches for tokens. You mentioned that a problem exists with merging
results, but general full-text search already have to merge them for found
doclist so the phrase "..Of course, this scheme makes queries more expensive
(as the FTS3 code may have to look up individual terms in more than one
b-tree and merge the results).." is appliable to tokens as well.
Moreover, I suppose that the code in fts3 that handles queries with
wildcards like Match("some*") can also be used for querying tokens matching
the same mask, the difference probably that the former should merge doclists
(from different segments/nodes) while the latter tokens themselves.

One of the problems (if such queries can be implemented technically) is the
proper language sintax. As the tokens are not columns of the virtual table,
this syntax have to be different from general Select.

On Wed, Dec 23, 2009 at 8:51 PM, Scott Hess <[email protected]> wrote:

> Doing this in the tokenizer is probably not great because you can't
> determine which of the tokens actually are in the database after
> accounting for deletes and updates.  When I was last thinking about
> this problem, I think the best option looked like keeping a distinct
> token index with token_id stored in the fts index.  Another
> alternative was to rearrange how segments are stored and merged so
> that the token part of the index would float above the doclists.
>
> If you wanted to hack something in right now, you could keep a set of
> parallel segments mapping tokens to counts of the number of hits for
> that token.  A document insert would generate a bunch of positive
> counts, a document delete a bunch of negative counts, merges would net
> them out, and if a count went to 0 that token would be dropped from
> that segment.  Queries would still be somewhat costly, but this index
> could probably be merged more aggressively than the fts index (it
> should be much smaller).
>
> If your database isn't too large, you could maybe even keep this in a
> regular SQLite-style table.  Maybe, maybe not, fts1 did something like
> that and it got terribly slow once the fts table had a few tens of
> thousands of documents.  The problem was that the tokens were
> distributed across a large portion of the index, so data locality went
> down the drain and every update was like a full table scan.
>
> -scott
>
>
> On Wed, Dec 23, 2009 at 7:21 AM, Max Vlasov <[email protected]> wrote:
> > Scott, thank,
> >
> > I see that the hit counts is the next level of complexity.
> > As for tokens, I think that if one really needs to implement this
> feature,
> > there would be a possible solution in allowing the program to "mirror"
> > tokenized" data. Right now as I see it would be possible (if a custom
> > tokenizer implemented), to collect such data inside xNext function, but
> > there are problems. For example, we should know we're not parsing Match
> > query. Is there a way to distinguish  whether xNext works with MATCH
> > operator content or from UPDATE or INSERT INTO? Because it makes no sense
> to
> > collect data from MATCH, because it may contain irrelevant, non-exising
> > words (but ironically it would be helpful for collecting "hit" data). If
> we
> > know inside xNext that the call from a real data appending then at least
> we
> > will be able to maintain a dictionary of words used at least once.
> >
> > Max
> >
> > On Wed, Dec 23, 2009 at 5:54 PM, Scott Hess <[email protected]> wrote:
> >
> >> The tokens are all there, so it is "theoretically possible".  But the
> >> performance would be very bad because it would require iterating over
> >> all the segment data, basically a worst-case query.  Even then, you
> >> wouldn't have information about hit counts.  To do it reasonably would
> >> require redesigning the data format to take this use-case into
> >> consideration.
> >>
> >> -scott
> >>
> >> On Wed, Dec 23, 2009 at 3:56 AM, Max Vlasov <[email protected]>
> wrote:
> >> > Is it theoretically possible to allow some kind of sql SELECT query
> >> > operations with tokens?
> >> >
> >> > Particulary one could implement suggestions similar to Google
> suggestions
> >> in
> >> > Web search, when for example "some" typed in a search box and a
> listbox
> >> > popped up with this database tokens starting with these letters shown
> >> > ("someone", "something" etc).
> >> > Having some experience with full-text in past once I did my own simple
> >> > implementation of full-text for sqlite, where all tokens saved in a
> >> > separated table that also has a text index, so in this case SELECT
> Title
> >> > FROM Tokens WHERE Title LIKE "some%" works as expected (with
> >> > case-sensitivity pragma effectively set). I tried to read the
> technical
> >> part
> >> > of http://www.sqlite.org/fts3.html document, but could not figure out
> >> > whether it is possible to implement this in the current implementation
> of
> >> > fts3
> >> >
> >> > Thanks
> >> >
> >> > Max
> >> > _______________________________________________
> >> > sqlite-users mailing list
> >> > [email protected]
> >> > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >> >
> >> _______________________________________________
> >> sqlite-users mailing list
> >> [email protected]
> >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >>
> > _______________________________________________
> > sqlite-users mailing list
> > [email protected]
> > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> [email protected]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] SELECT from fts3 tokens, is it possible?

Reply via email to