Since performance is going to depend on your data distribution, why don't
you just try the default way and see what happens to your query
performance?



Also take a look at adding the prefix option (e.g. prefix="1,2,3") to an
FTS4 table to add those extra indexes.  That's probably a lot more space
efficient, and faster for queries too, since it's more compact.
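
Something like this (an untested sketch -- the database file, table, and
column names here are made up for illustration):

  #include <assert.h>
  #include <sqlite3.h>

  int main(void){
    sqlite3 *db;
    int rc = sqlite3_open("test.db", &db);
    assert( rc==SQLITE_OK );

    /* Build prefix indexes of length 1, 2, and 3 alongside the normal
    ** full-text index. */
    rc = sqlite3_exec(db,
        "CREATE VIRTUAL TABLE docs USING fts4(body, prefix=\"1,2,3\")",
        0, 0, 0);
    assert( rc==SQLITE_OK );

    /* A prefix query that can be answered from the length-3 index. */
    rc = sqlite3_exec(db,
        "SELECT * FROM docs WHERE body MATCH 'hal*'",
        0, 0, 0);
    assert( rc==SQLITE_OK );

    sqlite3_close(db);
    return 0;
  }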



Then let the rest of us know what the performance difference is.





Michael D. Black

Senior Scientist

Advanced Analytics Directorate

Advanced GEOINT Solutions Operating Unit

Northrop Grumman Information Systems

________________________________
From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
behalf of Johannes Krude [johan...@krude.de]
Sent: Sunday, December 04, 2011 8:46 AM
To: General Discussion of SQLite Database
Subject: EXT :Re: [sqlite] RE Infinite Loop in MATCH on self written fts3 
tokenizer

hi,

On Sunday 04 December 2011 14:23:09 Black, Michael (IS) wrote:
> It says "here's token 'hal'" and if you return the pointer to "h" it points
> to the same place so it returns "hal" right back to you....ergo the loop.
I have read through the ext/fts3/fts3_expr.c code and found out the
following: *piEndOffset must point to the zero byte just after the
returned token, and fts3 expects the tokenizer to generate exactly one
token for each search string.

The first call to my xNext always returned the length-1 prefix with
piStartOffset = piEndOffset = 0.  fts3 therefore advanced its internal
pointer by 0 after each loop iteration and called xNext on the same
string again.

I fixed this by returning the longest prefix (the given word itself)
first and pointing piEndOffset just past the returned string.  Now it
works.
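
Roughly, the relevant part of my xNext now looks like this (a simplified
sketch, not my actual code; the cursor struct and its field names are
illustrative):

  #include <sqlite3.h>
  #include "fts3_tokenizer.h" /* tokenizer interface, from the SQLite tree */

  typedef struct PrefixCursor PrefixCursor;
  struct PrefixCursor {
    sqlite3_tokenizer_cursor base; /* base class, must be first */
    const char *zInput;            /* the word being tokenized */
    int nInput;                    /* length of zInput in bytes */
    int nNext;                     /* length of the next prefix to emit */
    int iToken;                    /* number of tokens emitted so far */
  };

  static int prefixNext(
    sqlite3_tokenizer_cursor *pCursor,
    const char **ppToken, int *pnBytes,
    int *piStartOffset, int *piEndOffset,
    int *piPosition
  ){
    PrefixCursor *p = (PrefixCursor*)pCursor;
    if( p->nNext<=0 ) return SQLITE_DONE;

    *ppToken = p->zInput;  /* every prefix starts at the word start */
    *pnBytes = p->nNext;
    *piStartOffset = 0;
    /* The end offset points just past the returned token.  Since the
    ** longest prefix (the whole word) is emitted first, the first
    ** token ends at the zero byte, so the fts3 expression parser
    ** advances past the word instead of looping forever. */
    *piEndOffset = p->nNext;
    *piPosition = p->iToken++;
    p->nNext--;  /* emit the next shorter prefix on the next call */
    return SQLITE_OK;
  }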

> You don't say why you're doing this.  FTS already supports prefix queries.
The fts documentation states that if I want to search efficiently for
prefixes, I should give the maximum size of such prefixes so that fts can
optimize for them. I want to search efficiently for prefixes of any
length.

The drawback of my tokenizer is that it consumes a lot of space: for
56 MB of strings I get a 1.2 GB file. Since everything is done in trees,
I assume a search with my tokenizer is O(log(n)), where n is the number
of tokens in the table. Is this still O(log(n)) if I write a tokenizer
for which input = output and use the fts prefix search?
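
By "input = output" I mean an xNext that hands the whole input back as a
single token, e.g. (same illustrative cursor layout as in the sketch
above):

  static int identityNext(
    sqlite3_tokenizer_cursor *pCursor,
    const char **ppToken, int *pnBytes,
    int *piStartOffset, int *piEndOffset,
    int *piPosition
  ){
    PrefixCursor *p = (PrefixCursor*)pCursor;
    if( p->iToken>0 ) return SQLITE_DONE;  /* exactly one token */
    *ppToken = p->zInput;
    *pnBytes = p->nInput;
    *piStartOffset = 0;
    *piEndOffset = p->nInput;
    *piPosition = p->iToken++;
    return SQLITE_OK;
  }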

Greetings johannes
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users