On 06/17/2014 10:48 AM, Josh Wilson wrote:
Yeah I had thought about using the byte distance between words but you get
these instances:

[Example A]
|word1|10charword|word2|

[Example B]
|word1|3charword|4charword|3charword|word2|

By using byte distances, both of these score the same, where Example A
should score more highly.

But it would seem I can use the fts3_tokenizer somehow to get the token
positions or that this underlying value is available but just not stored in
an accessible manner.

I think it's possible to do. When it visits a row as part of a full-text search, internally FTS has a list of matches within the current row for each phrase in the query. Each match is stored as a column and token offset - the number of tokens that precede the match within the column text.

Is that what you need? Do you have any ideas for an fts4 interface it?

Dan.





I implemented OkapiBM25f [1] but was hoping to implement something like the
following proximity ranking [2] as it combines Bag-Of-Words ranking and
proximity ranking. Although that article proposes to precalculate the
distance pairs for all tokens, I'm happy to accept the TimeCost and
calculate on the fly as that SpaceCost won't be worth it.

[1] https://github.com/neozenith/sqlite-okapi-bm25
[2] http://infolab.stanford.edu/~theobald/pub/proximity-spire07.pdf



--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/Proximity-ranking-with-FTS-tp76149p76152.html
Sent from the SQLite mailing list archive at Nabble.com.
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to