Yeah, I had thought about using the byte distance between words, but you
run into cases like these:

[Example A]
|word1|10charword|word2|

[Example B]
|word1|3charword|4charword|3charword|word2|

Using byte distances, both of these score roughly the same, whereas
Example A should score higher because only one token separates word1 and
word2.
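
To make the difference concrete, here is a small sketch comparing the two
measures on the examples above. The filler words and the helper name
byte_and_token_distance are made up for illustration, and it assumes tokens
are separated by a single one-byte separator.

```python
def byte_and_token_distance(tokens, first, second):
    """Return (byte_distance, token_distance) between two tokens.

    Assumes each token is followed by exactly one separator byte.
    """
    offsets = []
    pos = 0
    for tok in tokens:
        offsets.append(pos)
        pos += len(tok.encode("utf-8")) + 1  # +1 for the separator
    i, j = tokens.index(first), tokens.index(second)
    return offsets[j] - offsets[i], j - i


# Example A: one 10-character word in between.
example_a = ["word1", "abcdefghij", "word2"]
# Example B: three short words (3 + 4 + 3 characters) in between.
example_b = ["word1", "abc", "defg", "hij", "word2"]

dist_a = byte_and_token_distance(example_a, "word1", "word2")
dist_b = byte_and_token_distance(example_b, "word1", "word2")
print(dist_a)  # (17, 2)
print(dist_b)  # (19, 4)
```

The byte distances are almost identical (17 vs 19), while the token
distances (2 vs 4) separate the two cases cleanly.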

It does seem, though, that I could use the fts3_tokenizer somehow to get the
token positions, or that the underlying value exists but just isn't exposed
in an accessible way.
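
One possible workaround, sketched below, is to take the byte offsets that
the FTS3/FTS4 offsets() function reports (four integers per match: column,
term, byte offset, size) and map them back to token positions by
re-tokenizing the column text. This is only a sketch: the regex is my
approximation of the "simple" tokenizer (ASCII alphanumerics plus all bytes
>= 0x80), the helper names are made up, and the offsets string in the usage
lines is hand-written rather than real query output.

```python
import re
from bisect import bisect_right

# Approximation of the FTS "simple" tokenizer, applied to UTF-8 bytes.
TOKEN_RE = re.compile(rb"[0-9A-Za-z\x80-\xff]+")


def token_starts(text):
    """Byte offset at which each token of `text` starts."""
    return [m.start() for m in TOKEN_RE.finditer(text.encode("utf-8"))]


def parse_offsets(offsets_str):
    """Split an offsets() string into (column, term, byte_offset, size)."""
    nums = [int(n) for n in offsets_str.split()]
    return [tuple(nums[i:i + 4]) for i in range(0, len(nums), 4)]


def token_positions(text, offsets_str, column=0):
    """Return (term_number, token_index) pairs for matches in `column`."""
    starts = token_starts(text)
    result = []
    for col, term, byte_off, _size in parse_offsets(offsets_str):
        if col == column:
            # Index of the last token starting at or before byte_off.
            result.append((term, bisect_right(starts, byte_off) - 1))
    return result


text = "word1 abc defg hij word2"
# Hand-written stand-in for what offsets() might return for 'word1 word2'.
positions = token_positions(text, "0 0 0 5 0 1 19 5")
print(positions)  # [(0, 0), (1, 4)]
```

The cost is that every matching row gets tokenized a second time, and the
result is only right if the regex agrees with the tokenizer the table
actually uses.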

I implemented OkapiBM25f [1] but was hoping to implement something like the
proximity ranking described in [2], as it combines bag-of-words ranking with
proximity ranking. Although that article proposes precalculating the
distance pairs for all tokens, I'm happy to accept the time cost and
calculate them on the fly, as the space cost won't be worth it.
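
As a rough idea of what calculating on the fly could look like, here is a
sketch that sums an inverse-square distance over all pairs of occurrences of
different query terms. This is a simplified stand-in, not the exact scoring
formula from [2], and it assumes the token positions per term are already
available. The function name proximity_score is made up, and the result
would still need to be combined with the BM25 score.

```python
from itertools import combinations


def proximity_score(positions_by_term):
    """Sum 1/d^2 over occurrence pairs of distinct query terms.

    positions_by_term maps each query term to the token positions at
    which it occurs in one document.
    """
    score = 0.0
    terms = sorted(positions_by_term.items())
    for (_, pos_a), (_, pos_b) in combinations(terms, 2):
        for pa in pos_a:
            for pb in pos_b:
                d = abs(pa - pb)
                if d:  # skip identical positions to avoid dividing by zero
                    score += 1.0 / (d * d)
    return score


# Token positions from Example A and Example B.
score_a = proximity_score({"word1": [0], "word2": [2]})
score_b = proximity_score({"word1": [0], "word2": [4]})
print(score_a)  # 0.25
print(score_b)  # 0.0625
```

With token distances, Example A comes out ahead of Example B, which is the
ordering that byte distances fail to give.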

[1] https://github.com/neozenith/sqlite-okapi-bm25
[2] http://infolab.stanford.edu/~theobald/pub/proximity-spire07.pdf



--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/Proximity-ranking-with-FTS-tp76149p76152.html
Sent from the SQLite mailing list archive at Nabble.com.