Re: [sqlite] Constraining FTS5 results based on offsets()

Dan Kennedy Tue, 10 Apr 2018 01:49:02 -0700

On 04/08/2018 02:00 PM, miroslav.marango...@outlook.com wrote:

Hi,
For my use case I need to index some texts, and I also need be able to assign attributes to 
sub-sequences/spans of tokens in the texts. I want to be able to search only for keywords/phrases 
that occur in spans with a certain attribute. As an example, imagine we have a set of rich text 
documents and we want to find the locations of the "SQLite rocks" phrase in that set, but 
not just any instance of it- only those that have an attribute "bold".


The general idea, I'm considering at the moment is as follows:
1) create a table 'spans' with columns (doc_id, attrib_id, start_tok, end_tok), 
where doc_id is equal to the corresponding rowid in the FTS table, and 
start_tok and end_tok are the 0-based offsets that delimit a span
2) issue a match (sub?)query against the FTS table and obtain a list of (rowid, 
list-of-offsets in that doc)
3) somehow convert the above results into (rowid, start_offset, end_offset) for 
each entry of the list-of-offsets
4) join the results from 3) with the "spans" table on "rowid=spans.doc_id" where 
"spans.attrib_id=X and start_offset >= spans.start_pos and end_offset <= spans.end_pos"

Do you think this general approach makes sense, and how would you approach the 
problem if not? I only have a cursory knowledge of both SQL and SQLite at this 
point, so it's quite possible I'm missing something obvious.

On the implementation side:
- as far as I understand FTS5 has some clear advantages over FTS3/4. Apparently 
there isn't readily available offsets() function in FTS5 yet, but the API 
should make writing one rather straightforward

- step 3) is what I wonder about the most... What would be a good way to convert a (doc_id, 
<list-of-offsets>) row into (doc_id, start_offset, end_offset) tuples for every entry in the 
list? I'd guess I will have to implement some sort of virtual table(a.k.a. "table-valued 
function")? Perhaps a table-valued function that takes a FTS query as its parameter that it then 
uses to obtain (doc_id, <list-of-offsets>), and expose each hit as (doc_id, start_offset, 
end_offset) rows in the virtual table?

Any thoughts and ideas would be greatly appreciated.

Maybe just implement an FTS5 function to do the filtering in step 4above. So that your query looks like:


  SELECT * FROM fts5tbl('SQLite+rocks'), spans
  WHERE ft.rowid=spans.doc_id
    AND spans.attrib_id = 'bold'
    AND custom_fts5_function(fts5tbl, spans.start_tok, spans.end_tok);

where custom_fts5_function() returns true if there are any phrase hitswith token offsets between its two trailing arguments. Or whatever it isyou require.


  https://sqlite.org/fts5.html#custom_auxiliary_functions

Dan.


Thank you!
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Constraining FTS5 results based on offsets()

Reply via email to