On 04/08/2018 02:00 PM, miroslav.marango...@outlook.com wrote:
For my use case I need to index some texts, and I also need be able to assign attributes to 
sub-sequences/spans of tokens in the texts. I want to be able to search only for keywords/phrases 
that occur in spans with a certain attribute. As an example, imagine we have a set of rich text 
documents and we want to find the locations of the "SQLite rocks" phrase in that set, but 
not just any instance of it- only those that have an attribute "bold".

The general idea, I'm considering at the moment is as follows:
1) create a table 'spans' with columns (doc_id, attrib_id, start_tok, end_tok), 
where doc_id is equal to the corresponding rowid in the FTS table, and 
start_tok and end_tok are the 0-based offsets that delimit a span
2) issue a match (sub?)query against the FTS table and obtain a list of (rowid, 
list-of-offsets in that doc)
3) somehow convert the above results into (rowid, start_offset, end_offset) for 
each entry of the list-of-offsets
4) join the results from 3) with the "spans" table on "rowid=spans.doc_id" where 
"spans.attrib_id=X and start_offset >= spans.start_pos and end_offset <= spans.end_pos"

Do you think this general approach makes sense, and how would you approach the 
problem if not? I only have a cursory knowledge of both SQL and SQLite at this 
point, so it's quite possible I'm missing something obvious.

On the implementation side:
- as far as I understand FTS5 has some clear advantages over FTS3/4. Apparently 
there isn't readily available offsets() function in FTS5 yet, but the API 
should make writing one rather straightforward

- step 3) is what I wonder about the most... What would be a good way to convert a (doc_id, 
<list-of-offsets>) row into (doc_id, start_offset, end_offset) tuples for every entry in the 
list? I'd guess I will have to implement some sort of virtual table(a.k.a. "table-valued 
function")? Perhaps a table-valued function that takes a FTS query as its parameter that it then 
uses to obtain (doc_id, <list-of-offsets>), and expose each hit as (doc_id, start_offset, 
end_offset) rows in the virtual table?

Any thoughts and ideas would be greatly appreciated.

Maybe just implement an FTS5 function to do the filtering in step 4 above. So that your query looks like:

  SELECT * FROM fts5tbl('SQLite+rocks'), spans
  WHERE ft.rowid=spans.doc_id
    AND spans.attrib_id = 'bold'
    AND custom_fts5_function(fts5tbl, spans.start_tok, spans.end_tok);

where custom_fts5_function() returns true if there are any phrase hits with token offsets between its two trailing arguments. Or whatever it is you require.



Thank you!
sqlite-users mailing list

sqlite-users mailing list

Reply via email to