On 04/08/2018 02:00 PM, miroslav.marango...@outlook.com wrote:
Hi,
For my use case I need to index some texts, and I also need be able to assign attributes to
sub-sequences/spans of tokens in the texts. I want to be able to search only for keywords/phrases
that occur in spans with a certain attribute. As an example, imagine we have a set of rich text
documents and we want to find the locations of the "SQLite rocks" phrase in that set, but
not just any instance of it- only those that have an attribute "bold".
The general idea, I'm considering at the moment is as follows:
1) create a table 'spans' with columns (doc_id, attrib_id, start_tok, end_tok),
where doc_id is equal to the corresponding rowid in the FTS table, and
start_tok and end_tok are the 0-based offsets that delimit a span
2) issue a match (sub?)query against the FTS table and obtain a list of (rowid,
list-of-offsets in that doc)
3) somehow convert the above results into (rowid, start_offset, end_offset) for
each entry of the list-of-offsets
4) join the results from 3) with the "spans" table on "rowid=spans.doc_id" where
"spans.attrib_id=X and start_offset >= spans.start_pos and end_offset <= spans.end_pos"
Do you think this general approach makes sense, and how would you approach the
problem if not? I only have a cursory knowledge of both SQL and SQLite at this
point, so it's quite possible I'm missing something obvious.
On the implementation side:
- as far as I understand FTS5 has some clear advantages over FTS3/4. Apparently
there isn't readily available offsets() function in FTS5 yet, but the API
should make writing one rather straightforward
- step 3) is what I wonder about the most... What would be a good way to convert a (doc_id,
<list-of-offsets>) row into (doc_id, start_offset, end_offset) tuples for every entry in the
list? I'd guess I will have to implement some sort of virtual table(a.k.a. "table-valued
function")? Perhaps a table-valued function that takes a FTS query as its parameter that it then
uses to obtain (doc_id, <list-of-offsets>), and expose each hit as (doc_id, start_offset,
end_offset) rows in the virtual table?
Any thoughts and ideas would be greatly appreciated.
Maybe just implement an FTS5 function to do the filtering in step 4
above. So that your query looks like:
SELECT * FROM fts5tbl('SQLite+rocks'), spans
WHERE ft.rowid=spans.doc_id
AND spans.attrib_id = 'bold'
AND custom_fts5_function(fts5tbl, spans.start_tok, spans.end_tok);
where custom_fts5_function() returns true if there are any phrase hits
with token offsets between its two trailing arguments. Or whatever it is
you require.
https://sqlite.org/fts5.html#custom_auxiliary_functions
Dan.
Thank you!
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users