Thank you Dan!
This should be much easier to implement than a full blown virtual table, but as 
far as I can tell(as I've mentioned, I'm not an expert in this) it doesn't do 
exactly what I want. I need all the locations of the search phrases, subject to 
the attribute constraint, in a document, but this query will only return the 
spans that contain one or more instance of the phrase. Or maybe I'm missing 
something?

Provided my understanding is correct, what do you think would be most the 
pragmatic way to change the query, so that it returns the locations of the 
phrases? What I imagine could work, even though it's admittedly ugly, is to 
write another UDF, let's call it "hacky_offsets()", that has some sort of 
"private" communication channel with custom_fts5_function(). So the query will 
be something like:

   SELECT fts5tbl.rowid, hacky_offsets()
   FROM fts5tbl('SQLite+rocks'), spans
   WHERE ft5tbl.rowid=spans.doc_id
    AND spans.attrib_id = 'bold'
    AND custom_fts5_function(fts5tbl, spans.start_tok, spans.end_tok);

When custom_fts5_function() finds instances of the phrases that fall between 
start_tok/end_tok it records these instances into a location hacky_offsets() 
knows about, so it can read them from there and return them to the application.
Would this even work and can you think of a more idiomatic/elegant 
implementation?

Thanks!


________________________________
From: sqlite-users <sqlite-users-boun...@mailinglists.sqlite.org> on behalf of 
Dan Kennedy <danielk1...@gmail.com>
Sent: Tuesday, April 10, 2018 11:49 AM
To: sqlite-users@mailinglists.sqlite.org
Subject: Re: [sqlite] Constraining FTS5 results based on offsets()

On 04/08/2018 02:00 PM, miroslav.marango...@outlook.com wrote:
> Hi,
> For my use case I need to index some texts, and I also need be able to assign 
> attributes to sub-sequences/spans of tokens in the texts. I want to be able 
> to search only for keywords/phrases that occur in spans with a certain 
> attribute. As an example, imagine we have a set of rich text documents and we 
> want to find the locations of the "SQLite rocks" phrase in that set, but not 
> just any instance of it- only those that have an attribute "bold".
>
> The general idea, I'm considering at the moment is as follows:
> 1) create a table 'spans' with columns (doc_id, attrib_id, start_tok, 
> end_tok), where doc_id is equal to the corresponding rowid in the FTS table, 
> and start_tok and end_tok are the 0-based offsets that delimit a span
> 2) issue a match (sub?)query against the FTS table and obtain a list of 
> (rowid, list-of-offsets in that doc)
> 3) somehow convert the above results into (rowid, start_offset, end_offset) 
> for each entry of the list-of-offsets
> 4) join the results from 3) with the "spans" table on "rowid=spans.doc_id" 
> where "spans.attrib_id=X and start_offset >= spans.start_pos and end_offset 
> <= spans.end_pos"
>
> Do you think this general approach makes sense, and how would you approach 
> the problem if not? I only have a cursory knowledge of both SQL and SQLite at 
> this point, so it's quite possible I'm missing something obvious.
>
> On the implementation side:
> - as far as I understand FTS5 has some clear advantages over FTS3/4. 
> Apparently there isn't readily available offsets() function in FTS5 yet, but 
> the API should make writing one rather straightforward
>
> - step 3) is what I wonder about the most... What would be a good way to 
> convert a (doc_id, <list-of-offsets>) row into (doc_id, start_offset, 
> end_offset) tuples for every entry in the list? I'd guess I will have to 
> implement some sort of virtual table(a.k.a. "table-valued function")? Perhaps 
> a table-valued function that takes a FTS query as its parameter that it then 
> uses to obtain (doc_id, <list-of-offsets>), and expose each hit as (doc_id, 
> start_offset, end_offset) rows in the virtual table?
>
> Any thoughts and ideas would be greatly appreciated.

Maybe just implement an FTS5 function to do the filtering in step 4
above. So that your query looks like:

   SELECT * FROM fts5tbl('SQLite+rocks'), spans
   WHERE ft.rowid=spans.doc_id
     AND spans.attrib_id = 'bold'
     AND custom_fts5_function(fts5tbl, spans.start_tok, spans.end_tok);

where custom_fts5_function() returns true if there are any phrase hits
with token offsets between its two trailing arguments. Or whatever it is
you require.

   https://sqlite.org/fts5.html#custom_auxiliary_functions

Dan.





>
> Thank you!
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to