FTFY, you're welcome ;-)
Now *MAYBE* someone can read it and possibly reply.

--
   --
      --
         --???--
        K e V i N

On Thu, Jul 2, 2015 at 9:11 AM, <shuricksoft at ukr.net> wrote:

> We use SQLite for indexing and searching the text contents of our app
> using FTS4
> (about 27k unique words, about 1 million words for the whole contents). In
> particular,
> we use the offsets function. Currently, after some testing, we?re
> experiencing a plenty
> of problems with finding the results needed.
>
> For the forms of the words searching we use the ?all-to-all? way, which
> means we have
> some standard English endings and words forms and modify the initial
> request so that
> all possible combinations are included and separated by OR from each
> other.
>
> I. e. if we have two forms and two words in the request (for simplicity),
> that would look
> like (the MATCH part):
>
> ?(word1_form1 NEAR/10 word2_form1)
> OR (word1_form1 NEAR/10 word2_form2)
> OR (word1_form2 NEAR/10 word2_form1)
> OR (word1_form2 NEAR/10 word2_form2)?.
>
> Initially, the problem appeared that the query returned offsets for
> absolutely wrong words.
> While searching for ?honest fair?, we got words like ?good?, ?sport?,
> ?natural? offsets.
> We moved from the system provided (iOS, OS X) SQLite (version 3.8.5,
> further called
> ?old SQLite?) to the latest one available on the official site (version
> 3.8.10.2,
> further called ?new SQLite? or just ?SQLite?), which solved the issue, but
> returned offsets
> for words which were not near (according to NEAR/10). I supposed it?s
> because the request
> was too big because of lots of the words forms. But, for example, this
> request
>
> for ?offense is rank?:
> select documents.name, contents.rowid, part, offsets(contents)
> from contents left join documents on contents.document_id = documents.id
> where text match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\")'
> order by document_id
>
> returned the correct results, while this one returned false positive
> results (0 stands for
> apostrophe so it?s not interpreted as a separate token for the simple
> tokenizer we use):
>
> select documents.name, contents.rowid, part, offsets(contents)
> from contents left join documents on contents.document_id = documents.id
> where text match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\")
> OR (\"offense\" NEAR/10 \"is\" NEAR/10 \"rank0s\")'
> order by document_id
>
> The experiments revealed that ?rank0s? can be whatever: ranks",
> "rankqwerty" and so on,
> even "rankqwertyvrfbgbrevwkefovmwpsrvrm" or "yugbuoiipkipnuo?. If it's
> removed, nothing
> found. If after that the first word is modified, the one correct result
> found. Also, a
> search for ?speak again? didn?t find the results at all, though there
> should be quite a
> lot of them.
>
> The ?database disk image is malformed? error was reported, but integrity
> check completed
> with no errors, the database was recreated and that didn?t help, meanwhile
> the old SQLite
> did find the results for this request in exactly that file with no error
> reported.
>
> Also, the new SQLite worked well for the same request if the offsets
> function was replaced
> with the snippet function (that also solved the false positive results
> problem described
> above). The search for ?father? returned not all results. For example, the
> first result
> which can manually be found in the contents table using SQLite Database
> Browser for this
> request:
>
> select documents.name, contents.rowid, part, offsets(contents)
> from contents left join documents on contents.document_id = documents.id
> where text match '(\"father\")
> OR (\"fathere\") OR (\"fathering\")
> OR (\"fatherish\") OR (\"fathers\")
> OR (\"fatheres\") OR (\"fatherian\")
> OR (\"fatheral\") OR (\"father0s\")'
> and document_id in (25)
> order by document_id
>
> missed when actually searched, the snippet function doesn?t help here,
> neither helps
> searching for a single word form. Error logging with
>
> sqlite3_config(SQLITE_CONFIG_LOG, errorLogCallback, NULL)
>
> was never called.
>  Any ideas?
>

Reply via email to