[sqlite] Weird FTS bugs (wrong, missing and false positive results)

2015-07-06 Thread Dan Kennedy
On 07/06/2015 07:23 PM, shuricksoft at ukr.net wrote:
>
>   ---  ? ---
> ?? : "Dan Kennedy" 
> : 2  2015, 14:26:05
>
>   
>> On 07/02/2015 09:24 PM, Kevin Benson wrote:
>>> FTFY, you're welcome ;-)
>> Now *MAYBE* someone can read it and possibly reply.
>>> --
>>>  --
>>> --
>>>--???--
>>>   K e V i N
>>>
>>
>> It's difficult to say. The level of detail you have provided is a little
>> overwhelming.
>>
>> If you are using a custom tokenizer and it behaves inconsistently you
>> could get these kind of problems. Or it could also be a bug in the
>> combination of NEAR, OR and an auxiliary FTS function.
>>
>> Are you able to post a database online along with a query that returns
>> an incorrect result? And the tokenizer implementation too if possible,
>> although it might be possible to figure it out without that.
>>
>> Does "neither helps searching for a single word form" mean that a simple
>> query like "text MATCH 'father'" is also failing? If so, that's the one
>> to post.
>>
>> You are compiling with SQLITE_ENABLE_FTS3_PARENTHESIS defined, correct?
>>
>> Dan.
> Sorry for the delayed reply, I was traveling these days.
>
> Thanks, Kevin! What was the problem with the line breaks? When I sent it, it 
> looked fine.
>
> Dan, I'm not using a custom tokenizer, just the standard one (called simple).
>
> I can send you the database, but we don't like it to be publicly available 
> online, so, please, tell me a comfortable way for you I can make it available 
> only to you (about 6 MB).

6MB? Just zip it and mail it to me.

Thanks,
Dan.




>
> Yes, "neither helps searching for a single word form" means just the same you 
> wrote.
>
> Yes, sure, I'm compiling with SQLITE_ENABLE_FTS3_PARENTHESIS.
>   
>   
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



[sqlite] Weird FTS bugs (wrong, missing and false positive results)

2015-07-06 Thread shuricks...@ukr.net
Sorry, tried to make the reply from another message and the original text, but 
it failed to be placed in the thread I wanted. Sorry about that, already 
e-mailed to the admin.


 ---  ? ---
 ?? : shuricksoft at ukr.net
 : 6  2015, 15:23:05



> 
> 
> ---  ? ---
> ?? : "Dan Kennedy" 
> : 2  2015, 14:26:05
> 
> 
> > On 07/02/2015 09:24 PM, Kevin Benson wrote:
> >> FTFY, you're welcome ;-)
> > Now *MAYBE* someone can read it and possibly reply.
> >>
> >> --
> >> --
> >> --
> >> --???--
> >> K e V i N
> >>
> >
> >
> > It's difficult to say. The level of detail you have provided is a little 
> > overwhelming.
> >
> > If you are using a custom tokenizer and it behaves inconsistently you 
> > could get these kind of problems. Or it could also be a bug in the 
> > combination of NEAR, OR and an auxiliary FTS function.
> >
> > Are you able to post a database online along with a query that returns 
> > an incorrect result? And the tokenizer implementation too if possible, 
> > although it might be possible to figure it out without that.
> >
> > Does "neither helps searching for a single word form" mean that a simple 
> > query like "text MATCH 'father'" is also failing? If so, that's the one 
> > to post.
> > 
> > You are compiling with SQLITE_ENABLE_FTS3_PARENTHESIS defined, correct?
> >
> > Dan.
> 
> Sorry for the delayed reply, I was traveling these days.
> 
> Thanks, Kevin! What was the problem with the line breaks? When I sent it, it 
> looked fine.
> 
> Dan, I'm not using a custom tokenizer, just the standard one (called simple).
> 
> I can send you the database, but we don't like it to be publicly available 
> online, so, please, tell me a comfortable way for you I can make it available 
> only to you (about 6 MB).
> 
> Yes, "neither helps searching for a single word form" means just the same you 
> wrote.
> 
> Yes, sure, I'm compiling with SQLITE_ENABLE_FTS3_PARENTHESIS.
> 
> 
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> 




[sqlite] Weird FTS bugs (wrong, missing and false positive results)

2015-07-06 Thread shuricks...@ukr.net


 ---  ? ---
?? : "Dan Kennedy" 
: 2  2015, 14:26:05


> On 07/02/2015 09:24 PM, Kevin Benson wrote:
>> FTFY, you're welcome ;-)
> Now *MAYBE* someone can read it and possibly reply.
>>
>> --
>> --
>>--
>>   --???--
>>  K e V i N
>>
>
>
> It's difficult to say. The level of detail you have provided is a little 
> overwhelming.
>
> If you are using a custom tokenizer and it behaves inconsistently you 
> could get these kind of problems. Or it could also be a bug in the 
> combination of NEAR, OR and an auxiliary FTS function.
>
> Are you able to post a database online along with a query that returns 
> an incorrect result? And the tokenizer implementation too if possible, 
> although it might be possible to figure it out without that.
>
> Does "neither helps searching for a single word form" mean that a simple 
> query like "text MATCH 'father'" is also failing? If so, that's the one 
> to post.
> 
> You are compiling with SQLITE_ENABLE_FTS3_PARENTHESIS defined, correct?
>
> Dan.

Sorry for the delayed reply, I was traveling these days.

Thanks, Kevin! What was the problem with the line breaks? When I sent it, it 
looked fine.

Dan, I'm not using a custom tokenizer, just the standard one (called simple).

I can send you the database, but we don't like it to be publicly available 
online, so, please, tell me a comfortable way for you I can make it available 
only to you (about 6 MB).

Yes, "neither helps searching for a single word form" means just the same you 
wrote.

Yes, sure, I'm compiling with SQLITE_ENABLE_FTS3_PARENTHESIS.




[sqlite] Weird FTS bugs (wrong, missing and false positive results)

2015-07-06 Thread Kevin Benson
On Mon, Jul 6, 2015 at 8:23 AM,  wrote:

>
>
>  ---  ? ---
> ?? : "Dan Kennedy" 
> : 2  2015, 14:26:05
>
>
> > On 07/02/2015 09:24 PM, Kevin Benson wrote:
> >> FTFY, you're welcome ;-)
> > Now *MAYBE* someone can read it and possibly reply.
> >>
>


> *SNIP*
>


>
> Sorry for the delayed reply, I was traveling these days.
>
> Thanks, Kevin! What was the problem with the line breaks? When I sent it,
> it looked fine.
>
>
I don't know that our receiving end could tell what was wrong with what you
sent:

http://sqlite.1065341.n5.nabble.com/Weird-FTS-bugs-wrong-missing-and-false-positive-results-td82936.html
. 

Suffice to say that my edit was only fair to poor (still too wide for Gmail
reply) ...but provided enough impetus.

Cheers~
--
   --
  --
 --???--
K e V i N


[sqlite] Weird FTS bugs (wrong, missing and false positive results)

2015-07-03 Thread Dan Kennedy
On 07/02/2015 09:24 PM, Kevin Benson wrote:
> FTFY, you're welcome ;-)
> Now *MAYBE* someone can read it and possibly reply.
>
> --
> --
>--
>   --???--
>  K e V i N
>
> On Thu, Jul 2, 2015 at 9:11 AM,  wrote:
>
>> We use SQLite for indexing and searching the text contents of our app
>> using FTS4
>> (about 27k unique words, about 1 million words for the whole contents). In
>> particular,
>> we use the offsets function. Currently, after some testing, we?re
>> experiencing a plenty
>> of problems with finding the results needed.
>>
>> For the forms of the words searching we use the ?all-to-all? way, which
>> means we have
>> some standard English endings and words forms and modify the initial
>> request so that
>> all possible combinations are included and separated by OR from each
>> other.
>>
>> I. e. if we have two forms and two words in the request (for simplicity),
>> that would look
>> like (the MATCH part):
>>
>> ?(word1_form1 NEAR/10 word2_form1)
>> OR (word1_form1 NEAR/10 word2_form2)
>> OR (word1_form2 NEAR/10 word2_form1)
>> OR (word1_form2 NEAR/10 word2_form2)?.
>>
>> Initially, the problem appeared that the query returned offsets for
>> absolutely wrong words.
>> While searching for ?honest fair?, we got words like ?good?, ?sport?,
>> ?natural? offsets.
>> We moved from the system provided (iOS, OS X) SQLite (version 3.8.5,
>> further called
>> ?old SQLite?) to the latest one available on the official site (version
>> 3.8.10.2,
>> further called ?new SQLite? or just ?SQLite?), which solved the issue, but
>> returned offsets
>> for words which were not near (according to NEAR/10). I supposed it?s
>> because the request
>> was too big because of lots of the words forms. But, for example, this
>> request
>>
>> for ?offense is rank?:
>> select documents.name, contents.rowid, part, offsets(contents)
>> from contents left join documents on contents.document_id = documents.id
>> where text match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\")'
>> order by document_id
>>
>> returned the correct results, while this one returned false positive
>> results (0 stands for
>> apostrophe so it?s not interpreted as a separate token for the simple
>> tokenizer we use):
>>
>> select documents.name, contents.rowid, part, offsets(contents)
>> from contents left join documents on contents.document_id = documents.id
>> where text match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\")
>> OR (\"offense\" NEAR/10 \"is\" NEAR/10 \"rank0s\")'
>> order by document_id
>>
>> The experiments revealed that ?rank0s? can be whatever: ranks",
>> "rankqwerty" and so on,
>> even "rankqwertyvrfbgbrevwkefovmwpsrvrm" or "yugbuoiipkipnuo?. If it's
>> removed, nothing
>> found. If after that the first word is modified, the one correct result
>> found. Also, a
>> search for ?speak again? didn?t find the results at all, though there
>> should be quite a
>> lot of them.
>>
>> The ?database disk image is malformed? error was reported, but integrity
>> check completed
>> with no errors, the database was recreated and that didn?t help, meanwhile
>> the old SQLite
>> did find the results for this request in exactly that file with no error
>> reported.
>>
>> Also, the new SQLite worked well for the same request if the offsets
>> function was replaced
>> with the snippet function (that also solved the false positive results
>> problem described
>> above). The search for ?father? returned not all results. For example, the
>> first result
>> which can manually be found in the contents table using SQLite Database
>> Browser for this
>> request:
>>
>> select documents.name, contents.rowid, part, offsets(contents)
>> from contents left join documents on contents.document_id = documents.id
>> where text match '(\"father\")
>> OR (\"fathere\") OR (\"fathering\")
>> OR (\"fatherish\") OR (\"fathers\")
>> OR (\"fatheres\") OR (\"fatherian\")
>> OR (\"fatheral\") OR (\"father0s\")'
>> and document_id in (25)
>> order by document_id
>>
>> missed when actually searched, the snippet function doesn?t help here,
>> neither helps
>> searching for a single word form. Error logging with
>>
>> sqlite3_config(SQLITE_CONFIG_LOG, errorLogCallback, NULL)
>>
>> was never called.
>>   Any ideas?


It's difficult to say. The level of detail you have provided is a little 
overwhelming.

If you are using a custom tokenizer and it behaves inconsistently you 
could get these kind of problems. Or it could also be a bug in the 
combination of NEAR, OR and an auxiliary FTS function.

Are you able to post a database online along with a query that returns 
an incorrect result? And the tokenizer implementation too if possible, 
although it might be possible to figure it out without that.

Does "neither helps searching for a single word form" mean that a simple 
query like "text MATCH 'father'" is also failing? If so, that's the one 
to post.

You are compiling with SQLITE_ENABLE_FTS3_PARENTHESIS defined, correct?

Dan.






[sqlite] Weird FTS bugs (wrong, missing and false positive results)

2015-07-02 Thread shuricks...@ukr.net
We use SQLite for indexing and searching the text contents of our app using 
FTS4 (about 27k unique words, about 1 million words for the whole contents). In 
particular, we use the offsets function. Currently, after some testing, we?re 
experiencing a plenty of problems with finding the results needed. For the 
forms of the words searching we use the ?all-to-all? way, which means we have 
some standard English endings and words forms and modify the initial request so 
that all possible combinations are included and separated by OR from each 
other. I. e. if we have two forms and two words in the request (for 
simplicity), that would look like (the MATCH part): ?(word1_form1 NEAR/10 
word2_form1) OR (word1_form1 NEAR/10 word2_form2) OR (word1_form2 NEAR/10 
word2_form1) OR (word1_form2 NEAR/10 word2_form2)?. Initially, the problem 
appeared that the query returned offsets for absolutely wrong words. While 
searching for ?honest fair?, we got words like ?good?, ?sport?, ?natural? 
offsets. We moved from the system provided (iOS, OS X) SQLite (version 3.8.5, 
further called ?old SQLite?) to the latest one available on the official site 
(version 3.8.10.2, further called ?new SQLite? or just ?SQLite?), which solved 
the issue, but returned offsets for words which were not near (according to 
NEAR/10). I supposed it?s because the request was too big because of lots of 
the words forms. But, for example, this request for ?offense is rank?: select 
documents.name, contents.rowid, part, offsets(contents) from contents left join 
documents on contents.document_id = documents.id where text match '(\"offense\" 
NEAR/10 \"is\" NEAR/10 \"rank\")' order by document_id returned the correct 
results, while this one returned false positive results (0 stands for 
apostrophe so it?s not interpreted as a separate token for the simple tokenizer 
we use): select documents.name, contents.rowid, part, offsets(contents) from 
contents left join documents on contents.document_id = documents.id where text 
match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\") OR (\"offense\" NEAR/10 
\"is\" NEAR/10 \"rank0s\")' order by document_id The experiments revealed that 
?rank0s? can be whatever: ranks", "rankqwerty" and so on, even 
"rankqwertyvrfbgbrevwkefovmwpsrvrm" or "yugbuoiipkipnuo?. If it's removed, 
nothing found. If after that the first word is modified, the one correct result 
found. Also, a search for ?speak again? didn?t find the results at all, though 
there should be quite a lot of them. The ?database disk image is malformed? 
error was reported, but integrity check completed with no errors, the database 
was recreated and that didn?t help, meanwhile the old SQLite did find the 
results for this request in exactly that file with no error reported. Also, the 
new SQLite worked well for the same request if the offsets function was 
replaced with the snippet function (that also solved the false positive results 
problem described above). The search for ?father? returned not all results. For 
example, the first result which can manually be found in the contents table 
using SQLite Database Browser for this request: select documents.name, 
contents.rowid, part, offsets(contents) from contents left join documents on 
contents.document_id = documents.id where text match '(\"father\") OR 
(\"fathere\") OR (\"fathering\") OR (\"fatherish\") OR (\"fathers\") OR 
(\"fatheres\") OR (\"fatherian\") OR (\"fatheral\") OR (\"father0s\")' and 
document_id in (25) order by document_id missed when actually searched, the 
snippet function doesn?t help here, neither helps searching for a single word 
form. Error logging with sqlite3_config(SQLITE_CONFIG_LOG, errorLogCallback, 
NULL) was never called. 
Any ideas? 


[sqlite] Weird FTS bugs (wrong, missing and false positive results)

2015-07-02 Thread Kevin Benson
FTFY, you're welcome ;-)
Now *MAYBE* someone can read it and possibly reply.

--
   --
  --
 --???--
K e V i N

On Thu, Jul 2, 2015 at 9:11 AM,  wrote:

> We use SQLite for indexing and searching the text contents of our app
> using FTS4
> (about 27k unique words, about 1 million words for the whole contents). In
> particular,
> we use the offsets function. Currently, after some testing, we?re
> experiencing a plenty
> of problems with finding the results needed.
>
> For the forms of the words searching we use the ?all-to-all? way, which
> means we have
> some standard English endings and words forms and modify the initial
> request so that
> all possible combinations are included and separated by OR from each
> other.
>
> I. e. if we have two forms and two words in the request (for simplicity),
> that would look
> like (the MATCH part):
>
> ?(word1_form1 NEAR/10 word2_form1)
> OR (word1_form1 NEAR/10 word2_form2)
> OR (word1_form2 NEAR/10 word2_form1)
> OR (word1_form2 NEAR/10 word2_form2)?.
>
> Initially, the problem appeared that the query returned offsets for
> absolutely wrong words.
> While searching for ?honest fair?, we got words like ?good?, ?sport?,
> ?natural? offsets.
> We moved from the system provided (iOS, OS X) SQLite (version 3.8.5,
> further called
> ?old SQLite?) to the latest one available on the official site (version
> 3.8.10.2,
> further called ?new SQLite? or just ?SQLite?), which solved the issue, but
> returned offsets
> for words which were not near (according to NEAR/10). I supposed it?s
> because the request
> was too big because of lots of the words forms. But, for example, this
> request
>
> for ?offense is rank?:
> select documents.name, contents.rowid, part, offsets(contents)
> from contents left join documents on contents.document_id = documents.id
> where text match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\")'
> order by document_id
>
> returned the correct results, while this one returned false positive
> results (0 stands for
> apostrophe so it?s not interpreted as a separate token for the simple
> tokenizer we use):
>
> select documents.name, contents.rowid, part, offsets(contents)
> from contents left join documents on contents.document_id = documents.id
> where text match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\")
> OR (\"offense\" NEAR/10 \"is\" NEAR/10 \"rank0s\")'
> order by document_id
>
> The experiments revealed that ?rank0s? can be whatever: ranks",
> "rankqwerty" and so on,
> even "rankqwertyvrfbgbrevwkefovmwpsrvrm" or "yugbuoiipkipnuo?. If it's
> removed, nothing
> found. If after that the first word is modified, the one correct result
> found. Also, a
> search for ?speak again? didn?t find the results at all, though there
> should be quite a
> lot of them.
>
> The ?database disk image is malformed? error was reported, but integrity
> check completed
> with no errors, the database was recreated and that didn?t help, meanwhile
> the old SQLite
> did find the results for this request in exactly that file with no error
> reported.
>
> Also, the new SQLite worked well for the same request if the offsets
> function was replaced
> with the snippet function (that also solved the false positive results
> problem described
> above). The search for ?father? returned not all results. For example, the
> first result
> which can manually be found in the contents table using SQLite Database
> Browser for this
> request:
>
> select documents.name, contents.rowid, part, offsets(contents)
> from contents left join documents on contents.document_id = documents.id
> where text match '(\"father\")
> OR (\"fathere\") OR (\"fathering\")
> OR (\"fatherish\") OR (\"fathers\")
> OR (\"fatheres\") OR (\"fatherian\")
> OR (\"fatheral\") OR (\"father0s\")'
> and document_id in (25)
> order by document_id
>
> missed when actually searched, the snippet function doesn?t help here,
> neither helps
> searching for a single word form. Error logging with
>
> sqlite3_config(SQLITE_CONFIG_LOG, errorLogCallback, NULL)
>
> was never called.
>  Any ideas?
>