[sqlite] Some FTS5 guidance

2016-01-08 Thread Charles Leifer
You can create a custom tokenizer as well then use the standard search
APIs. I imagine that functionality would work well in this case:
https://sqlite.org/fts5.html#section_7

On Thu, Jan 7, 2016 at 3:59 PM, Stadin, Benjamin <
Benjamin.Stadin at heidelberg-mobil.com> wrote:

> One such algorithm would be a (generalized) Ukkonnen suffix tree (
> https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm).
> It allows you to search efficiently for substrings.
> It would be possible to do some match weigthing based on match distance
> within words. But a general solution for a database is probably not trivial
> to implement.
>
> Ben
>
> Von meinem iPad gesendet
>
> > Am 07.01.2016 um 21:46 schrieb Matthias-Christian Ott :
> >
> >> On 2016-01-07 19:31, Mario M. Westphal wrote:
> >> I hence wonder if this problem has been tackled already and if there is
> a
> >> "standard" solution.
> >
> > If I understand you correctly, it seems that you are looking for a
> > compound splitting or decompounding algorithm. Unfortunately there is
> > not a "standard solution" for this. There are many languages in the
> > world and for some usable compound splitting algorithms exist. There are
> > also attempts to create statistical universal algorithms.
> >
> > As you said, for English a simple sub-string search might suffice but
> > for other languages it more complex. I assume that you speak German. If
> > you have a document that contains the term "Verkehrsleitsystem" and your
> > search query is "Verkehr leiten", it's reasonable to assume that the
> > document is relevant to the search query. Unfortunately a sub-string
> > search could not find the document. Other languages are even more
> > difficult (a textbook on linguistics will explain this better than I
> can).
> >
> > Even if you have such algorithm, it's not trivial to score the results
> > and there are more aspects to consider to create a simple search
> > algorithm. For example, in English you will also have to do some
> > analysis of the phrase structure to identify open compounds.
> >
> > Perhaps it helps to mention the languages you are interested in and the
> > application you have in mind to evaluate whether the SQLite FTS5 could
> > meet your requirements.
> > ___
> > sqlite-users mailing list
> > sqlite-users at mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Some FTS5 guidance

2016-01-08 Thread Scott Hess
With fts4 you could search for matching terms in an fts4aux table, then use
those to construct a query against the original table.  You'd have a full
scan of the fts index, but you'd not have to do a full table scan of the
primary data.  Unfortunately if there were a large number of hits in the
index scan, then it would be cheaper to just do the full table scan and
skip the index scan.  I don't know if there's a similar thing for fts5 at
this time.

This wouldn't be as efficient as something more suited to substring matches
(an N-gram index, maybe?), but I haven't heard anyone talking about writing
a virtual table to do that.

-scott


On Fri, Jan 8, 2016 at 11:54 AM, Charles Leifer  wrote:

> You can create a custom tokenizer as well then use the standard search
> APIs. I imagine that functionality would work well in this case:
> https://sqlite.org/fts5.html#section_7
>
> On Thu, Jan 7, 2016 at 3:59 PM, Stadin, Benjamin <
> Benjamin.Stadin at heidelberg-mobil.com> wrote:
>
> > One such algorithm would be a (generalized) Ukkonnen suffix tree (
> > https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm).
> > It allows you to search efficiently for substrings.
> > It would be possible to do some match weigthing based on match distance
> > within words. But a general solution for a database is probably not
> trivial
> > to implement.
> >
> > Ben
> >
> > Von meinem iPad gesendet
> >
> > > Am 07.01.2016 um 21:46 schrieb Matthias-Christian Ott :
> > >
> > >> On 2016-01-07 19:31, Mario M. Westphal wrote:
> > >> I hence wonder if this problem has been tackled already and if there
> is
> > a
> > >> "standard" solution.
> > >
> > > If I understand you correctly, it seems that you are looking for a
> > > compound splitting or decompounding algorithm. Unfortunately there is
> > > not a "standard solution" for this. There are many languages in the
> > > world and for some usable compound splitting algorithms exist. There
> are
> > > also attempts to create statistical universal algorithms.
> > >
> > > As you said, for English a simple sub-string search might suffice but
> > > for other languages it more complex. I assume that you speak German. If
> > > you have a document that contains the term "Verkehrsleitsystem" and
> your
> > > search query is "Verkehr leiten", it's reasonable to assume that the
> > > document is relevant to the search query. Unfortunately a sub-string
> > > search could not find the document. Other languages are even more
> > > difficult (a textbook on linguistics will explain this better than I
> > can).
> > >
> > > Even if you have such algorithm, it's not trivial to score the results
> > > and there are more aspects to consider to create a simple search
> > > algorithm. For example, in English you will also have to do some
> > > analysis of the phrase structure to identify open compounds.
> > >
> > > Perhaps it helps to mention the languages you are interested in and the
> > > application you have in mind to evaluate whether the SQLite FTS5 could
> > > meet your requirements.
> > > ___
> > > sqlite-users mailing list
> > > sqlite-users at mailinglists.sqlite.org
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > ___
> > sqlite-users mailing list
> > sqlite-users at mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Some FTS5 guidance

2016-01-07 Thread Stadin, Benjamin
One such algorithm would be a (generalized) Ukkonnen suffix tree 
(https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm). 
It allows you to search efficiently for substrings. 
It would be possible to do some match weigthing based on match distance within 
words. But a general solution for a database is probably not trivial to 
implement.

Ben

Von meinem iPad gesendet

> Am 07.01.2016 um 21:46 schrieb Matthias-Christian Ott :
> 
>> On 2016-01-07 19:31, Mario M. Westphal wrote:
>> I hence wonder if this problem has been tackled already and if there is a
>> "standard" solution.
> 
> If I understand you correctly, it seems that you are looking for a
> compound splitting or decompounding algorithm. Unfortunately there is
> not a "standard solution" for this. There are many languages in the
> world and for some usable compound splitting algorithms exist. There are
> also attempts to create statistical universal algorithms.
> 
> As you said, for English a simple sub-string search might suffice but
> for other languages it more complex. I assume that you speak German. If
> you have a document that contains the term "Verkehrsleitsystem" and your
> search query is "Verkehr leiten", it's reasonable to assume that the
> document is relevant to the search query. Unfortunately a sub-string
> search could not find the document. Other languages are even more
> difficult (a textbook on linguistics will explain this better than I can).
> 
> Even if you have such algorithm, it's not trivial to score the results
> and there are more aspects to consider to create a simple search
> algorithm. For example, in English you will also have to do some
> analysis of the phrase structure to identify open compounds.
> 
> Perhaps it helps to mention the languages you are interested in and the
> application you have in mind to evaluate whether the SQLite FTS5 could
> meet your requirements.
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Some FTS5 guidance

2016-01-07 Thread Matthias-Christian Ott
On 2016-01-07 19:31, Mario M. Westphal wrote:
> I hence wonder if this problem has been tackled already and if there is a
> "standard" solution. 

If I understand you correctly, it seems that you are looking for a
compound splitting or decompounding algorithm. Unfortunately there is
not a "standard solution" for this. There are many languages in the
world and for some usable compound splitting algorithms exist. There are
also attempts to create statistical universal algorithms.

As you said, for English a simple sub-string search might suffice but
for other languages it more complex. I assume that you speak German. If
you have a document that contains the term "Verkehrsleitsystem" and your
search query is "Verkehr leiten", it's reasonable to assume that the
document is relevant to the search query. Unfortunately a sub-string
search could not find the document. Other languages are even more
difficult (a textbook on linguistics will explain this better than I can).

Even if you have such algorithm, it's not trivial to score the results
and there are more aspects to consider to create a simple search
algorithm. For example, in English you will also have to do some
analysis of the phrase structure to identify open compounds.

Perhaps it helps to mention the languages you are interested in and the
application you have in mind to evaluate whether the SQLite FTS5 could
meet your requirements.


[sqlite] Some FTS5 guidance

2016-01-07 Thread Mario M. Westphal
Hello,



I recently looked into FTS 5. 

The documentation is clear and I was able to get it running with a small
test database quickly. And the response times are awesome :-)



My question: 



At least as I understand it at this point, FTS can only do prefix queries.



If my database contains the words



moon

moonlight

moonshine

shine

sunshine



A FTS query like "moon*" will find all three terms starting with "moon" -
very fast.



But there is no way to find "moonshine" or "sunshine" by running a query for
"shine" or "shine*" ?



Currently I search using LIKE and there such 'contains' queries are easy. My
users of course don't understand all this and want to find all words
containing shine, wherever the term appears in the word.



The only idea I had so far was to write my own tokenizer and to store each
word with every possible 'sub-word':



When "moonshine" is added to FTS, it is split into multiple words:



moonshine
oonshine
onshine
nshine
shine
. 



(maybe I limit this to a minimum of 2 or 3 characters).



This of course produces a log of extra entries in FTS and may impact
performance and database size. 

I hence wonder if this problem has been tackled already and if there is a
"standard" solution. 



[sqlite] Some FTS5 guidance

2016-01-07 Thread Graham Holden
I've never used FTS, just throwing an off-the-wall idea out: instead of 
tokenising partial words, could you tokenise/store the reverse of each word 
(possibly in a separate place if that can be done):

enihsnoom
enihs
enihsnus

Then search for "enihs" as well as "shine". If you can't separate the forward 
and reversed versions, you'd have to filter-out when "dog" matches "god".

Graham

Sent from Samsung Mobile

 Original message 
From: "Mario M. Westphal" <m...@mwlabs.de> 
Date: 07/01/2016  18:31  (GMT+00:00) 
To: sqlite-users at mailinglists.sqlite.org 
Subject: [sqlite] Some FTS5 guidance 

Hello,



I recently looked into FTS 5. 

The documentation is clear and I was able to get it running with a small
test database quickly. And the response times are awesome :-)



My question: 



At least as I understand it at this point, FTS can only do prefix queries.



If my database contains the words



moon

moonlight

moonshine

shine

sunshine



A FTS query like "moon*" will find all three terms starting with "moon" -
very fast.



But there is no way to find "moonshine" or "sunshine" by running a query for
"shine" or "shine*" ?



Currently I search using LIKE and there such 'contains' queries are easy. My
users of course don't understand all this and want to find all words
containing shine, wherever the term appears in the word.



The only idea I had so far was to write my own tokenizer and to store each
word with every possible 'sub-word':



When "moonshine" is added to FTS, it is split into multiple words:



moonshine
oonshine
onshine
nshine
shine
. 



(maybe I limit this to a minimum of 2 or 3 characters).



This of course produces a log of extra entries in FTS and may impact
performance and database size. 

I hence wonder if this problem has been tackled already and if there is a
"standard" solution. 

___
sqlite-users mailing list
sqlite-users at mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users