[sqlite] Some FTS5 guidance
You can create a custom tokenizer as well then use the standard search APIs. I imagine that functionality would work well in this case: https://sqlite.org/fts5.html#section_7 On Thu, Jan 7, 2016 at 3:59 PM, Stadin, Benjamin < Benjamin.Stadin at heidelberg-mobil.com> wrote: > One such algorithm would be a (generalized) Ukkonnen suffix tree ( > https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm). > It allows you to search efficiently for substrings. > It would be possible to do some match weigthing based on match distance > within words. But a general solution for a database is probably not trivial > to implement. > > Ben > > Von meinem iPad gesendet > > > Am 07.01.2016 um 21:46 schrieb Matthias-Christian Ott : > > > >> On 2016-01-07 19:31, Mario M. Westphal wrote: > >> I hence wonder if this problem has been tackled already and if there is > a > >> "standard" solution. > > > > If I understand you correctly, it seems that you are looking for a > > compound splitting or decompounding algorithm. Unfortunately there is > > not a "standard solution" for this. There are many languages in the > > world and for some usable compound splitting algorithms exist. There are > > also attempts to create statistical universal algorithms. > > > > As you said, for English a simple sub-string search might suffice but > > for other languages it more complex. I assume that you speak German. If > > you have a document that contains the term "Verkehrsleitsystem" and your > > search query is "Verkehr leiten", it's reasonable to assume that the > > document is relevant to the search query. Unfortunately a sub-string > > search could not find the document. Other languages are even more > > difficult (a textbook on linguistics will explain this better than I > can). > > > > Even if you have such algorithm, it's not trivial to score the results > > and there are more aspects to consider to create a simple search > > algorithm. For example, in English you will also have to do some > > analysis of the phrase structure to identify open compounds. > > > > Perhaps it helps to mention the languages you are interested in and the > > application you have in mind to evaluate whether the SQLite FTS5 could > > meet your requirements. > > ___ > > sqlite-users mailing list > > sqlite-users at mailinglists.sqlite.org > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > ___ > sqlite-users mailing list > sqlite-users at mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users >
[sqlite] Some FTS5 guidance
With fts4 you could search for matching terms in an fts4aux table, then use those to construct a query against the original table. You'd have a full scan of the fts index, but you'd not have to do a full table scan of the primary data. Unfortunately if there were a large number of hits in the index scan, then it would be cheaper to just do the full table scan and skip the index scan. I don't know if there's a similar thing for fts5 at this time. This wouldn't be as efficient as something more suited to substring matches (an N-gram index, maybe?), but I haven't heard anyone talking about writing a virtual table to do that. -scott On Fri, Jan 8, 2016 at 11:54 AM, Charles Leifer wrote: > You can create a custom tokenizer as well then use the standard search > APIs. I imagine that functionality would work well in this case: > https://sqlite.org/fts5.html#section_7 > > On Thu, Jan 7, 2016 at 3:59 PM, Stadin, Benjamin < > Benjamin.Stadin at heidelberg-mobil.com> wrote: > > > One such algorithm would be a (generalized) Ukkonnen suffix tree ( > > https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm). > > It allows you to search efficiently for substrings. > > It would be possible to do some match weigthing based on match distance > > within words. But a general solution for a database is probably not > trivial > > to implement. > > > > Ben > > > > Von meinem iPad gesendet > > > > > Am 07.01.2016 um 21:46 schrieb Matthias-Christian Ott : > > > > > >> On 2016-01-07 19:31, Mario M. Westphal wrote: > > >> I hence wonder if this problem has been tackled already and if there > is > > a > > >> "standard" solution. > > > > > > If I understand you correctly, it seems that you are looking for a > > > compound splitting or decompounding algorithm. Unfortunately there is > > > not a "standard solution" for this. There are many languages in the > > > world and for some usable compound splitting algorithms exist. There > are > > > also attempts to create statistical universal algorithms. > > > > > > As you said, for English a simple sub-string search might suffice but > > > for other languages it more complex. I assume that you speak German. If > > > you have a document that contains the term "Verkehrsleitsystem" and > your > > > search query is "Verkehr leiten", it's reasonable to assume that the > > > document is relevant to the search query. Unfortunately a sub-string > > > search could not find the document. Other languages are even more > > > difficult (a textbook on linguistics will explain this better than I > > can). > > > > > > Even if you have such algorithm, it's not trivial to score the results > > > and there are more aspects to consider to create a simple search > > > algorithm. For example, in English you will also have to do some > > > analysis of the phrase structure to identify open compounds. > > > > > > Perhaps it helps to mention the languages you are interested in and the > > > application you have in mind to evaluate whether the SQLite FTS5 could > > > meet your requirements. > > > ___ > > > sqlite-users mailing list > > > sqlite-users at mailinglists.sqlite.org > > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > > ___ > > sqlite-users mailing list > > sqlite-users at mailinglists.sqlite.org > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > > > ___ > sqlite-users mailing list > sqlite-users at mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users >
[sqlite] Some FTS5 guidance
One such algorithm would be a (generalized) Ukkonnen suffix tree (https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm). It allows you to search efficiently for substrings. It would be possible to do some match weigthing based on match distance within words. But a general solution for a database is probably not trivial to implement. Ben Von meinem iPad gesendet > Am 07.01.2016 um 21:46 schrieb Matthias-Christian Ott : > >> On 2016-01-07 19:31, Mario M. Westphal wrote: >> I hence wonder if this problem has been tackled already and if there is a >> "standard" solution. > > If I understand you correctly, it seems that you are looking for a > compound splitting or decompounding algorithm. Unfortunately there is > not a "standard solution" for this. There are many languages in the > world and for some usable compound splitting algorithms exist. There are > also attempts to create statistical universal algorithms. > > As you said, for English a simple sub-string search might suffice but > for other languages it more complex. I assume that you speak German. If > you have a document that contains the term "Verkehrsleitsystem" and your > search query is "Verkehr leiten", it's reasonable to assume that the > document is relevant to the search query. Unfortunately a sub-string > search could not find the document. Other languages are even more > difficult (a textbook on linguistics will explain this better than I can). > > Even if you have such algorithm, it's not trivial to score the results > and there are more aspects to consider to create a simple search > algorithm. For example, in English you will also have to do some > analysis of the phrase structure to identify open compounds. > > Perhaps it helps to mention the languages you are interested in and the > application you have in mind to evaluate whether the SQLite FTS5 could > meet your requirements. > ___ > sqlite-users mailing list > sqlite-users at mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] Some FTS5 guidance
On 2016-01-07 19:31, Mario M. Westphal wrote: > I hence wonder if this problem has been tackled already and if there is a > "standard" solution. If I understand you correctly, it seems that you are looking for a compound splitting or decompounding algorithm. Unfortunately there is not a "standard solution" for this. There are many languages in the world and for some usable compound splitting algorithms exist. There are also attempts to create statistical universal algorithms. As you said, for English a simple sub-string search might suffice but for other languages it more complex. I assume that you speak German. If you have a document that contains the term "Verkehrsleitsystem" and your search query is "Verkehr leiten", it's reasonable to assume that the document is relevant to the search query. Unfortunately a sub-string search could not find the document. Other languages are even more difficult (a textbook on linguistics will explain this better than I can). Even if you have such algorithm, it's not trivial to score the results and there are more aspects to consider to create a simple search algorithm. For example, in English you will also have to do some analysis of the phrase structure to identify open compounds. Perhaps it helps to mention the languages you are interested in and the application you have in mind to evaluate whether the SQLite FTS5 could meet your requirements.
[sqlite] Some FTS5 guidance
Hello, I recently looked into FTS 5. The documentation is clear and I was able to get it running with a small test database quickly. And the response times are awesome :-) My question: At least as I understand it at this point, FTS can only do prefix queries. If my database contains the words moon moonlight moonshine shine sunshine A FTS query like "moon*" will find all three terms starting with "moon" - very fast. But there is no way to find "moonshine" or "sunshine" by running a query for "shine" or "shine*" ? Currently I search using LIKE and there such 'contains' queries are easy. My users of course don't understand all this and want to find all words containing shine, wherever the term appears in the word. The only idea I had so far was to write my own tokenizer and to store each word with every possible 'sub-word': When "moonshine" is added to FTS, it is split into multiple words: moonshine oonshine onshine nshine shine . (maybe I limit this to a minimum of 2 or 3 characters). This of course produces a log of extra entries in FTS and may impact performance and database size. I hence wonder if this problem has been tackled already and if there is a "standard" solution.
[sqlite] Some FTS5 guidance
I've never used FTS, just throwing an off-the-wall idea out: instead of tokenising partial words, could you tokenise/store the reverse of each word (possibly in a separate place if that can be done): enihsnoom enihs enihsnus Then search for "enihs" as well as "shine". If you can't separate the forward and reversed versions, you'd have to filter-out when "dog" matches "god". Graham Sent from Samsung Mobile Original message From: "Mario M. Westphal" <m...@mwlabs.de> Date: 07/01/2016 18:31 (GMT+00:00) To: sqlite-users at mailinglists.sqlite.org Subject: [sqlite] Some FTS5 guidance Hello, I recently looked into FTS 5. The documentation is clear and I was able to get it running with a small test database quickly. And the response times are awesome :-) My question: At least as I understand it at this point, FTS can only do prefix queries. If my database contains the words moon moonlight moonshine shine sunshine A FTS query like "moon*" will find all three terms starting with "moon" - very fast. But there is no way to find "moonshine" or "sunshine" by running a query for "shine" or "shine*" ? Currently I search using LIKE and there such 'contains' queries are easy. My users of course don't understand all this and want to find all words containing shine, wherever the term appears in the word. The only idea I had so far was to write my own tokenizer and to store each word with every possible 'sub-word': When "moonshine" is added to FTS, it is split into multiple words: moonshine oonshine onshine nshine shine . (maybe I limit this to a minimum of 2 or 3 characters). This of course produces a log of extra entries in FTS and may impact performance and database size. I hence wonder if this problem has been tackled already and if there is a "standard" solution. ___ sqlite-users mailing list sqlite-users at mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users