Hi,

Thank you everybody for your replies and ideas on "FTS and postfix 
search". I thought about it a lot and came to this conclusion: in general, 
a fulltext system does not need to find subwords. If it did, I would 
either need no index at all (and search through the whole data) or put 
the subwords into the index too.

So if my documents were English I would be all set, even better with 
the Porter tokenizer.

But unfortunately the language is German, and there are words composed 
of other words (e.g. Telefonkabel = telephone cable). It is still a 
requirement to find "Telefonkabel" when searching for "Kabel". Does 
anybody have an idea what the best approach would be? In my opinion, I 
have no choice but to split these words using a predefined dictionary 
(e.g. {"Telefonkabel"} becomes {"telefon", "kabel", "telefonkabel"}). 
Even this is a challenge (the index generation should not take too 
long). My idea now would be to extend the FTS in some way to
a) support splitting words with a predefined dictionary, and
b) maybe support non-English (German) versions of the Porter 
stemming algorithm.

I have programming experience with C and C++ but no knowledge of SQLite. 
Where should I begin? How easy would it be to implement this, and how 
much time would it take?

I also found [1]. That indexer seems to be more powerful than the 
built-in FTS. However, I can't find support for word splitting there 
either. Does anybody have experience with that indexer? Would it be 
simpler to extend it? Maybe someone has already tested both... which 
one should I concentrate on, and which one is faster?

Thank you all again,

Luke

[1] http://ft3.sourceforge.net/

_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users