This is a deep problem.  You need a segmenter (*) to tell you how you
can break the words into sub-words, then you need to build a tokenizer
which will return the right pieces.  This should be possible with
what's in there, but you're definitely going to have to sit down with
the source code and watch how the tokenizer works.  A huge proportion
of the fts code is involved with managing data; that should all be
ignorable, but understanding how documents are tokenized and assembled
into the index is important.  Aim your debugger at the code in
fts3_tokenizer1.c and at buildTerms() in fts3.c, and go to town.

-scott

(*)  Segmenter may be the wrong term; I think that applies to
languages where there are no clear spaces between words.  In this case
it may be de-compounding or something like that.  In fts terms both
cases (and stemming) would manifest through the tokenizer, so the
concepts may help give you an angle on things.
   http://en.wikipedia.org/wiki/Compound_(linguistics)
   http://en.wikipedia.org/wiki/Text_segmentation

On Fri, Aug 7, 2009 at 5:31 AM, Lukas Haase <lukasha...@gmx.at> wrote:
> Hi,
>
> Thank you everybody for your replies and ideas on "FTS and postfix
> search". I thought a lot about it and came to the conclusion that, in
> general, a fulltext system does not need to find subwords. If it did,
> I would either need no index at all (and search through the whole
> data) or would have to put the subwords into the index too.
>
> So if my documents were in English I would be all set, even more so
> with the porter tokenizer.
>
> But unfortunately the language is German, where there are words made
> up of other words (e.g. Telefonkabel = telephone cable). It is still
> a requirement to find "Telefonkabel" when searching for "Kabel". Does
> anybody have an idea what the best approach would be? In my opinion I
> have no choice but to split these words with a predefined dictionary
> (e.g. {"Telefonkabel"} becomes {"telefon", "kabel", "telefonkabel"}).
> Even this is a challenge (the index generation should not take too
> long). My idea now would be to extend the FTS in some way to
> a) support splitting words with a predefined dictionary
> b) maybe support non-English (German) versions of the Porter
> stemming algorithm.
>
> I have programming experience with C and C++ but no experience with
> SQLite internals. Where should I begin? How easy would this be to
> implement, and how much time would it take?
>
> I also found [1]. This indexer seems to be more powerful than the
> built-in FTS. However, I can't find support for word splitting there
> either. Does anybody have experience with that indexer? Would it be
> simpler to extend it? Maybe someone has already tested both... which
> one should I concentrate on, and which one is faster?
>
> Thank you again all,
>
> Luke
>
> [1] http://ft3.sourceforge.net/
>
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>