On Wednesday, 15 November 2006 at 21:32 +0100, Ulrik Mikaelsson wrote:
>         Currently it is not possible and it is not normal... I do not know if
>         QDBM (which stores file names associated with keywords) can be set to
>         split a string like "ziegler-nichols" into "ziegler" and "nichols"
>         automatically for searching or if we need to split strings ourselves.
> 
> This question was raised before, regarding filenames with dashes and
> underscores in them. That time, the reply was that in C code, dashes
> and underscores often have a meaning. I think we'll need to be context
> sensitive in this case: regular documents and filenames usually
> require word-splitting, while source code usually doesn't. (However, in
> C, the string "difference=alpha-beta" actually has three interesting
> lexemes, and here too the dash should not create the lexeme
> "alpha-beta".)

And if you code in Lisp, you can write "foo-bar" as a variable: the whole
name!

IMHO we should split on these characters:
  . , ; - * / \ ! ? ' < > & ~ " | `
and I think that is all... I consider only these characters because they
are used in shells. So we do not split words around "_".
But perhaps we should add a parameter to choose whether a search splits
words or not.
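
To make the proposal concrete, here is a rough sketch in Python (standard
library only) of such a splitter. The names tokenize and split_words are
invented for illustration and are not QDBM or tracker API. Splitting on
whitespace as well is my assumption; the separator set is the one listed
above.

```python
import re

# Separator set copied from the list above; "_" is deliberately absent.
SEPARATORS = ".,;-*/\\!?'<>&~\"|`"

# Assumption: whitespace always separates terms, in addition to SEPARATORS.
_SPLIT_RE = re.compile("[" + re.escape(SEPARATORS) + r"\s]+")
_SPACE_RE = re.compile(r"\s+")


def tokenize(text, split_words=True):
    """Return the search terms found in text.

    split_words is a hypothetical name for the proposed parameter:
    when False, only whitespace separates terms.
    """
    pattern = _SPLIT_RE if split_words else _SPACE_RE
    # Drop the empty strings produced by leading or trailing separators.
    return [term for term in pattern.split(text) if term]


print(tokenize("ziegler-nichols"))
print(tokenize("ziegler-nichols", split_words=False))
print(tokenize("foo_bar"))
```

With splitting enabled, "ziegler-nichols" yields two terms, while "foo_bar"
stays whole because "_" is not a separator.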


>         What I also dislike about libstemmer (which aims to "reduce" strings
>         to radicals, to ignore plurals for instance) is that it does not
>         ignore accented characters, so if I have a file which contains
>         "éléphant", then "élephant" or "elephant" will not be found.
>         "éléphant" is the correct spelling, but it happens very often that
>         French people miss some accents or add superfluous ones... and it is
>         the same problem in other languages.
> 
> Unfortunately, this is not always applicable. For instance in Swedish,
> there's a big difference between the words "öst" and "ost", whose
> meanings are "east" and "cheese", respectively. However, "café" is
> often spelled "cafe", with the same meaning.
> 
> I'm not sure at all how to handle this.
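
To illustrate both sides of this, here is a small sketch of accent folding
using only Python's standard unicodedata module. fold_accents is a name
made up for the example; it is not something libstemmer or QDBM provides.

```python
import unicodedata


def fold_accents(word):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ("éléphant", "élephant", "café", "öst"):
    print(word, "->", fold_accents(word))
```

"éléphant" and "élephant" both fold to "elephant", which is what I want,
but "öst" folds to "ost", which is exactly the Swedish collision described
above. So folding would probably have to be optional or language-dependent.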

I have started looking for algorithms and found:

* N-grams
  http://en.wikipedia.org/wiki/N-gram
* levenshtein
  http://www.php.net/manual/en/function.levenshtein.php
* similar text
  http://www.php.net/manual/en/function.similar-text.php
* soundex
  http://www.php.net/manual/en/function.soundex.php
* metaphone
  http://www.php.net/manual/en/function.metaphone.php

I really do not know which algorithm is best, or what the pros and cons
of each one are.
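
As a starting point for comparing them, here is a minimal sketch of two of
the ideas above: Levenshtein edit distance and a character-bigram
similarity (Dice coefficient, which is only one of several ways to compare
N-gram sets). It is plain Python, not the PHP functions linked above, and
the function names are my own.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bigrams(word):
    """Set of all two-character substrings of word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def bigram_similarity(a, b):
    """Dice coefficient on character bigram sets, between 0.0 and 1.0."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(levenshtein("éléphant", "elephant"))
print(levenshtein("öst", "ost"))
print(bigram_similarity("öst", "ost"))
```

Note that neither measure knows about meaning: "café"/"cafe" and
"öst"/"ost" are both one edit apart, so a distance threshold alone cannot
separate the harmless variant from the different word.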

I found this library which implements N-grams:
http://hyperestraier.sourceforge.net/
It seems that it (or its predecessor Estraier) is used by Strigi...


Laurent.
_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
