On Wednesday 15 November 2006 at 21:32 +0100, Ulrik Mikaelsson wrote:
> > Currently it is not possible and it is not normal... I do not know if
> > QDBM (which stores file names associated with keywords) can be set to
> > split strings like "ziegler-nichols" into "ziegler" and "nichols"
> > automatically for searching or if we need to split strings ourselves.
>
> This question was raised before, regarding filenames with dashes and
> underscores in them. At that time, the reply was that in C code, dashes
> and underscores often have a meaning. I think we'll need to be context
> sensitive in this case, where regular documents and filenames usually
> require word-splitting, while source code usually doesn't. (However, in
> C, the string "difference=alpha-beta" actually has three interesting
> lexemes, and here too a dash should not create the lexeme "alpha-beta".)

And if you code in Lisp, you can write "foo-bar" as a variable: the whole
name! IMHO we should split on these characters:

    . , ; - * / \ ! ? ' < > & ~ " | `

and I think that is all... I consider only these characters because they
are used in shells. So we do not split words around "_". But perhaps we
should add a parameter to choose whether a search splits words or not.

> > What I also dislike about libstemmer (which aims to "reduce" strings
> > to radicals, to ignore plurals for instance) is that it does not
> > ignore accented characters, so if I have a file which contains
> > "éléphant", then "élephant" or "elephant" will not be found.
> > "éléphant" is the correct spelling, but it happens very often that
> > French people miss some accents or add superfluous ones... and it is
> > the same problem in other languages.
>
> Unfortunately, this is not always applicable. For instance in Swedish,
> there's a big difference between the words "öst" and "ost", which mean
> "east" and "cheese", respectively. However, "café" is often spelled
> "cafe", with the same meaning.
>
> I'm not sure at all how to handle this.

I have begun to search for algorithms and I found:

* N-grams
  http://en.wikipedia.org/wiki/N-gram
* levenshtein
  http://www.php.net/manual/en/function.levenshtein.php
* similar text
  http://www.php.net/manual/en/function.similar-text.php
* soundex
  http://www.php.net/manual/en/function.soundex.php
* metaphone
  http://www.php.net/manual/en/function.metaphone.php

I really do not know which algorithm is the best, or what the pros and
cons of each of them are.

I found this library which implements N-grams:
http://hyperestraier.sourceforge.net/
It seems that it (or its predecessor Estraier) is used by Strigi...

Laurent.
_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
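
As a rough illustration of the ideas in this message, and not Tracker's
actual code, here is a minimal sketch in Python. It splits on exactly the
characters listed above (so "_" is kept), exposes the split-or-not choice
as a hypothetical split_words parameter, folds accents with the standard
unicodedata module, and computes a plain Levenshtein distance. The names
tokenize, fold_accents and levenshtein are made up for the example.

```python
import re
import unicodedata

# Split characters exactly as listed in the message; "_" is deliberately absent.
SPLIT_CHARS = ".,;-*/\\!?'<>&~\"|`"

# Whitespace always separates words; the listed characters only on request.
_WS_ONLY = re.compile(r"\s+")
_WS_AND_PUNCT = re.compile("[" + re.escape(SPLIT_CHARS) + r"\s]+")


def tokenize(text, split_words=True):
    """Return the non-empty words of text.

    split_words stands in for the proposed "split or not" search
    parameter (a hypothetical name, not an existing option).
    """
    pattern = _WS_AND_PUNCT if split_words else _WS_ONLY
    return [word for word in pattern.split(text) if word]


def fold_accents(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]
```

With these definitions, tokenize("ziegler-nichols") gives "ziegler" and
"nichols", while split_words=False keeps the string whole. fold_accents
maps both "éléphant" and "élephant" to "elephant", but it also maps "öst"
to "ost", which is exactly the Swedish problem raised in the quoted
reply. An edit distance avoids folding altogether, yet it would still put
those two words only one edit apart.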
