Mikkel Kamstrup Erlandsen wrote:
> 2006/11/15, Eyal Oren <[EMAIL PROTECTED]>:
>
> > On 11/15/06/11/06 22:02 +0100, Laurent Aguerreche wrote:
> > > I have begun to search for algorithms and I found:
> > >
> > > * N-grams
> > >   http://en.wikipedia.org/wiki/N-gram
> > > * Levenshtein
> > >   http://www.php.net/manual/en/function.levenshtein.php
> > > * similar text
> > >   http://www.php.net/manual/en/function.similar-text.php
> > > * soundex
> > >   http://www.php.net/manual/en/function.soundex.php
> >
> > soundex allows you to find terms that *sound* similar to an indexed
> > term, so that might actually solve the French/Swedish/Danish
> > transliteration problem.
> >
> > I'll ask a computational linguist colleague tomorrow, maybe he has
> > some ideas.
> >
> > I do see one problem, namely that in one context (programming code)
> > people seem to prefer exact matches, without stemming or
> > similarity-matching, while in other contexts (words in text, file
> > names) people do want stemming and some form of similarity search
> > regarding the orthography (spelling).
> > There is probably not one solution that fits these two uses, but
> > probably a search based on similarity would be fine also for source
> > code.
>
> I see there has been a lot of focus on how word breaking would work
> for various programming languages. I must say that I find that the
> least important use case. People writing programs very often are quite
> able to search everything and nothing and find what they want. It is of
> course still a case we should consider, but I do consider natural
> languages more important than their programmatic cousins.
>
> The Swedish example brought up about "öst" vs "ost" is a good one. It
> demonstrates the need for language-specific transliteration, and I
> expect the same to apply to word breaking. But don't we already have
> language-sensitive stemming? Maybe only French and English, but others
> could be added, no?

We already have this: we have both stemmers and stopword lists for
French, German, Danish, Spanish, Finnish, Norwegian, Italian, Dutch,
Portuguese, Russian and Swedish.

For stopwords see
http://cvs.gnome.org/viewcvs/tracker/data/languages/

and for stemmers see
http://cvs.gnome.org/viewcvs/tracker/src/libstemmer/src_c/

-- 
Mr Jamie McCracken
http://jamiemcc.livejournal.com/
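
To make the "öst" vs "ost" point from the quoted discussion concrete, here is a minimal Python sketch. It is not Tracker code; the function names `levenshtein` and `fold_accents` are hypothetical and chosen only for illustration. It computes the Levenshtein edit distance mentioned in the thread and applies a naive, language-blind accent folding, which is an assumption made here for demonstration and not something the thread proposes adopting.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


def fold_accents(word: str) -> str:
    """Hypothetical language-blind folding: drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    print(levenshtein("\u00f6st", "ost"))   # distance between "öst" and "ost"
    print(fold_accents("\u00f6st"))         # folded form of "öst"
```

The two words are one edit apart, and the naive folding maps "öst" onto "ost", so a language-blind index would treat two different Swedish words (east and cheese) as the same term. That is the case where per-language rules are needed.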
