2006/11/15, Eyal Oren <[EMAIL PROTECTED]>:

On 11/15/06/11/06 22:02 +0100, Laurent Aguerreche wrote:

>I have begun to search algorithms and I found:
>
>* N-grams
>  http://en.wikipedia.org/wiki/N-gram
>* levenshtein
>  http://www.php.net/manual/en/function.levenshtein.php
>* similar text
>  http://www.php.net/manual/en/function.similar-text.php
>* soundex
>  http://www.php.net/manual/en/function.soundex.php
soundex allows you to find terms that *sound* similar to an indexed term,
so that might actually solve the French/Swedish/Danish transliteration
problem.
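
To make the soundex idea concrete, here is a minimal sketch of the classic
(American) Soundex coding in Python. It is only an illustration, not code
from Tracker or from the PHP function linked above; the function name
`soundex` and the choice to ignore anything outside a-z are assumptions of
this sketch. The classic code table is tuned for English, so how well it
handles French, Swedish or Danish would still need checking.

```python
# Classic Soundex letter groups: consonants that sound alike share a digit.
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex key for word ("" if no a-z letters)."""
    # Assumption of this sketch: only plain a-z letters are considered.
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
        if len(key) == 4:
            break
    return key.ljust(4, "0")  # pad short keys with zeros


# Two differently spelled names that sound alike get the same key.
print(soundex("Robert"), soundex("Rupert"))
```

An index could store this key next to each term and compare keys at query
time instead of comparing the terms themselves.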

I'll ask a computational linguist colleague tomorrow, maybe he has some
ideas.

I do see one problem, namely that in one context (programming code) people
seem to prefer exact matches, without stemming or similarity-matching,
while in other contexts (words in text, file names) people do want
stemming and some form of similarity search regarding the orthography
(spelling). There is probably not one solution that fits both uses, but a
search based on similarity would probably also be fine for source code.
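
One way to picture a single mechanism serving both uses is an edit-distance
search with a tunable threshold: 0 gives exact matching (the source-code
case), a small positive value gives spelling-tolerant matching (the text
case). The sketch below is in Python and is only an illustration; the names
`levenshtein`, `search` and `max_distance` are made up for this example and
do not refer to anything in Tracker.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def search(query, terms, max_distance=0):
    """Return the indexed terms within max_distance edits of query."""
    return [t for t in terms if levenshtein(query, t) <= max_distance]


terms = ["color", "colour", "collar"]
print(search("colour", terms))                  # exact matches only
print(search("colour", terms, max_distance=1))  # tolerate one edit
```

Comparing the query against every term is far too slow for a real index;
the point here is only that exactness can be a parameter rather than a
separate code path.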


I see there has been a lot of focus on how word breaking would work for
various programming languages. I must say that I find that the least
important use case. People writing programs are very often quite able to
search everything and nothing and find what they want. It is of course still
a case we should consider, but I consider natural languages more important
than their programmatic cousins.

The Swedish example brought up about "öst" vs "ost" is a good one. It
demonstrates the need for language-specific transliteration, and I expect
the same to apply to word breaking. But don't we already have
language-sensitive stemming? Maybe only for French and English, but others
could be added, no?
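
The "öst"/"ost" point can be sketched as accent folding with a per-language
list of letters that must not be folded. This is a Python illustration
only; the `PROTECTED` table, the language codes and the function name
`fold` are assumptions of the sketch, not an existing Tracker interface,
and the letter lists are not meant to be complete.

```python
import unicodedata

# Assumption: letters that are distinct in a language keep their marks.
PROTECTED = {
    "sv": set("åäö"),  # Swedish: "öst" and "ost" are different words
    "da": set("æøå"),
    "fr": set(),       # French: accents are folded away in this sketch
}


def fold(text, lang):
    """Lower-case text and strip diacritics, except on protected letters."""
    keep = PROTECTED.get(lang, set())
    # Compose first so that "o" + combining diaeresis is seen as one "ö".
    composed = unicodedata.normalize("NFC", text.lower())
    out = []
    for ch in composed:
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)


print(fold("öst", "sv"), fold("öst", "fr"), fold("élève", "fr"))
```

With a table like this, a Swedish query for "ost" would not match "öst",
while a French query for "eleve" would still find "élève".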

Cheers,
Mikkel
_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
