Mikkel Kamstrup Erlandsen wrote:
> 2006/11/15, Eyal Oren <[EMAIL PROTECTED]>:
> 
>     On 11/15/06/11/06 22:02 +0100, Laurent Aguerreche wrote:
> 
>      >I have begun to search algorithms and I found:
>      >
>      >* N-grams
>      >  http://en.wikipedia.org/wiki/N-gram
>      >* levenshtein
>      >  http://www.php.net/manual/en/function.levenshtein.php
>      >* similar text
>      >   http://www.php.net/manual/en/function.similar-text.php
>      >* soundex
>      >  http://www.php.net/manual/en/function.soundex.php
>     soundex allows you to find terms that *sound* similar to an indexed
>     term, so that might actually solve the French/Swedish/Danish
>     transliteration problem.
> 
>     I'll ask a computational linguist colleague tomorrow, maybe he has some
>     ideas.
> 
>     I do see one problem: in one context (programming code) people
>     seem to prefer exact matches, without stemming or similarity
>     matching, while in other contexts (words in text, file names)
>     people do want stemming and some form of similarity search on the
>     orthography (spelling). There is probably no single solution that
>     fits both uses, but a similarity-based search would probably be
>     fine for source code as well.
> 
> 
> I see there has been a lot of focus on how word breaking would work 
> for various programming languages. I must say that I find that the least 
> important use case. People who write programs are usually quite able to 
> search for just about anything and find what they want. It is of course 
> still a case we should consider, but I do consider natural languages 
> more important than their programmatic cousins.
> 
> The Swedish example brought up about "öst" vs "ost" is a good one. It 
> demonstrates the need for language-specific transliteration, and I 
> expect the same to apply to word breaking. But don't we already have 
> language-sensitive stemming? Maybe only for French and English, but others 
> could be added, no?

we already have this - we have both stemmers and stopword lists for:

French, German, Danish, Spanish, Finnish, Norwegian, Italian, Dutch, 
Portuguese, Russian and Swedish

for stopwords see http://cvs.gnome.org/viewcvs/tracker/data/languages/

and for stemmers see 
http://cvs.gnome.org/viewcvs/tracker/src/libstemmer/src_c/
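
To illustrate the "öst" vs "ost" point quoted above, here is a rough Python
sketch (not tracker code). It assumes a hypothetical per-language table,
KEEP_LETTERS, of letters that must not be folded to their base letter; the
table contents and the function names are made up for this example. Blind
accent stripping turns the Swedish "öst" (east) into "ost" (cheese), while a
language-aware fold keeps them apart, and a plain Levenshtein distance still
reports them as one edit away for similarity search.

```python
import unicodedata


def strip_accents(word):
    """Language-blind folding: drop every combining mark after NFD."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical per-language rules: letters that count as distinct letters
# in that language and so must survive folding. Illustrative only.
KEEP_LETTERS = {
    "sv": set("åäö"),
    "da": set("æøå"),
}


def fold(word, lang):
    """Lowercase and strip accents, except letters the language keeps."""
    keep = KEEP_LETTERS.get(lang, set())
    return "".join(
        ch if ch in keep else strip_accents(ch) for ch in word.lower()
    )


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(strip_accents("öst"))        # blind fold collides with "ost"
    print(fold("öst", "sv"))           # Swedish rule keeps "ö"
    print(levenshtein("öst", "ost"))   # still close for fuzzy matching
```

The design choice here is to fold first and compare second: an index could
store fold(word, lang) for exact lookups and fall back to an edit-distance
threshold only when the user asks for fuzzy matches.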


-- 
Mr Jamie McCracken
http://jamiemcc.livejournal.com/

_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
