Re: Writing a stemmer
On Sat, 05 Jun 2004 21:15:23 +0200 Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Vladimir Yuryev wrote:
>> Hi, Andrzej!
>> How did you test the Polish texts, and with which stemmer?
>> Thanks,
>> Vladimir.
>>
>>> No reason to be too modest, Leo.. I tested your stemmer on English,
>>> Swedish and Polish texts (including F-measure vs. training set size
>>> plots), and it works exceptionally well indeed. Highly recommended!
>
> Well, I have several corpora of the Polish language, which together
> amount to roughly 90,000 words (nouns and verbs) having at least 4
> inflected forms. This set is randomized (i.e. lines of words + forms
> are in random order).
>
> I've split this into two parts - one of a fixed size, as a test set,
> and one of variable size, as a training set. Then I compile stemmer
> tables using a variable number of training examples, and using
> different settings (trie, multi-trie, different optimizations, etc.).
>
> Then for each output table I test the precision/recall of correct base
> forms (lemmatization), and of the ability to create unique stems
> (stemming). Finally, I select the best table, which gives reasonably
> good results vs. table size.
>
> To put it in plain terms, e.g. for tables roughly 300kB in size
> (created from a training set of 3000 unique words + their forms), in
> the best cases I get ~90% correct stems and ~70% correct lemmas. Which
> is a _very_ good result!
>
> --
> Best regards,
> Andrzej Bialecki

Thanks for the detailed description of how you tested the Polish texts. It was very important for me.

Vladimir.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Writing a stemmer
Hi, Andrzej!

How did you test the Polish texts, and with which stemmer?

Thanks,
Vladimir.

> No reason to be too modest, Leo.. I tested your stemmer on English,
> Swedish and Polish texts (including F-measure vs. training set size
> plots), and it works exceptionally well indeed. Highly recommended!
>
> --
> Best regards,
> Andrzej Bialecki
>
> Software Architect, System Integration Specialist
> CEN/ISSS EC Workshop, ECIMF project chair
> EU FP6 E-Commerce Expert/Evaluator
> FreeBSD developer (http://www.freebsd.org)
Re: Writing a stemmer
Vladimir Yuryev wrote:
> Hi, Andrzej!
> How did you test the Polish texts, and with which stemmer?
> Thanks,
> Vladimir.
>
>> No reason to be too modest, Leo.. I tested your stemmer on English,
>> Swedish and Polish texts (including F-measure vs. training set size
>> plots), and it works exceptionally well indeed. Highly recommended!

Well, I have several corpora of the Polish language, which together amount to roughly 90,000 words (nouns and verbs) having at least 4 inflected forms. This set is randomized (i.e. lines of words + forms are in random order).

I've split this into two parts - one of a fixed size, as a test set, and one of variable size, as a training set. Then I compile stemmer tables using a variable number of training examples, and using different settings (trie, multi-trie, different optimizations, etc.).

Then for each output table I test the precision/recall of correct base forms (lemmatization), and of the ability to create unique stems (stemming). Finally, I select the best table, which gives reasonably good results vs. table size.

To put it in plain terms, e.g. for tables roughly 300kB in size (created from a training set of 3000 unique words + their forms), in the best cases I get ~90% correct stems and ~70% correct lemmas. Which is a _very_ good result!

--
Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer (http://www.freebsd.org)
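The procedure described above (shuffle, hold out a fixed test set, train on growing subsets, score each resulting table) can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy suffix-rule learner stands in for whatever the real stemmer compiles into its tables and is not the egothor algorithm, all function names are hypothetical, and plain accuracy fractions are used in place of the precision/recall figures mentioned in the message.

```python
import random
from collections import Counter, defaultdict

# Each entry is (lemma, [inflected forms]). All names here are hypothetical.

def split_rule(form, lemma):
    """Drop the common prefix; return (ending to strip, ending to add)."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]

def train(entries):
    """Toy learner: for each form ending, keep its most frequent rewrite."""
    counts = defaultdict(Counter)
    for lemma, forms in entries:
        for form in forms:
            strip, add = split_rule(form, lemma)
            counts[strip][add] += 1
    return {strip: adds.most_common(1)[0][0] for strip, adds in counts.items()}

def stem(table, word):
    """Apply the longest known ending; leave the word alone if none matches."""
    for start in range(len(word) + 1):
        ending = word[start:]
        if ending in table:
            return word[:start] + table[ending]
    return word

def evaluate(table, entries):
    """Return (fraction of forms mapped to the correct lemma,
    fraction of entries whose forms all map to one and the same stem)."""
    correct = total = consistent = 0
    for lemma, forms in entries:
        outputs = [stem(table, form) for form in forms]
        correct += sum(out == lemma for out in outputs)
        total += len(forms)
        consistent += len(set(outputs)) == 1
    return correct / total, consistent / len(entries)

def learning_curve(entries, test_size, train_sizes, seed=0):
    """Shuffle once, hold out a fixed test set, train on growing subsets."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    test, pool = shuffled[:test_size], shuffled[test_size:]
    rows = []
    for n in train_sizes:
        table = train(pool[:n])
        lemma_acc, stem_consistency = evaluate(table, test)
        # len(table) is a crude stand-in for "table size".
        rows.append((n, len(table), lemma_acc, stem_consistency))
    return rows
```

Calling `learning_curve(entries, test_size, train_sizes)` on a word list gives one row per training size, which is enough to plot quality against training set size or table size and pick a table that is good enough for its size.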
Re: Writing a stemmer
Leo Galambos wrote:
> Erik Hatcher [EMAIL PROTECTED] wrote:
>>> How proficient must I be in a language for which I wish to write
>>> the stemmer?
>>
>> I would venture to say you would need to be an expert in a language
>> to write a decent stemmer.
>
> I'm sorry for a self-promo ;), but the stemmer of the egothor project
> can be adapted to any language, and you needn't be a language expert.
> Moreover, the stemmer achieves better F-measure than Porter's
> stemmers.

No reason to be too modest, Leo.. I tested your stemmer on English, Swedish and Polish texts (including F-measure vs. training set size plots), and it works exceptionally well indeed. Highly recommended!

--
Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer (http://www.freebsd.org)
RE: Writing a stemmer
Leo,

Thanks for your reply. I have taken a look at egothor.org. It does appear to be pretty simple. However, I need to use Lucene as my search engine. From what I understand, it appears that I need to be pretty conversant (if not an expert) with a language for which I wish to write a stemmer.

Moreover, can this stemmer be used with the egothor search engine only? Can I use this stemmer with Lucene? If yes, how?

Regards,
Anil

-----Original Message-----
From: Leo Galambos [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 03, 2004 8:54 PM
To: Lucene Users List
Subject: Re: Writing a stemmer

Erik Hatcher [EMAIL PROTECTED] wrote:
>> How proficient must I be in a language for which I wish to write the
>> stemmer?
>
> I would venture to say you would need to be an expert in a language to
> write a decent stemmer.

I'm sorry for a self-promo ;), but the stemmer of the egothor project can be adapted to any language, and you needn't be a language expert. Moreover, the stemmer achieves better F-measure than Porter's stemmers.

Cheers,
Leo
Writing a stemmer
Hi,

Can anyone provide some help on writing a stemmer for non-English languages? How proficient must I be in a language for which I wish to write the stemmer?

Regards,
Anil
Re: Writing a stemmer
Anil,

I suppose it depends on how complex the language is and what is acceptable for your program. I have written a couple of stemmers that are fairly straightforward, based on papers that I have read, and they work well for the langs. we are using. Your best bet is probably to do a literature search for the languages you are interested in and go from there.

I am, of course, assuming stemmers for your languages don't already exist. If your languages are common, there probably is a stemmer available in some form that you can use or adapt. You'd be surprised at what you get by doing a simple Google search for lang X stemmer, where lang X is the language you are interested in, and no quotes.

Hooking them into Lucene is straightforward, and there are several examples of this available in the docs and code.

-Grant

[EMAIL PROTECTED] 06/03/04 04:09PM
> Hi,
> Can anyone provide some help on writing a stemmer for non-English
> languages? How proficient must I be in a language for which I wish to
> write the stemmer?
> Regards,
> Anil
Re: Writing a stemmer
On Jun 3, 2004, at 4:09 PM, Musku, Anil (LA) wrote:
> Can anyone provide some help on writing a stemmer for non-English
> languages?

Have a look at the snowball project in the Lucene sandbox. If it's for non-European-based languages, I suspect it's quite complex. It's highly language dependent.

> How proficient must I be in a language for which I wish to write the
> stemmer?

I would venture to say you would need to be an expert in a language to write a decent stemmer. The SnowballAnalyzer is quite hairy underneath, that's for sure.

Erik
Re: Writing a stemmer
Erik Hatcher [EMAIL PROTECTED] wrote:
>> How proficient must I be in a language for which I wish to write the
>> stemmer?
>
> I would venture to say you would need to be an expert in a language to
> write a decent stemmer.

I'm sorry for a self-promo ;), but the stemmer of the egothor project can be adapted to any language, and you needn't be a language expert. Moreover, the stemmer achieves better F-measure than Porter's stemmers.

Cheers,
Leo