Re: Writing a stemmer

2004-06-06 Thread Vladimir Yuryev
On Sat, 05 Jun 2004 21:15:23 +0200
 Andrzej Bialecki [EMAIL PROTECTED] wrote:
Vladimir Yuryev wrote:
Hi, Andrzej!
How did you test the Polish texts, and with which stemmer?
Thanks,
Vladimir.
No reason to be too modest, Leo.. I tested your stemmer on English, 
Swedish and Polish texts (including F-measure vs. training set size 
plots), and it works exceptionally well indeed. Highly recommended!
Well, I have several corpora of Polish language, which together 
amount to roughly 90,000 words (nouns and verbs) having at least 4 
inflected forms. This set is randomized (i.e. lines of words + forms 
are in random order). I've split this into two parts - one of a fixed 
size, as a test set, and one of variable size as a training set. Then 
I compile stemmer tables using a variable number of training examples, 
and using different settings (trie, multi-trie, different 
optimizations, etc.). Then for each output table I test the 
precision/recall of correct base forms (lemmatization), and of 
ability to create unique stems (stemming). Finally, I select the 
best table, which gives reasonably good results vs. table size. To 
put it in plain terms, e.g. for tables roughly 300kB in size (created 
from training set of 3000 unique words + their forms) in best cases I 
get ~90% of correct stems, and ~70% of correct lemmas. Which is a 
_very_ good result!

--
Best regards,
Andrzej Bialecki
Thanks for the detailed description of the test of the Polish texts. 
It was very important for me.
Vladimir.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Writing a stemmer

2004-06-05 Thread Vladimir Yuryev
Hi, Andrzej!
How did you test the Polish texts, and with which stemmer?
Thanks,
Vladimir.
No reason to be too modest, Leo.. I tested your stemmer on English, 
Swedish and Polish texts (including F-measure vs. training set size 
plots), and it works exceptionally well indeed. Highly recommended!

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: Writing a stemmer

2004-06-05 Thread Andrzej Bialecki
Vladimir Yuryev wrote:
Hi, Andrzej!
How did you test the Polish texts, and with which stemmer?
Thanks,
Vladimir.
No reason to be too modest, Leo.. I tested your stemmer on English, 
Swedish and Polish texts (including F-measure vs. training set size 
plots), and it works exceptionally well indeed. Highly recommended!
Well, I have several corpora of Polish language, which together amount 
to roughly 90,000 words (nouns and verbs) having at least 4 inflected 
forms. This set is randomized (i.e. lines of words + forms are in random 
order). I've split this into two parts - one of a fixed size, as a test 
set, and one of variable size as a training set. Then I compile stemmer 
tables using a variable number of training examples, and using different 
settings (trie, multi-trie, different optimizations, etc.). Then for 
each output table I test the precision/recall of correct base forms 
(lemmatization), and of ability to create unique stems (stemming). 
Finally, I select the best table, which gives reasonably good results 
vs. table size. To put it in plain terms, e.g. for tables roughly 300kB 
in size (created from training set of 3000 unique words + their forms) 
in best cases I get ~90% of correct stems, and ~70% of correct lemmas. 
Which is a _very_ good result!
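
The evaluation procedure described above could be sketched roughly as follows. This is only an illustration in Python, not the code actually used for these tests: the names `split_corpus`, `evaluate` and `strip_s`, the toy English data, and the placeholder stemmer are all assumptions, and plain accuracy stands in for the precision/recall figures mentioned above.

```python
import random


def split_corpus(entries, test_size, train_size, seed=0):
    """Shuffle once, hold out a fixed-size test set, and take a
    training set of the requested size from the remainder."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    train = shuffled[test_size:test_size + train_size]
    return train, test


def evaluate(stem_fn, test_entries):
    """Return (lemma_accuracy, stem_consistency) for a stemming function.

    lemma_accuracy:   share of inflected forms mapped exactly to the base form.
    stem_consistency: share of words whose forms all map to one common stem.
    """
    forms_total = forms_correct = groups_consistent = 0
    for lemma, forms in test_entries:
        outputs = [stem_fn(form) for form in forms]
        forms_total += len(forms)
        forms_correct += sum(1 for out in outputs if out == lemma)
        if len(set(outputs)) == 1:
            groups_consistent += 1
    return forms_correct / forms_total, groups_consistent / len(test_entries)


def strip_s(word):
    """Placeholder stemmer (assumption): drop a single trailing 's'."""
    return word[:-1] if word.endswith("s") else word


# Toy corpus (assumption): each entry is (base form, list of inflected forms).
entries = [
    ("walk", ["walk", "walks", "walked", "walking"]),
    ("jump", ["jump", "jumps", "jumped", "jumping"]),
    ("talk", ["talk", "talks", "talked", "talking"]),
    ("play", ["play", "plays", "played", "playing"]),
]

train, test = split_corpus(entries, test_size=2, train_size=2, seed=1)
lemma_accuracy, stem_consistency = evaluate(strip_s, test)
print(lemma_accuracy, stem_consistency)
```

To reproduce the "quality vs. training set size" comparison, one would call `split_corpus` with the same seed and `test_size` but increasing `train_size`, train a table on each training set, and record both scores next to the resulting table size.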

--
Best regards,
Andrzej Bialecki


Re: Writing a stemmer

2004-06-04 Thread Andrzej Bialecki
Leo Galambos wrote:
Erik Hatcher [EMAIL PROTECTED] wrote:
__

How proficient must I be in a language for which I wish to write the 
stemmer?
I would venture to say you would need to be an expert in a language to 
write a decent stemmer.

I'm sorry for a self-promo ;), but
the stemmer of egothor project can be
adapted to any language, and you needn't be
a language expert. Moreover, the stemmer
achieves better F-measure than Porter's stemmers.
No reason to be too modest, Leo.. I tested your stemmer on English, 
Swedish and Polish texts (including F-measure vs. training set size 
plots), and it works exceptionally well indeed. Highly recommended!

--
Best regards,
Andrzej Bialecki


RE: Writing a stemmer

2004-06-04 Thread Musku, Anil (LA)
Leo

Thanks for your reply. I have taken a look at egothor.org. It does appear to
be pretty simple. However, I need to use Lucene as my search engine.

From what I understand, it appears that I need to be pretty conversant (if
not an expert) with a language for which I wish to write a stemmer. Moreover,
this stemmer can be used with the egothor search engine only? Can I use this
stemmer with Lucene? If yes, how?

Regards,
Anil

-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 03, 2004 8:54 PM
To: Lucene Users List
Subject: Re: Writing a stemmer

Erik Hatcher [EMAIL PROTECTED] wrote:
__

 How proficient must I be in a language for which I wish to write the 
 stemmer?
I would venture to say you would need to be an expert in a language to 
write a decent stemmer.

I'm sorry for a self-promo ;), but
the stemmer of egothor project can be
adapted to any language, and you needn't be
a language expert. Moreover, the stemmer
achieves better F-measure than Porter's stemmers.

Cheers,
Leo






Writing a stemmer

2004-06-03 Thread Musku, Anil (LA)

Hi,

Can anyone provide some help on writing a stemmer for non-English languages?
How proficient must I be in a language for which I wish to write the stemmer?

Regards,
Anil




Re: Writing a stemmer

2004-06-03 Thread Grant Ingersoll
Anil,

I suppose it depends on how complex the language is and what is acceptable for your 
program.  I have written a couple of stemmers that are fairly straightforward, based on 
papers that I have read, and they work well for the languages we are using.  Your best bet is 
probably to do a literature search for the languages you are interested in and go from 
there.  
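
To give a feel for what a "fairly straightforward" rule-based stemmer can look like, here is a minimal suffix-stripping sketch in Python. It is not taken from any of the stemmers mentioned in this thread: the suffix list, the `MIN_STEM_LENGTH` threshold and the function name `strip_suffix` are hypothetical, and a real stemmer would take its rules from a published description of the target language.

```python
# Hypothetical suffix list for illustration only (English-like endings).
SUFFIXES = ("ingly", "edly", "ing", "ed", "es", "s")

# Never strip so much that fewer than this many characters remain.
MIN_STEM_LENGTH = 3


def strip_suffix(word):
    """Remove the longest matching suffix, keeping at least
    MIN_STEM_LENGTH characters; return the word unchanged otherwise."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word


for w in ("walking", "walked", "walks", "Boxes", "sing", "bus"):
    print(w, "->", strip_suffix(w))
```

The minimum-length check is the design choice that matters most here: without it, short words such as "sing" or "bus" would be cut down to fragments. How many such guards a language needs is exactly where the language knowledge comes in.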

I am, of course, assuming stemmers for your languages don't already exist.  If your 
languages are common, there probably is a stemmer available in some form that you can 
use or adapt. You'd be surprised at what you get by doing a simple Google search for 
lang X stemmer (no quotes), where lang X is the language you are interested in.

Hooking them into Lucene is straightforward and there are several examples of this 
available in the docs and code.

-Grant

 [EMAIL PROTECTED] 06/03/04 04:09PM 

Hi,

Can anyone provide some help on writing a stemmer for non-English languages?
How proficient must I be in a language for which I wish to write the stemmer?

Regards,
Anil




Re: Writing a stemmer

2004-06-03 Thread Erik Hatcher
On Jun 3, 2004, at 4:09 PM, Musku, Anil (LA) wrote:
Can anyone provide some help on writing a stemmer for non-English 
languages?
Have a look at the Snowball project in the Lucene sandbox.  If it's 
non-European languages you are after, I suspect it's quite complex.  It's 
highly language-dependent.

How proficient must I be in a language for which I wish to write the 
stemmer?
I would venture to say you would need to be an expert in a language to 
write a decent stemmer.  The SnowballAnalyzer is quite hairy 
underneath, that's for sure.

Erik


Re: Writing a stemmer

2004-06-03 Thread Leo Galambos
Erik Hatcher [EMAIL PROTECTED] wrote:
__

 How proficient must I be in a language for which I wish to write the 
 stemmer?
I would venture to say you would need to be an expert in a language to 
write a decent stemmer.

I'm sorry for a self-promo ;), but
the stemmer of egothor project can be
adapted to any language, and you needn't be
a language expert. Moreover, the stemmer
achieves better F-measure than Porter's stemmers.

Cheers,
Leo
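
The general idea of a stemmer that is trained from examples rather than hand-written by a language expert can be sketched as below. This is a deliberately simplified stand-in and should not be read as the egothor algorithm: the functions `learn_rules` and `apply_rules`, the ending-rewrite rule format, and the small Polish word list are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def learn_rules(pairs):
    """From (inflected form, base form) pairs, map each observed
    form ending to its most frequently seen replacement."""
    votes = defaultdict(Counter)
    for form, lemma in pairs:
        k = common_prefix_len(form, lemma)
        votes[form[k:]][lemma[k:]] += 1
    return {old: counts.most_common(1)[0][0] for old, counts in votes.items()}


def apply_rules(rules, word):
    """Rewrite the longest ending that has a learned rule;
    return the word unchanged if no ending matches."""
    for start in range(len(word) + 1):
        ending = word[start:]
        if ending in rules:
            return word[:start] + rules[ending]
    return word


# Toy training data (assumption): a few Polish noun forms with base forms.
pairs = [
    ("kota", "kot"), ("kotem", "kot"), ("koty", "kot"),
    ("domem", "dom"), ("domy", "dom"),
]
rules = learn_rules(pairs)

# Words not seen in training are handled by the learned endings.
for w in ("lasem", "nosy", "okno"):
    print(w, "->", apply_rules(rules, w))
```

Nothing in the code knows anything about Polish; all language knowledge comes from the training pairs, which is why the size and quality of the training set dominate the results.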


