Re: Arabic analyzer

2004-10-19 Thread Pierrick Brihaye
Hi,
Scott Smith wrote:
> Is anyone aware of an open source (non-GPL; i.e., free for commercial
> use) Arabic analyzer for Lucene?

Unfortunately (for you), my Arabic Analyzer for Java
(http://savannah.nongnu.org/projects/aramorph) is GPL-ed.

> Does Arabic really require a stemmer
> as well (some of the reading I've seen on the web would suggest that a
> stemmer is almost a necessity with Arabic to get anything useful where
> it is not with other languages).

IMHO, stemming *is* a necessity in Arabic, since the language involves
prefixing, suffixing and infixing, as well as a few, yet very frequent,
written word aggregations.

Good luck,
--
Pierrick Brihaye
mailto:[EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Arabic analyzer

2004-10-07 Thread Dawid Weiss

> nothing to do with each other. Furthermore, Arabic uses phonetic
> indicators on each letter, called diacritics, that change the way you
> pronounce the word, which in turn changes the word's meaning, so two
> words spelled exactly the same way with different diacritics will mean
> two separate things,

Just to point out the fact: most Slavic languages also use diacritic
marks (above, like 'acute' or 'dot' marks, or below, like the Polish
'ogonek' mark). Some people argue that they can be stripped off the text
upon indexing and that the queries usually disambiguate the context of
the word.
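
To make the "strip them upon indexing" idea concrete, here is a minimal
Python sketch. It is only an illustration, not code from any analyzer
mentioned in this thread. The function names are made up, and the choice
of which Arabic marks to drop (U+064B through U+0652, plus U+0670) is an
assumption.

```python
import unicodedata

# Arabic vowel marks: U+064B (fathatan) through U+0652 (sukun), plus
# U+0670 (superscript alef). Which marks to drop is an assumption here.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_arabic_diacritics(text: str) -> str:
    """Drop Arabic vowel marks, leaving the base letters untouched."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def strip_latin_diacritics(text: str) -> str:
    """Drop combining marks such as the acute accent or the Polish ogonek."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


# "kataba" (he wrote) and "kutub" (books) differ only in their vowel marks.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
kutub = "\u0643\u064F\u062A\u064F\u0628"
print(strip_arabic_diacritics(kataba) == strip_arabic_diacritics(kutub))  # True
print(strip_latin_diacritics("g\u0119\u015b"))  # ges
```

After stripping, the two Arabic words index to the same term. That is
exactly the ambiguity described in the quoted text, and the argument for
stripping anyway is that the rest of the query usually resolves it.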

It is just a digression. Now back to the Arabic stemmer -- there has to
be a way of doing it. I know Vivisimo has clustering options for Arabic.
They must be using a stemmer (and an English translation dictionary),
although it might be a commercial one. Take a look:

http://vivisimo.com/search?v:file=cnnarabic
D.



Re: Arabic analyzer

2004-10-07 Thread Nader Henein
There is a way of writing an Arabic stemmer; it's just not a weekend
project. I've seen the translate/stem option as well, and even tried it
with Lucene. We've implemented Lucene on our database, which holds about
a million records with 19 indexed fields each (some of which are CLOBs).
The free-text fields in each record are in many cases Arabic. We do not
provide stemming on those, simply because I couldn't find a stemming or
translation option that held up to proper testing. Some were OK, but
after collecting data from user searches (averaging out at 5 searches
per second), it was clear the Arabic stemming options would not be able
to meet user expectations, which is what it comes down to. Sometimes
theory does not translate well to practice.

Nader Henein


Re: Arabic analyzer

2004-10-07 Thread Andrzej Bialecki
Dawid Weiss wrote:
> Just to point out the fact: most Slavic languages also use diacritic
> marks (above, like 'acute' or 'dot' marks, or below, like the Polish
> 'ogonek' mark). Some people argue that they can be stripped off the text
> upon indexing and that the queries usually disambiguate the context of
> the word.

Hmm. This brings up a question: the algorithmic stemmer package from
Egothor works quite well for Polish (http://www.getopt.org/stempel);
wouldn't it work well for Arabic, too?

I lack the necessary expertise to evaluate results (knowing only two or
three Arabic words ;-) ), but I can certainly help someone to get
started with testing...

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: Arabic analyzer

2004-10-07 Thread Nader Henein
I'd be happy to help anyone test this out; my Arabic is pretty good.
Nader


Re: Arabic analyzer

2004-10-07 Thread Grant Ingersoll
Someone posted an Arabic analyzer about a year ago; however, I don't
think the licensing was very friendly, and we no longer use it.

We have a cross-language system that works with Arabic (among other
languages). We have written several stemmers based on the literature
that perform pretty well and were not too difficult to implement (but
are not available as open source at this point). Light stemming seems
to work much better in IR applications than aggressive stemming, due to
the problems with roots discussed earlier.
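
For anyone unfamiliar with the term, "light stemming" means stripping a
small set of common prefixes and suffixes instead of reducing a word to
its root. The Python sketch below is only an illustration of that idea
and is not our implementation. The affix lists, the minimum stem length
and the function name are assumptions chosen for the example.

```python
# Illustrative affix lists (assumptions, not taken from any real stemmer),
# ordered longest first so that "wal-" is tried before "wa-".
PREFIXES = (
    "\u0648\u0627\u0644",  # wal-  "and the"
    "\u0628\u0627\u0644",  # bil-  "with the"
    "\u0643\u0627\u0644",  # kal-  "like the"
    "\u0641\u0627\u0644",  # fal-  "so the"
    "\u0644\u0644",        # lil-  "for the"
    "\u0627\u0644",        # al-   "the"
    "\u0648",              # wa-   "and"
)
SUFFIXES = (
    "\u0647\u0627",  # -ha
    "\u0627\u0646",  # -an
    "\u0627\u062A",  # -at
    "\u0648\u0646",  # -un
    "\u064A\u0646",  # -in
    "\u064A\u0629",  # -iyya
    "\u0647",        # -h
    "\u0629",        # ta marbuta
    "\u064A",        # -i
)
MIN_STEM = 3  # never cut a word below this many letters (assumed value)


def light_stem(word: str) -> str:
    """Strip at most one known prefix and one known suffix from word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word


kitab = "\u0643\u062A\u0627\u0628"                        # "book"
print(light_stem("\u0627\u0644" + kitab) == kitab)        # al-kitab -> True
print(light_stem("\u0648\u0627\u0644" + kitab) == kitab)  # wal-kitab -> True
print(light_stem(kitab + "\u0647\u0627") == kitab)        # kitabuha -> True
```

Note that a sketch like this leaves the broken plural "kutub" (books)
untouched, so it never conflates it with "kitab". Affix stripping handles
prefixing and suffixing only, not infixing.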

-Grant

--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
http://www.cnlp.org 





Arabic analyzer

2004-10-06 Thread Scott Smith
Is anyone aware of an open source (non-GPL; i.e., free for commercial
use) Arabic analyzer for Lucene?  Does Arabic really require a stemmer
as well? Some of the reading I've seen on the web would suggest that a
stemmer is almost a necessity with Arabic to get anything useful,
whereas it is not with other languages.

 

Scott