Re: Aramorph Analyzer

2004-12-20 Thread Pierrick Brihaye
Hi,
Sorry, I (the aramorph maintainer ;-) was absent from the office...
Daniel Naber wrote:
> Analyzers that provide ambiguous terms (i.e. a token with more than one term
> at the same position) don't work in Lucene 1.4.
That is the correct answer. I filed a bug about this:
http://issues.apache.org/bugzilla/show_bug.cgi?id=23307

> This feature has only
> recently been added to CVS.
... and I thank you very much for this commit.
Note, however, that you may experience some problems with the query
parser because Buckwalter's Arabic transliteration uses the standard *
wildcard character as a representation for dhal.
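
One possible way around the wildcard clash, sketched here in Java, is to backslash-escape Buckwalter text before it is handed to the query parser as raw query text. This is only an illustration: the class and method names are hypothetical (not part of Lucene or aramorph), and the set of characters treated as query syntax is an assumption that should be checked against the query parser version in use.

```java
// Hypothetical helper; not part of Lucene or aramorph.
final class BuckwalterQueryEscaper {
    // Assumption: these characters are treated as syntax by the query parser.
    // Only '*' is named in the thread; the rest is an assumed, typical set.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?";

    // Prefixes every assumed-special character with a backslash.
    static String escape(String term) {
        StringBuilder out = new StringBuilder(term.length() + 4);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "h*A" is an illustrative transliterated string containing dhal (*).
        System.out.println(escape("h*A"));
        // '~' appears in the stems quoted elsewhere in this thread.
        System.out.println(escape("munaZ~im"));
    }
}
```

Whether escaping is needed at all depends on where the transliteration happens: if the analyzer transliterates after the query string has been parsed, only text that is already in Buckwalter form when it reaches the parser is affected.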

Note also that aramorph has a mailing list for such questions:
http://lists.nongnu.org/mailman/listinfo/aramorph-users
Cheers,
--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
+33 (0)2 99 29 67 78
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Aramorph Analyzer

2004-12-16 Thread Safarnejad, Ali (AFIS)
I wanted to share some results from trying out the Aramorph Arabic analyzer with
Lucene.  I experimented with a set of 100 web documents in Windows-1256
encoding.  Indexing took just over 200 seconds, although I had to
increase the heap size to 500 MB; otherwise I got OutOfMemory exceptions
halfway through the documents.  The 200 seconds include the time to open the URL
connection and tidy the documents to extract the text.

Has anyone done similar experiments with a larger set of Arabic documents?
I'm interested in hearing from anyone else who has used Aramorph with Lucene.

Thanks,
Ali




RE: Aramorph Analyzer

2004-12-16 Thread Safarnejad, Ali (AFIS)
Actually, one thing worth mentioning about the search is that when searching for
whole phrases, if there are any ambiguous words in the phrase, then the search
fails to find the document, even if the phrase was copied and pasted from the
original document.
So for example, I have a document containing this phrase:  

The first two words each have only one stem, but the last word has two stems:
munaZ~im and munaZ~am.
So the entire search query becomes: Aljh___zp riyAsiy~ munaZ~im munaZ~am
which fails to find any matching documents,
whereas a search for Aljh___zp riyAsiy~ succeeds.
Even placing the accent over the ZAH () does not disambiguate the search.
Has anyone found a workaround for this?

ali





Re: Aramorph Analyzer

2004-12-16 Thread Andrzej Bialecki
Safarnejad, Ali (AFIS) wrote:
> Actually, one thing worth mentioning about the search is that when searching for
> whole phrases, if there are any ambiguous words in the phrase, then the search
> fails to find the document, even if the phrase was copied and pasted from the
> original document.
> So for example, I have a document containing this phrase:
>
> The first two words each have only one stem, but the last word has two stems:
> munaZ~im and munaZ~am.
> So the entire search query becomes: Aljh___zp riyAsiy~ munaZ~im munaZ~am
> which fails to find any matching documents,
> whereas a search for Aljh___zp riyAsiy~ succeeds.
> Even placing the accent over the ZAH () does not disambiguate the search.
> Has anyone found a workaround for this?
Although my knowledge of Arabic is equal to zero, I suggest that you
check what your query looks like after it is parsed
(Query.toString()), and then compare it to the terms that are actually
stored in the index. There is a chance that you are, e.g., applying the stemmer
twice by using an incorrect analyzer, or not adding the stemmed terms to the
index, or similar. I suggest using Luke (http://www.getopt.org/luke) to
diagnose your problem; in the Search tab you can also view the final
query terms.

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Aramorph Analyzer

2004-12-16 Thread Safarnejad, Ali (AFIS)
The search query, as I mentioned in my previous email, looks like this:
Aljh___zp riyAsiy~ munaZ~im munaZ~am
In fact, all the individual words are in the index; however, the complete
phrase, in double quotes, does not match.  Neither does any other phrase
that contains ambiguous stems. And that's the problem.






Re: Aramorph Analyzer

2004-12-16 Thread Daniel Naber
On Thursday 16 December 2004 11:59, Safarnejad, Ali (AFIS) wrote:

> Actually, one thing worth mentioning about the search is that when searching
> for whole phrases, if there are any ambiguous words in the phrase, then the
> search fails to find the document, even if the phrase was copied and pasted
> from the original document.

Analyzers that provide ambiguous terms (i.e. a token with more than one term 
at the same position) don't work in Lucene 1.4. This feature has only 
recently been added to CVS. The workaround would be to backport that change.
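
To illustrate why such a phrase search can fail, here is a toy single-document positional index in Java. It is not Lucene code and all names are hypothetical. It assumes the analyzer stores both stems of the ambiguous word at the same position, while the failing phrase query places every term at its own consecutive position; "w1" and "w2" are placeholders for the two unambiguous words.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only; NOT Lucene code. All names are hypothetical.
final class AmbiguousPhraseDemo {
    // term -> token positions at which the term occurs in the single document
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Each inner list holds every stem emitted for one token position.
    void index(List<List<String>> slots) {
        for (int pos = 0; pos < slots.size(); pos++) {
            for (String stem : slots.get(pos)) {
                postings.computeIfAbsent(stem, k -> new HashSet<>()).add(pos);
            }
        }
    }

    // Union of the positions of all alternatives in one query slot.
    private Set<Integer> positionsOfAny(List<String> alternatives) {
        Set<Integer> result = new HashSet<>();
        for (String term : alternatives) {
            result.addAll(postings.getOrDefault(term, Collections.<Integer>emptySet()));
        }
        return result;
    }

    // True if, for some start position, slot i of the query matches at start + i.
    boolean matchesPhrase(List<List<String>> querySlots) {
        if (querySlots.isEmpty()) {
            return false;
        }
        for (int start : positionsOfAny(querySlots.get(0))) {
            boolean ok = true;
            for (int i = 1; i < querySlots.size() && ok; i++) {
                ok = positionsOfAny(querySlots.get(i)).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AmbiguousPhraseDemo doc = new AmbiguousPhraseDemo();
        doc.index(Arrays.asList(
                Arrays.asList("w1"),
                Arrays.asList("w2"),
                Arrays.asList("munaZ~im", "munaZ~am")));

        // Flat query: both stems treated as consecutive phrase terms.
        System.out.println(doc.matchesPhrase(Arrays.asList(
                Arrays.asList("w1"), Arrays.asList("w2"),
                Arrays.asList("munaZ~im"), Arrays.asList("munaZ~am"))));

        // Position-aware query: both stems are alternatives at one position.
        System.out.println(doc.matchesPhrase(Arrays.asList(
                Arrays.asList("w1"), Arrays.asList("w2"),
                Arrays.asList("munaZ~im", "munaZ~am"))));
    }
}
```

In this model the flat query fails because the second stem is expected one position after the first, whereas the index holds both at the same position; grouping the stems as alternatives for a single position lets the phrase match.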

Regards
 Daniel
