Re: Aramorph Analyzer
Hi,

Sorry, I (the aramorph maintainer ;-) was absent from the office...

Daniel Naber wrote:
> Analyzers that provide ambiguous terms (i.e. a token with more than one term at the same position) don't work in Lucene 1.4.

This is the correct answer. I've filed a bug about this: http://issues.apache.org/bugzilla/show_bug.cgi?id=23307

> This feature has only recently been added to CVS.

... and I thank you very much for this commit.

Note, however, that you may experience some problems with the query parser, because Buckwalter's Arabic transliteration uses the standard * wildcard character to represent dhal.

Note also that aramorph has a mailing list for such questions: http://lists.nongnu.org/mailman/listinfo/aramorph-users

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
+33 (0)2 99 29 67 78

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
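One possible way around the clash between Buckwalter's * and the query parser's wildcard is to backslash-escape the transliterated term before handing it to the parser. The following is only a minimal sketch: the class name BuckwalterQueryEscaper and the sample string "h*A" are made up for illustration, the character list is an assumption based on the operators documented for Lucene's query syntax, and whether the 1.4 query parser honors the escape for every character should be checked against the version in use. Note that the ~ seen in stems such as munaZ~im is also a query-syntax operator.

```java
public class BuckwalterQueryEscaper {

    // Assumption: the characters that Lucene's query syntax treats as operators.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    /** Puts a backslash in front of every query-syntax character in the term. */
    public static String escape(String term) {
        StringBuilder out = new StringBuilder(term.length() * 2);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "h*A" is an illustrative Buckwalter-style string containing dhal (*).
        System.out.println(escape("h*A"));       // prints h\*A
        System.out.println(escape("munaZ~im"));  // prints munaZ\~im
    }
}
```

Escaping only makes sense for text that goes through the query parser; terms built programmatically do not pass through the query syntax at all.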
Aramorph Analyzer
I wanted to share some results from trying out the Aramorph Arabic analyzer with Lucene. I experimented with a set of 100 web documents in Windows-1256 encoding. Indexing took just over 200 seconds, although I had to increase the heap size to 500 MB; otherwise I got OutOfMemoryError halfway through the documents. The 200 seconds include the time to open the URL connections and tidy the documents to extract the text.

Has anyone done similar experiments with a larger set of Arabic documents? I'm interested in hearing from anyone else who has used Aramorph with Lucene.

Thanks,
Ali
RE: Aramorph Analyzer
Actually, one thing worth mentioning about search: when searching for whole phrases, if there are any ambiguous words in the phrase, the search fails to find the document, even if the phrase was copied and pasted from the original document.

So, for example, I have a document containing this phrase:

The first two words have only one stem each, but the last word has two stems, munaZ~im and munaZ~am, so the entire search query becomes:

Aljh___zp riyAsiy~ munaZ~im munaZ~am

which fails to find any matching documents, whereas a search for

Aljh___zp riyAsiy~

succeeds. Even placing the accent over the ZAH () does not disambiguate the search.

Has anyone found a workaround for this?

ali
Re: Aramorph Analyzer
Safarnejad, Ali (AFIS) wrote:
> Actually, one thing worth mentioning about search: when searching for whole phrases, if there are any ambiguous words in the phrase, the search fails to find the document, even if the phrase was copied and pasted from the original document.
>
> So, for example, I have a document containing this phrase:
>
> The first two words have only one stem each, but the last word has two stems, munaZ~im and munaZ~am, so the entire search query becomes:
>
> Aljh___zp riyAsiy~ munaZ~im munaZ~am
>
> which fails to find any matching documents, whereas a search for
>
> Aljh___zp riyAsiy~
>
> succeeds. Even placing the accent over the ZAH () does not disambiguate the search.
>
> Has anyone found a workaround for this?

Although my knowledge of Arabic is equal to zero, I suggest that you check what your query looks like after it is parsed (Query.toString()), and then compare it to the terms that are actually stored in the index. There is a chance that you, for example, apply the stemmer twice by using an incorrect analyzer, or don't add the stemmed terms to the index, or something similar.

I suggest using Luke (http://www.getopt.org/luke) to diagnose your problem; in the Search tab you can also view the final query terms.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: Aramorph Analyzer
The search query, as I mentioned in my previous email, looks like this:

Aljh___zp riyAsiy~ munaZ~im munaZ~am

In fact, all the individual words are in the index; however, the complete phrase, in double quotes, does not match. Neither does any other phrase that contains ambiguous stems. And that's the problem.
Re: Aramorph Analyzer
On Thursday 16 December 2004 11:59, Safarnejad, Ali (AFIS) wrote:
> Actually, one thing worth mentioning about search: when searching for whole phrases, if there are any ambiguous words in the phrase, the search fails to find the document, even if the phrase was copied and pasted from the original document.

Analyzers that provide ambiguous terms (i.e. a token with more than one term at the same position) don't work in Lucene 1.4. This feature has only recently been added to CVS. The workaround would be to backport that change.

Regards
Daniel
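The "more than one term at the same position" problem can be illustrated with a toy positional index. This is not Lucene code, and it does not reproduce the CVS change mentioned above: the class PositionalPhraseSketch, its methods, and the placeholder terms "w1" and "w2" are all invented for illustration. It assumes the analyzer stacks both stems of the ambiguous word on one position (a position increment of 0), and contrasts a phrase query that spreads the two stems over consecutive positions with one that treats them as alternatives for a single position.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PositionalPhraseSketch {

    // term -> positions at which it occurs in one (toy) document
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int position = -1;

    /** Adds a token; an increment of 0 stacks it on the previous token's position. */
    public void add(String term, int positionIncrement) {
        position += positionIncrement;
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(position);
    }

    /** A slot matches if any of its alternative terms occurs at start + slot index. */
    public boolean matchesPhrase(List<List<String>> slots) {
        for (int start = 0; start <= position; start++) {
            boolean allSlotsMatch = true;
            for (int i = 0; i < slots.size() && allSlotsMatch; i++) {
                boolean anyAlternative = false;
                for (String term : slots.get(i)) {
                    Set<Integer> positions = postings.get(term);
                    if (positions != null && positions.contains(start + i)) {
                        anyAlternative = true;
                        break;
                    }
                }
                allSlotsMatch = anyAlternative;
            }
            if (allSlotsMatch) {
                return true;
            }
        }
        return false;
    }

    /** Three words; the third has two stems indexed at the same position. */
    public static PositionalPhraseSketch sampleDocument() {
        PositionalPhraseSketch doc = new PositionalPhraseSketch();
        doc.add("w1", 1);        // position 0 (placeholder term)
        doc.add("w2", 1);        // position 1 (placeholder term)
        doc.add("munaZ~im", 1);  // position 2
        doc.add("munaZ~am", 0);  // position 2 again
        return doc;
    }

    /** Four consecutive slots, as if the two stems were separate words. */
    public static boolean flatQueryMatches() {
        return sampleDocument().matchesPhrase(Arrays.asList(
                Arrays.asList("w1"), Arrays.asList("w2"),
                Arrays.asList("munaZ~im"), Arrays.asList("munaZ~am")));
    }

    /** Three slots; both stems are alternatives for the last one. */
    public static boolean stackedQueryMatches() {
        return sampleDocument().matchesPhrase(Arrays.asList(
                Arrays.asList("w1"), Arrays.asList("w2"),
                Arrays.asList("munaZ~im", "munaZ~am")));
    }

    public static void main(String[] args) {
        System.out.println("flat: " + flatQueryMatches());       // prints flat: false
        System.out.println("stacked: " + stackedQueryMatches()); // prints stacked: true
    }
}
```

In this model the flat query fails because it expects munaZ~am one position after munaZ~im, which is consistent with the behavior reported in the thread, while the two-word prefix of the phrase still matches.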