Bradford, there is an arabic analyzer in trunk. for farsi there is
currently a patch available:
http://issues.apache.org/jira/browse/LUCENE-1628

one option is not to detect languages at all.
it could be hard for short queries due to the languages you mentioned
borrowing from each other.
but you do not want to apply things like stemming to the wrong language.

instead, you could use ArabicTokenizer + ArabicNormalizationFilter +
PersianNormalizationFilter and just treat it at the script level.

On Thu, Aug 6, 2009 at 3:46 PM, Bradford
Stephens<bradfordsteph...@gmail.com> wrote:
> Hey there,
>
> We're trying to add foreign language support into our new search
> engine -- languages like Arabic, Farsi, and Urdu (that don't work with
> standard analyzers). But our data source doesn't tell us which
> languages we're actually collecting -- we just get blocks of text. Has
> anyone here worked on language detection so we can figure out what
> analyzers to use? Are there commercial solutions?
>
> Much appreciated!
>
> --
> http://www.roadtofailure.com -- The Fringes of Scalability, Social
> Media, and Computer Science
>



-- 
Robert Muir
rcm...@gmail.com

Reply via email to