Re: Language Detection for Analysis?

2009-08-07 Thread Andrzej Bialecki
Otis Gospodnetic wrote: Bradford, If I may: Have a look at http://www.sematext.com/products/language-identifier/index.html And/or http://www.sematext.com/products/multilingual-indexer/index.html .. and a Nutch plugin with similar functionality:

Re: Language Detection for Analysis?

2009-08-07 Thread Jukka Zitting
Hi, On Fri, Aug 7, 2009 at 12:31 PM, Andrzej Bialecki a...@getopt.org wrote: .. and a Nutch plugin with similar functionality: http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html See also TIKA-209 [1] where I'm currently integrating the Nutch code

Re: Language Detection for Analysis?

2009-08-07 Thread Grant Ingersoll
There are several free Language Detection libraries out there, as well as a few commercial ones. I think Karl Wettin has even written one as a plugin for Lucene. Nutch also has one, AIUI. I would just Google language detection. Also see

Language Detection for Analysis?

2009-08-06 Thread Bradford Stephens
Hey there, We're trying to add foreign language support into our new search engine -- languages like Arabic, Farsi, and Urdu (that don't work with standard analyzers). But our data source doesn't tell us which languages we're actually collecting -- we just get blocks of text. Has anyone here

Re: Language Detection for Analysis?

2009-08-06 Thread Robert Muir
Bradford, there is an Arabic analyzer in trunk. For Farsi there is currently a patch available: http://issues.apache.org/jira/browse/LUCENE-1628 One option is not to detect languages at all. It could be hard for short queries due to the languages you mentioned borrowing from each other. But you

Re: Language Detection for Analysis?

2009-08-06 Thread Cheolgoo Kang
Are those 'blocks of text' (Unicode) Java strings? I don't think this is the case, but if so, use Character.UnicodeBlock to identify the language of the text. Or are they just text files with unknown character encoding? Then ICU has a 'charset detector' that you can use. This feature 'suggests'
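
A minimal sketch of the Character.UnicodeBlock suggestion above, assuming the text is already a decoded Java string. The class and method names (BlockGuess, countBlocks, dominantBlock) are hypothetical and not from the thread. Note that a block only narrows down the script: Arabic, Farsi, and Urdu letters largely fall in the same ARABIC block, so this by itself would not tell those three languages apart.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene, Solr, or Nutch.
final class BlockGuess {

    /** Counts letters per Unicode block; digits, punctuation and whitespace are ignored. */
    static Map<Character.UnicodeBlock, Integer> countBlocks(String text) {
        Map<Character.UnicodeBlock, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);      // step over surrogate pairs correctly
            if (!Character.isLetter(cp)) {
                continue;
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {               // null for code points outside any block
                counts.merge(block, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the block holding the most letters, or null if the text has no letters. */
    static Character.UnicodeBlock dominantBlock(String text) {
        Character.UnicodeBlock best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeBlock, Integer> e : countBlocks(text).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // "\u0633\u0644\u0627\u0645" is the word "salam" written in Arabic script.
        System.out.println(dominantBlock("\u0633\u0644\u0627\u0645 123"));
    }
}
```

The dominant block could then be used to choose an analyzer family, with a real language identifier applied afterwards where several languages share a script.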

Re: Language Detection for Analysis?

2009-08-06 Thread Robert Muir
FYI, you can use the block property, but I think even better is to use the Unicode script property: http://unicode.org/reports/tr24/ . This is easier because some characters are common across different scripts. Also, some scripts span multiple Unicode blocks. This is the direction I was heading
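
As a rough illustration of the script-property approach, here is a sketch using java.lang.Character.UnicodeScript. That class was added in Java 7, after this thread, so it is an assumption about a modern JDK and not the code the poster was working on; the names ScriptGuess and dominantScript are hypothetical. Characters whose script is COMMON or INHERITED (digits, most punctuation, combining marks) are skipped, which is the "common across different scripts" case mentioned above.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene, Solr, or ICU.
final class ScriptGuess {

    /** Returns the most frequent script in the text, or null if nothing script-specific is found. */
    static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            // Skip characters that carry no script information of their own.
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED
                    && script != Character.UnicodeScript.UNKNOWN) {
                counts.merge(script, 1, Integer::sum);
            }
        });
        Character.UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // U+FB56 lies in the Arabic Presentation Forms-A block, outside the main
        // Arabic block, yet its script property is still ARABIC.
        System.out.println(dominantScript("\uFB56\u0633\u0644\u0627\u0645, 2009!"));
    }
}
```

Because the script property is assigned per character and not per code-point range, presentation forms and supplement blocks map to the same script value without a hand-maintained list of blocks.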

Re: Language Detection for Analysis?

2009-08-06 Thread Lucas F. A. Teixeira
Google Translate just released (last week) its language API with translation and LANGUAGE DETECTION. :) It's very simple to use, and you can query it with some text to determine which language it is. Here is a simple example using Groovy, but all you need is the URL to query:

Re: Language Detection for Analysis?

2009-08-06 Thread Otis Gospodnetic
, NER, IR - Original Message From: Bradford Stephens bradfordsteph...@gmail.com To: solr-user@lucene.apache.org; java-u...@lucene.apache.org Sent: Thursday, August 6, 2009 3:46:21 PM Subject: Language Detection for Analysis? Hey there, We're trying to add foreign language