Otis Gospodnetic wrote:
Bradford,
If I may:
Have a look at http://www.sematext.com/products/language-identifier/index.html
And/or http://www.sematext.com/products/multilingual-indexer/index.html
Hi,
On Fri, Aug 7, 2009 at 12:31 PM, Andrzej Bialecki <a...@getopt.org> wrote:
.. and a Nutch plugin with similar functionality:
http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html
See also TIKA-209 [1] where I'm currently integrating the Nutch code
There are several free Language Detection libraries out there, as well
as a few commercial ones. I think Karl Wettin has even written one as
a plugin for Lucene. Nutch also has one, AIUI. I would just Google
language detection.
Also see
Hey there,
We're trying to add foreign language support into our new search
engine -- languages like Arabic, Farsi, and Urdu (that don't work with
standard analyzers). But our data source doesn't tell us which
languages we're actually collecting -- we just get blocks of text. Has
anyone here
Bradford, there is an Arabic analyzer in trunk. For Farsi there is
currently a patch available:
http://issues.apache.org/jira/browse/LUCENE-1628
One option is not to detect languages at all.
Detection could be hard for short queries, because the languages you
mentioned borrow from each other.
but you
Are those 'blocks of text' (Unicode) Java strings? I don't think
this is the case, but if so, you can use Character.UnicodeBlock to
identify the language of the text.
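
A minimal sketch of the Character.UnicodeBlock idea, assuming the text is already a decoded Java string. The class and method names (BlockHistogram, count, dominant) are made up for illustration and come from no library. It counts letters per Unicode block and reports the most frequent block. Note that this narrows the text down to a block rather than a language: Arabic, Farsi, and Urdu letters mostly fall into the same ARABIC block, so it can only be a first filter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene, Nutch, or Tika.
final class BlockHistogram {

    // Counts letters per Unicode block. Spaces, digits, and punctuation are skipped.
    static Map<Character.UnicodeBlock, Integer> count(String text) {
        Map<Character.UnicodeBlock, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // handles supplementary characters
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {               // null means the code point is in no block
                counts.merge(block, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the block with the most letters, or null if the text has no letters.
    static Character.UnicodeBlock dominant(String text) {
        Character.UnicodeBlock best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeBlock, Integer> e : count(text).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Calling BlockHistogram.dominant(text) on each incoming document would give a block you could use to pick an analyzer family; telling Arabic from Farsi or Urdu would still need something statistical on top.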
Or is it just text files with an unknown character encoding? Then ICU
has a 'charset detector' that you can use. This feature 'suggests'
FYI, you can use the block property, but I think even better is to use
the Unicode script property: http://unicode.org/reports/tr24/ . This
is easier because some characters are common across different scripts.
Also, some scripts span multiple Unicode blocks.
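
A rough sketch of the script-property approach, using Character.UnicodeScript from the JDK. That class only exists in Java 7 and later, so treating it as available is an assumption. The class name ScriptHistogram is hypothetical. The sketch counts code points per script and ignores the COMMON, INHERITED, and UNKNOWN pseudo-scripts, which is where shared punctuation, spaces, and combining marks end up.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper for illustration; assumes Java 7+ for Character.UnicodeScript.
final class ScriptHistogram {

    // Counts code points per Unicode script, skipping characters that
    // belong to no particular script (shared punctuation, spaces, marks).
    static Map<Character.UnicodeScript, Integer> count(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED
                    || script == Character.UnicodeScript.UNKNOWN) {
                continue;
            }
            counts.merge(script, 1, Integer::sum);
        }
        return counts;
    }
}
```

The second point above shows up directly: U+FB56 lives in the Arabic Presentation Forms-A block, yet Character.UnicodeScript.of(0xFB56) still reports ARABIC, so one script check covers what would otherwise need several block checks.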
This is the direction I was heading
Google Translate just released (last week) its language API with translation
and LANGUAGE DETECTION.
:)
It's very simple to use, and you can query it with some text to determine
which language it is.
Here is a simple example using Groovy, but all you need is the URL to
query:
----- Original Message -----
From: Bradford Stephens <bradfordsteph...@gmail.com>
To: solr-user@lucene.apache.org; java-u...@lucene.apache.org
Sent: Thursday, August 6, 2009 3:46:21 PM
Subject: Language Detection for Analysis?
Hey there,
We're trying to add foreign language