Re: Language Detection for Analysis?

2009-08-07 Thread Andrzej Bialecki

Otis Gospodnetic wrote:

Bradford,

If I may:

Have a look at http://www.sematext.com/products/language-identifier/index.html
And/or http://www.sematext.com/products/multilingual-indexer/index.html


.. and a Nutch plugin with similar functionality:

http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Language Detection for Analysis?

2009-08-07 Thread Jukka Zitting
Hi,

On Fri, Aug 7, 2009 at 12:31 PM, Andrzej Bialecki <a...@getopt.org> wrote:
 .. and a Nutch plugin with similar functionality:

 http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html

See also TIKA-209 [1] where I'm currently integrating the Nutch code
to work with Tika.

Tika 0.5 will have built-in language detection based on this.

[1] https://issues.apache.org/jira/browse/TIKA-209

BR,

Jukka Zitting


Re: Language Detection for Analysis?

2009-08-07 Thread Grant Ingersoll
There are several free Language Detection libraries out there, as well  
as a few commercial ones.  I think Karl Wettin has even written one as  
a plugin for Lucene.  Nutch also has one, AIUI.  I would just Google  
language detection.


Also see http://www.lucidimagination.com/search/?q=language+detection,  
as this has been brought up many times before and I'm sure there are  
links in the archives.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






Language Detection for Analysis?

2009-08-06 Thread Bradford Stephens
Hey there,

We're trying to add foreign language support into our new search
engine -- languages like Arabic, Farsi, and Urdu (that don't work with
standard analyzers). But our data source doesn't tell us which
languages we're actually collecting -- we just get blocks of text. Has
anyone here worked on language detection so we can figure out what
analyzers to use? Are there commercial solutions?

Much appreciated!

-- 
http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science


Re: Language Detection for Analysis?

2009-08-06 Thread Robert Muir
Bradford, there is an Arabic analyzer in trunk. For Farsi there is
currently a patch available:
http://issues.apache.org/jira/browse/LUCENE-1628

One option is not to detect languages at all.
Detection could be hard for short queries, because the languages you
mentioned borrow from each other, and you do not want to apply things
like stemming to the wrong language.

Instead, you could use ArabicTokenizer + ArabicNormalizationFilter +
PersianNormalizationFilter and just treat the text at the script level.
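
As a rough illustration of what treating text at the script level can mean, here is a minimal plain-Java sketch that folds a few Arabic-script spelling variants together without knowing which language the text is in. It is not the Lucene filters named above: the class name ScriptLevelNormalizerSketch and the particular character mappings are assumptions chosen for illustration only.

```java
class ScriptLevelNormalizerSketch {

    /** Folds a few Arabic-script variants so that spelling differences match (illustrative mappings only). */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Drop tatweel (U+0640) and the short-vowel marks (U+064B..U+0652).
            if (c == '\u0640' || (c >= '\u064B' && c <= '\u0652')) {
                continue;
            }
            switch (c) {
                case '\u0622': // alef with madda above
                case '\u0623': // alef with hamza above
                case '\u0625': // alef with hamza below
                    out.append('\u0627'); // bare alef
                    break;
                case '\u0649': // alef maksura
                case '\u06CC': // Farsi yeh
                    out.append('\u064A'); // Arabic yeh
                    break;
                case '\u06A9': // keheh, the kaf form used in Farsi and Urdu
                    out.append('\u0643'); // Arabic kaf
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A word written with alef-hamza; the hamza form is folded to bare alef.
        System.out.println(normalize("\u0623\u062D\u0645\u062F"));
    }
}
```

The point of the design is that no language decision is needed: every document and every query goes through the same folding, so a short query in any of the three languages still matches.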





-- 
Robert Muir
rcm...@gmail.com


Re: Language Detection for Analysis?

2009-08-06 Thread Cheolgoo Kang
Are those 'blocks of text' (Unicode) Java strings? I don't think
that is the case, but if so, use Character.UnicodeBlock to help identify
the language of the text.
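
A minimal sketch of the Character.UnicodeBlock idea, assuming the text is already a decoded Java string. The class and method names (UnicodeBlockSketch, countBlocks, dominantBlock) are made up for this example. Note that Arabic, Farsi and Urdu letters mostly fall into the same ARABIC block, so this narrows down the script rather than the exact language.

```java
import java.util.HashMap;
import java.util.Map;

class UnicodeBlockSketch {

    /** Counts letters per Unicode block, walking the string by code point. */
    static Map<Character.UnicodeBlock, Integer> countBlocks(String text) {
        Map<Character.UnicodeBlock, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation and whitespace
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {
                counts.merge(block, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the block holding the most letters, or null if the text has no letters. */
    static Character.UnicodeBlock dominantBlock(String text) {
        Character.UnicodeBlock best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeBlock, Integer> e : countBlocks(text).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(dominantBlock("\u0633\u0644\u0627\u0645 hi"));
    }
}
```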

Or is it just text files with an unknown character encoding? Then ICU
has a 'charset detector' that you can use. This feature 'suggests' a
charset (with some probability values) from a byte stream. I don't
know how it performs in terms of accuracy and speed. See
http://userguide.icu-project.org/conversion/detection.

Hope it helps.

- Cheolgoo Kang






Re: Language Detection for Analysis?

2009-08-06 Thread Robert Muir
FYI, you can use the block property, but I think it is even better to use
the Unicode script property: http://unicode.org/reports/tr24/ . This
is easier because some characters are common across different scripts.
Also, some scripts span multiple Unicode blocks.
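
For reference, Java 7 and later (so newer than this thread) expose the script property as Character.UnicodeScript. Below is a minimal sketch of a per-script histogram built on it; the class and method names (ScriptHistogramSketch, countScripts) are made up for this example. Characters shared between scripts come back as COMMON or INHERITED and are skipped here.

```java
import java.util.EnumMap;
import java.util.Map;

class ScriptHistogramSketch {

    /** Counts code points per Unicode script, ignoring shared and unassigned ones. */
    static Map<Character.UnicodeScript, Integer> countScripts(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED
                    || script == Character.UnicodeScript.UNKNOWN) {
                continue; // spaces, digits, punctuation, combining marks, unassigned
            }
            counts.merge(script, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countScripts("abc, \u0633\u0644\u0627\u0645 123"));
    }
}
```

A tokenizer could use such counts, or the script of each individual run of characters, to decide how to split and normalize the text.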

This is the direction I was heading in LUCENE-1488: based upon the
script, tokenize text in different ways, etc.  I think the last patch
I uploaded puts the script in the token flags as well.






-- 
Robert Muir
rcm...@gmail.com


Re: Language Detection for Analysis?

2009-08-06 Thread Lucas F. A. Teixeira
Google Translate just released (last week) its language API with translation
and LANGUAGE DETECTION.
:)

It's very simple to use, and you can query it with some text to determine which
language it is.

Here is a simple example using Groovy, but all you need is the URL to
query: http://groovyconsole.appspot.com/view.groovy?id=16


[]s,

Lucas Frare Teixeira .·.
- lucas...@gmail.com
- blog.lucastex.com
- twitter.com/lucastex





Re: Language Detection for Analysis?

2009-08-06 Thread Otis Gospodnetic
Bradford,

If I may:

Have a look at http://www.sematext.com/products/language-identifier/index.html
And/or http://www.sematext.com/products/multilingual-indexer/index.html

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message -----
 From: Bradford Stephens <bradfordsteph...@gmail.com>
 To: solr-user@lucene.apache.org; java-u...@lucene.apache.org
 Sent: Thursday, August 6, 2009 3:46:21 PM
 Subject: Language Detection for Analysis?
 
 Hey there,
 
 We're trying to add foreign language support into our new search
 engine -- languages like Arabic, Farsi, and Urdu (that don't work with
 standard analyzers). But our data source doesn't tell us which
 languages we're actually collecting -- we just get blocks of text. Has
 anyone here worked on language detection so we can figure out what
 analyzers to use? Are there commercial solutions?
 
 Much appreciated!
 
 -- 
 http://www.roadtofailure.com -- The Fringes of Scalability, Social
 Media, and Computer Science
 