Re: Language Detection for Analysis?
Otis Gospodnetic wrote:
> Bradford,
>
> If I may: have a look at http://www.sematext.com/products/language-identifier/index.html
> and/or http://www.sematext.com/products/multilingual-indexer/index.html

... and a Nutch plugin with similar functionality:
http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Language Detection for Analysis?
Hi,

On Fri, Aug 7, 2009 at 12:31 PM, Andrzej Bialecki <a...@getopt.org> wrote:
> .. and a Nutch plugin with similar functionality:
> http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html

See also TIKA-209 [1], where I'm currently integrating the Nutch code to work with Tika. Tika 0.5 will have built-in language detection based on this.

[1] https://issues.apache.org/jira/browse/TIKA-209

BR,
Jukka Zitting
Re: Language Detection for Analysis?
There are several free language detection libraries out there, as well as a few commercial ones. I think Karl Wettin has even written one as a plugin for Lucene. Nutch also has one, as I understand it. I would just Google "language detection". Also see http://www.lucidimagination.com/search/?q=language+detection, as this has been brought up many times before and I'm sure there are links in the archives.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Language Detection for Analysis?
Hey there,

We're trying to add foreign language support into our new search engine -- languages like Arabic, Farsi, and Urdu (that don't work with standard analyzers). But our data source doesn't tell us which languages we're actually collecting -- we just get blocks of text. Has anyone here worked on language detection so we can figure out what analyzers to use? Are there commercial solutions?

Much appreciated!

--
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
Re: Language Detection for Analysis?
Bradford,

There is an Arabic analyzer in trunk. For Farsi there is currently a patch available: http://issues.apache.org/jira/browse/LUCENE-1628

One option is not to detect languages at all. Detection could be hard for short queries, because the languages you mentioned borrow from each other, and you do not want to apply things like stemming to the wrong language. Instead, you could use ArabicTokenizer + ArabicNormalizationFilter + PersianNormalizationFilter and just treat the text at the script level.

--
Robert Muir
rcm...@gmail.com
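To make "treat it at the script level" concrete, here is a minimal plain-Java sketch of the kind of character folding such normalization filters perform. It is an illustration only: the class name `ScriptLevelFolding` and the hand-picked subset of foldings are assumptions made for this example, not the rule set or the code of the Lucene filters named above.

```java
final class ScriptLevelFolding {

    /**
     * Folds a few Arabic-script spelling variants so that Arabic and Persian
     * forms of the same word compare equal. Hypothetical subset, for illustration.
     */
    static String fold(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Drop tatweel (U+0640) and the short-vowel marks U+064B..U+0652.
            if (c == '\u0640' || (c >= '\u064B' && c <= '\u0652')) {
                continue;
            }
            switch (c) {
                case '\u0622': // alef with madda above
                case '\u0623': // alef with hamza above
                case '\u0625': // alef with hamza below
                    out.append('\u0627'); // bare alef
                    break;
                case '\u0649': // alef maksura
                case '\u06CC': // Farsi yeh
                    out.append('\u064A'); // Arabic yeh
                    break;
                case '\u06A9': // keheh (Persian kaf)
                    out.append('\u0643'); // Arabic kaf
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "ketab" (book) spelled with Persian keheh and with Arabic kaf
        String persian = "\u06A9\u062A\u0627\u0628";
        String arabic = "\u0643\u062A\u0627\u0628";
        System.out.println(fold(persian).equals(fold(arabic)));
    }
}
```

Because the folding depends only on the characters and not on the language, no detection step is needed before it runs; language-specific steps such as stemming are simply left out.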
Re: Language Detection for Analysis?
Are those 'blocks of text' (Unicode) Java strings? I don't think that is the case, but if so, you can use Character.UnicodeBlock to identify the language of the text.

Or are they just text files with an unknown character encoding? In that case, ICU has a 'charset detector' that you can use. This feature 'suggests' a charset (with some probability values) from a byte stream. I don't know how it performs in terms of accuracy and speed. See http://userguide.icu-project.org/conversion/detection.

Hope it helps.

- Cheolgoo Kang
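The Character.UnicodeBlock suggestion can be sketched roughly as follows, assuming the text is already a decoded Java string. The class and method names (`BlockHistogram`, `countBlocks`, `dominantBlock`) are made up for this example. Note that a block only narrows things down to a range of characters: Arabic, Farsi, and Urdu text would all mostly land in the same ARABIC block, so this does not separate those languages from each other.

```java
import java.util.HashMap;
import java.util.Map;

final class BlockHistogram {

    /** Counts letters per Unicode block, skipping spaces, digits, and punctuation. */
    static Map<Character.UnicodeBlock, Integer> countBlocks(String text) {
        Map<Character.UnicodeBlock, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) { // null means the code point is in no defined block
                counts.merge(block, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the block holding the most letters, or null if the text has no letters. */
    static Character.UnicodeBlock dominantBlock(String text) {
        Character.UnicodeBlock best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeBlock, Integer> e : countBlocks(text).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // "salam" written in Arabic script, given as Unicode escapes
        System.out.println(dominantBlock("\u0633\u0644\u0627\u0645"));
        System.out.println(dominantBlock("Hello"));
    }
}
```

The dominant block could then be used to pick an analyzer family, for example routing anything mostly in the ARABIC block to an Arabic-script analysis chain.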
Re: Language Detection for Analysis?
FYI, you can use the block property, but I think it is even better to use the Unicode script property: http://unicode.org/reports/tr24/

This is easier because some characters are common across different scripts, and some scripts span multiple Unicode blocks.

This is the direction I was heading with LUCENE-1488: based upon the script, tokenize text in different ways, etc. I think the last patch I uploaded puts the script in the token flags as well.

On Thu, Aug 6, 2009 at 6:44 PM, Cheolgoo Kang <app...@gmail.com> wrote:
> Are those 'blocks of text' (Unicode) Java strings? I don't think that is
> the case, but if so, you can use Character.UnicodeBlock to identify the
> language of the text.

--
Robert Muir
rcm...@gmail.com
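As a rough sketch of the script-property approach: Java 7 and later expose the script property as java.lang.Character.UnicodeScript (it was not in the JDK when this thread was written). The class and method names below (`ScriptHistogram`, `countScripts`, `dominantScript`) are invented for this example, and this is not the LUCENE-1488 code.

```java
import java.util.EnumMap;
import java.util.Map;

final class ScriptHistogram {

    /** Counts code points per script, ignoring COMMON, INHERITED, and UNKNOWN. */
    static Map<Character.UnicodeScript, Integer> countScripts(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            // Spaces, digits, most punctuation, and combining marks are shared
            // between scripts, so they carry no signal about which script this is.
            if (script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED
                    || script == Character.UnicodeScript.UNKNOWN) {
                return;
            }
            counts.merge(script, 1, Integer::sum);
        });
        return counts;
    }

    /** Returns the most frequent script, or UNKNOWN if nothing was counted. */
    static Character.UnicodeScript dominantScript(String text) {
        return countScripts(text).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(Character.UnicodeScript.UNKNOWN);
    }

    public static void main(String[] args) {
        // "salam" in Arabic script followed by digits and punctuation
        System.out.println(dominantScript("\u0633\u0644\u0627\u0645 123!"));
        // U+FB56 sits in the Arabic Presentation Forms-A block, not the Arabic block
        System.out.println(Character.UnicodeScript.of(0xFB56));
    }
}
```

Compared with a block-based count, this needs no hand-maintained list of which blocks belong together, and shared characters are filtered out by their script value rather than by a separate letter check.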
Re: Language Detection for Analysis?
Google Translate just released (last week) its language API with translation and LANGUAGE DETECTION. :)

It's very simple to use: you can query it with some text to determine which language it is.

Here is a simple example using Groovy, but all you need is the URL to query: http://groovyconsole.appspot.com/view.groovy?id=16

[]s,

Lucas Frare Teixeira .·.
- lucas...@gmail.com
- blog.lucastex.com
- twitter.com/lucastex
Re: Language Detection for Analysis?
Bradford,

If I may: have a look at http://www.sematext.com/products/language-identifier/index.html
and/or http://www.sematext.com/products/multilingual-indexer/index.html

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

----- Original Message -----
From: Bradford Stephens <bradfordsteph...@gmail.com>
To: solr-user@lucene.apache.org; java-u...@lucene.apache.org
Sent: Thursday, August 6, 2009 3:46:21 PM
Subject: Language Detection for Analysis?