The LanguageIdentifier plugin finds a page's language in two ways:

1. HTML tags (detect)
2. Statistical language identification (identify)
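Roughly, the plugin tries the HTML hints first and only falls back to statistical identification when the page declares nothing. Here is a minimal, self-contained sketch of that order. It is not the plugin's actual code: the class name, the two regular expressions, and the `normalize` helper are my own assumptions, and the statistical step is passed in as a plain function so you can plug in whatever identifier you use.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "HTML tags first, statistics second".
public class LangDetectSketch {

    // Matches the lang attribute on the <html> element, e.g. <html lang="tr">.
    private static final Pattern HTML_LANG = Pattern.compile(
        "<html[^>]*\\slang\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Matches <meta http-equiv="Content-Language" content="de-DE">.
    private static final Pattern META_LANG = Pattern.compile(
        "<meta[^>]*http-equiv\\s*=\\s*[\"']content-language[\"'][^>]*"
            + "content\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Step 1 (detect): look for a language declared in the markup.
    static Optional<String> detect(String html) {
        for (Pattern p : new Pattern[] {HTML_LANG, META_LANG}) {
            Matcher m = p.matcher(html);
            if (m.find()) {
                return Optional.of(normalize(m.group(1)));
            }
        }
        return Optional.empty();
    }

    // Assumption: keep only the primary subtag ("en-US" -> "en").
    // The real plugin maps declared values through langmappings.properties.
    static String normalize(String declared) {
        String value = declared.trim().toLowerCase(Locale.ROOT);
        return value.split("[-_]")[0];
    }

    // Step 2 (identify): only runs when step 1 found nothing.
    static String findLanguage(String html, Function<String, String> identify) {
        return detect(html).orElseGet(() -> identify.apply(html));
    }
}
```

For example, `LangDetectSketch.findLanguage(html, text -> myIdentifier(text))` returns the declared language when the page has one and calls `myIdentifier` (a placeholder name for your statistical identifier) otherwise.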
When the plugin reads the language declared in the HTML tags, it uses the mappings in:
http://svn.apache.org/viewvc/nutch/trunk/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/langmappings.properties?view=markup

If the plugin cannot find a language declaration in the HTML tags, it uses statistical language identification (Tika's LanguageIdentifier) to determine the page language:
http://svn.apache.org/viewvc/tika/trunk/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties?revision=1181278&view=markup

On Mon, Jun 3, 2013 at 5:26 PM, Tejas Patil <[email protected]> wrote:

> http://svn.apache.org/viewvc/nutch/trunk/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/langmappings.properties?view=markup
>
> On Mon, Jun 3, 2013 at 6:35 AM, H. Coskun Gunduz <[email protected]> wrote:
>
> > Hi,
> >
> > I'm looking for the list of Implemented Languages in Language Identifier
> > Plugin.
> >
> > There's a list in wiki page [1] but the page last edited almost four years
> > ago. I'm not sure if the list there is up-to-date or not.
> >
> > Any help will be appreciated.
> >
> > Thanks.
> >
> > coskun...
> >
> > [1] http://wiki.apache.org/nutch/LanguageIdentifierPlugin

