Re: Multilanguage support in Nutch 1.0

2009-09-30 Thread David Jashi
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

 hi

 try to activate the language-identifier plugin
 you must add it in the nutch-site.xml file, in the
 <name>plugin.includes</name> section.

Shame on me! Thanks a lot.


 it's something like this:

 <property>
   <name>plugin.includes</name>
   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable
   protocol-httpclient, but be aware of possible intermittent problems with the
   underlying commons-httpclient library.
   </description>
 </property>



Re: Multilanguage support in Nutch 1.0

2009-09-30 Thread David Jashi
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

 hi

 try to activate the language-identifier plugin
 you must add it in the nutch-site.xml file, in the
 <name>plugin.includes</name> section.

Ooops. It IS activated.

2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -
Language Identification Parser/Filter (language-identifier)

But fetched pages are not passed to it, I reckon.


RE: Multilanguage support in Nutch 1.0

2009-09-30 Thread BELLINI ADAM

hi,
do you have some 'lang' metadata on the pages? Because the plugin first tries
to get the language from the metadata.
You can see this in the Java source of the plugin, LanguageIndexingFilter.java:


// check if LANGUAGE found, possibly put there by HTMLLanguageParser
String lang = parse.getData().getParseMeta().get(Metadata.LANGUAGE);

// check if the HTTP header tells us the language
if (lang == null) {
  lang = parse.getData().getContentMeta().get(Response.CONTENT_LANGUAGE);
}

Also try using Luke to check all the metadata in your index.
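
To make the lookup order in that excerpt easier to follow, here is a minimal standalone sketch of the same two-step fallback. It does not use the Nutch classes: plain maps stand in for the parse metadata and the HTTP response metadata, and the key names PARSE_META_LANGUAGE and HEADER_CONTENT_LANGUAGE are hypothetical placeholders for Metadata.LANGUAGE and Response.CONTENT_LANGUAGE. It covers only the two checks quoted above, not whatever the plugin does afterwards.

```java
import java.util.HashMap;
import java.util.Map;

public class LanguageLookupSketch {
    // Hypothetical keys; placeholders for Metadata.LANGUAGE and
    // Response.CONTENT_LANGUAGE in the real plugin source.
    static final String PARSE_META_LANGUAGE = "language";
    static final String HEADER_CONTENT_LANGUAGE = "Content-Language";

    // Mirrors the quoted excerpt: prefer the value set at parse time,
    // then fall back to the HTTP header. May still return null.
    static String resolveLanguage(Map<String, String> parseMeta,
                                  Map<String, String> contentMeta) {
        String lang = parseMeta.get(PARSE_META_LANGUAGE);
        if (lang == null) {
            lang = contentMeta.get(HEADER_CONTENT_LANGUAGE);
        }
        return lang;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        Map<String, String> contentMeta = new HashMap<>();

        // Neither source has a language: nothing is resolved.
        System.out.println(resolveLanguage(parseMeta, contentMeta));

        // Only the HTTP header carries a language.
        contentMeta.put(HEADER_CONTENT_LANGUAGE, "de");
        System.out.println(resolveLanguage(parseMeta, contentMeta));

        // The parse-time value wins over the header.
        parseMeta.put(PARSE_META_LANGUAGE, "ka");
        System.out.println(resolveLanguage(parseMeta, contentMeta));
    }
}
```

If both sources come back empty for the test pages, that would be consistent with no language showing up from these two checks, which is why looking at the stored fields in Luke is a useful next step.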






RE: Multilanguage support in Nutch 1.0

2009-09-29 Thread BELLINI ADAM

hi 

try to activate the language-identifier plugin
you must add it in the nutch-site.xml file, in the <name>plugin.includes</name>
section.

it's something like this:



<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
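
Since the description says that any plugin not matching the expression is excluded, a quick way to see which plugin ids a given value would let through is to test it with java.util.regex. The sketch below is only an illustration: the class name PluginIncludesCheck is made up, and it assumes the whole plugin id has to match the expression, which may not be exactly how Nutch applies it.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // The plugin.includes value from the property above, copied verbatim.
    static final String INCLUDES =
        "protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)"
        + "|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)"
        + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
        + "|language-identifier";

    // Assumption: a plugin counts as included only if its whole id matches.
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {
            "language-identifier", "parse-html", "analysis-de", "analysis-ge"
        };
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(INCLUDES, id));
        }

        // Hypothetical extension of the value that also names the two
        // analysis plugins mentioned in the original question.
        String extended = INCLUDES + "|analysis-(de|ge)";
        System.out.println("analysis-ge (extended) -> "
            + isIncluded(extended, "analysis-ge"));
    }
}
```

With this sample value taken as-is, language-identifier and parse-html match but analysis-de and analysis-ge do not, so anyone copying it would also need an alternative such as analysis-(de|ge) to keep those two plugins.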


 From: da...@jashi.ge
 Date: Tue, 29 Sep 2009 18:59:52 +0400
 Subject: Multilanguage support in Nutch 1.0
 To: nutch-user@lucene.apache.org
 
 Hello, all.
 
 I've got a bit of trouble with Nutch 1.0 and multilanguage support:
 
 I have a fresh install of Nutch and two analysis plugins I'd like to turn on:
 analysis-de (German) and analysis-ge (Georgian).
 Here are the innards of my seed file:
 ---
 http://212.72.133.54/l/test.html
 http://212.72.133.54/l/de.html
 ---
 The first is Georgian, the other German. When I run
 
 bin/nutch crawl seed -dir crawl -threads 10 -depth 2
 
 there is not the slightest sign of anything calling the analysis
 plug-ins, even though hadoop.log clearly states that they are
 on and active:
 ---
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - crawl started in: crawl
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - rootUrlDir = seed
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - threads = 10
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - depth = 2
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: starting
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: crawlDb: 
 crawl/crawldb
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: urlDir: seed
 2009-09-29 16:39:13,390 INFO  crawl.Injector - Injector: Converting
 injected urls to crawl db entries.
 2009-09-29 16:39:13,421 WARN  mapred.JobClient - Use
 GenericOptionsParser for parsing the arguments. Applications should
 implement Tool for the same.
 2009-09-29 16:39:15,546 INFO  plugin.PluginRepository - Plugins:
 looking in: C:\cygwin\opt\nutch\plugins
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Plugin
 Auto-activation mode: [true]
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Registered Plugins:
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - the nutch
 core extension points (nutch-extensionpoints)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Basic Query
 Filter (query-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Lucene
 Analysers (lib-lucene-analyzers)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Basic URL
 Normalizer (urlnormalizer-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Language
 Identification Parser/Filter (language-identifier)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Html Parse
 Plug-in (parse-html)
 
 !
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Georgian
 Analysis Plug-in (analysis-ge)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - German
 Analysis Plug-in (analysis-de)
 !
 
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Basic
 Indexing Filter (index-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Basic
 Summarizer Plug-in (summary-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Site Query
 Filter (query-site)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - HTTP
 Framework (lib-http)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Text Parse
 Plug-in (parse-text)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - More Query
 Filter (query-more)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Regex URL
 Filter (urlfilter-regex)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Pass-through
 URL Normalizer (urlnormalizer-pass)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Http Protocol
 Plug-in (protocol-http)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Regex URL
 Normalizer (urlnormalizer-regex)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - OPIC Scoring
 Plug-in (scoring-opic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - CyberNeko
 HTML Parser (lib-nekohtml)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - JavaScript
 Parser (parse-js)
 2009-09-29