Re: Multilanguage support in Nutch 1.0
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

> hi
> try to activate the language-identifier plugin. You must add it in the
> nutch-site.xml file, in the <name>plugin.includes</name> section.

Shame on me! Thanks a lot.

> It's something like this:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
> From: da...@jashi.ge
> Date: Tue, 29 Sep 2009 18:59:52 +0400
> Subject: Multilanguage support in Nutch 1.0
> To: nutch-user@lucene.apache.org
>
> Hello, all.
>
> I've got a bit of trouble with Nutch 1.0 and multilanguage support:
>
> I have a fresh install of Nutch and two analysis plugins I'd like to turn on:
> analysis-de (German) and analysis-ge (Georgian)
>
> Here are the innards of my seed file:
> ---
> http://212.72.133.54/l/test.html
> http://212.72.133.54/l/de.html
> ---
>
> The first is Georgian, the other German.
>
> When I run
>
> bin/nutch crawl seed -dir crawl -threads 10 -depth 2
>
> there is not the slightest sign of anyone calling any analysis plug-ins,
> even though hadoop.log clearly states that they are on and active:
>
> ---
> 2009-09-29 16:39:13,328 INFO crawl.Crawl - crawl started in: crawl
> 2009-09-29 16:39:13,328 INFO crawl.Crawl - rootUrlDir = seed
> 2009-09-29 16:39:13,328 INFO crawl.Crawl - threads = 10
> 2009-09-29 16:39:13,328 INFO crawl.Crawl - depth = 2
> 2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: starting
> 2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
> 2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: urlDir: seed
> 2009-09-29 16:39:13,390 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2009-09-29 16:39:13,421 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2009-09-29 16:39:15,546 INFO plugin.PluginRepository - Plugins: looking in: C:\cygwin\opt\nutch\plugins
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Registered Plugins:
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Lucene Analysers (lib-lucene-analyzers)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> !
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Georgian Analysis Plug-in (analysis-ge)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - German Analysis Plug-in (analysis-de)
> !
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Site Query Filter (query-site)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - HTTP Framework (lib-http)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - More Query Filter (query-more)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> 2009-09-29
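Since plugin.includes is described as a regular expression over plugin directory names, a quick way to see which plugin ids a given value would admit is to test it directly. The following is only a minimal sketch, not Nutch's own code: the class name PluginIncludesCheck and the method isIncluded are made up for illustration, and it assumes that a plugin counts as included when its whole id matches the expression.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // Value copied from the sample plugin.includes property quoted in this thread.
    static final String INCLUDES =
        "protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)"
        + "|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)"
        + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
        + "|language-identifier";

    // Assumption: a plugin is included only if its entire id matches the expression.
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"language-identifier", "parse-html", "analysis-de", "analysis-ge"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

Under that whole-id assumption, the sample value admits language-identifier and parse-html but not analysis-de or analysis-ge, so anyone copying it verbatim would probably want to append something like analysis-(de|ge). The log in the original post already lists both analysis plugins as registered, so the poster's own configuration evidently differs from the sample.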
Re: Multilanguage support in Nutch 1.0
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

> hi
> try to activate the language-identifier plugin. You must add it in the
> nutch-site.xml file, in the <name>plugin.includes</name> section.

Oops. It IS activated:

2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)

But fetched pages are not passed to it, I reckon.
RE: Multilanguage support in Nutch 1.0
hi,

Do you have some 'lang' metadata on the pages? Because the plugin first tries to get the language from the metadata. You can see this in the plugin's Java source, LanguageIndexingFilter.java:

    // check if LANGUAGE found, possibly put there by HTMLLanguageParser
    String lang = parse.getData().getParseMeta().get(Metadata.LANGUAGE);

    // check if HTTP-header tells us the language
    if (lang == null) {
        lang = parse.getData().getContentMeta().get(Response.CONTENT_LANGUAGE);
    }

Also try using Luke to check all the metadata in your index.

From: da...@jashi.ge
Date: Wed, 30 Sep 2009 17:22:26 +0400
Subject: Re: Multilanguage support in Nutch 1.0
To: nutch-user@lucene.apache.org

On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

> hi
> try to activate the language-identifier plugin. You must add it in the
> nutch-site.xml file, in the <name>plugin.includes</name> section.

Oops. It IS activated:

2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)

But fetched pages are not passed to it, I reckon.
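The lookup order in that snippet (parse metadata first, then the HTTP response header) can be tried out in isolation. What follows is just a rough sketch with plain maps standing in for Nutch's metadata objects. The class LanguageLookupSketch, the method resolveLanguage and both key constants are hypothetical stand-ins for Metadata.LANGUAGE and Response.CONTENT_LANGUAGE, and it covers only the two steps shown in the snippet.

```java
import java.util.HashMap;
import java.util.Map;

public class LanguageLookupSketch {
    // Hypothetical keys standing in for Metadata.LANGUAGE and
    // Response.CONTENT_LANGUAGE; the real constant values may differ.
    static final String PARSE_LANGUAGE_KEY = "language";
    static final String CONTENT_LANGUAGE_KEY = "Content-Language";

    // Prefer the language recorded at parse time, then fall back to the
    // HTTP header. Returns null when neither source names a language.
    static String resolveLanguage(Map<String, String> parseMeta,
                                  Map<String, String> contentMeta) {
        String lang = parseMeta.get(PARSE_LANGUAGE_KEY);
        if (lang == null) {
            lang = contentMeta.get(CONTENT_LANGUAGE_KEY);
        }
        return lang;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        Map<String, String> contentMeta = new HashMap<>();
        contentMeta.put(CONTENT_LANGUAGE_KEY, "de");

        // Only the HTTP header is set, so the header value is used.
        System.out.println(resolveLanguage(parseMeta, contentMeta));

        // Once the parse metadata carries a language, it takes precedence.
        parseMeta.put(PARSE_LANGUAGE_KEY, "ka");
        System.out.println(resolveLanguage(parseMeta, contentMeta));
    }
}
```

If neither the page nor the server declares a language, this sketch returns null, which is the case worth checking for on the two test pages.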
RE: Multilanguage support in Nutch 1.0
hi

try to activate the language-identifier plugin. You must add it in the nutch-site.xml file, in the <name>plugin.includes</name> section.

It's something like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

From: da...@jashi.ge
Date: Tue, 29 Sep 2009 18:59:52 +0400
Subject: Multilanguage support in Nutch 1.0
To: nutch-user@lucene.apache.org

Hello, all.

I've got a bit of trouble with Nutch 1.0 and multilanguage support:

I have a fresh install of Nutch and two analysis plugins I'd like to turn on:
analysis-de (German) and analysis-ge (Georgian)

Here are the innards of my seed file:
---
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
---

The first is Georgian, the other German.

When I run

bin/nutch crawl seed -dir crawl -threads 10 -depth 2

there is not the slightest sign of anyone calling any analysis plug-ins,
even though hadoop.log clearly states that they are on and active:

---
2009-09-29 16:39:13,328 INFO crawl.Crawl - crawl started in: crawl
2009-09-29 16:39:13,328 INFO crawl.Crawl - rootUrlDir = seed
2009-09-29 16:39:13,328 INFO crawl.Crawl - threads = 10
2009-09-29 16:39:13,328 INFO crawl.Crawl - depth = 2
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: starting
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: urlDir: seed
2009-09-29 16:39:13,390 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-09-29 16:39:13,421 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-09-29 16:39:15,546 INFO plugin.PluginRepository - Plugins: looking in: C:\cygwin\opt\nutch\plugins
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Registered Plugins:
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Lucene Analysers (lib-lucene-analyzers)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
!
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Georgian Analysis Plug-in (analysis-ge)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - German Analysis Plug-in (analysis-de)
!
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Site Query Filter (query-site)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - More Query Filter (query-more)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2009-09-29
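To check which plugin ids a run actually registered without eyeballing hadoop.log, the ids in parentheses can be pulled out of the PluginRepository lines. This is a small sketch that assumes the log lines look exactly like the ones pasted above; the class RegisteredPluginIds and its pluginIds method are hypothetical helpers, not part of Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegisteredPluginIds {
    // Matches PluginRepository lines that end in "(plugin-id)", as in the log
    // above. Lines without a trailing id, such as "Registered Plugins:", are skipped.
    private static final Pattern REGISTERED =
        Pattern.compile("plugin\\.PluginRepository - .+ \\(([^()]+)\\)\\s*$");

    // Collects the plugin ids, in log order, from the given log lines.
    static List<String> pluginIds(List<String> logLines) {
        List<String> ids = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = REGISTERED.matcher(line);
            if (m.find()) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Registered Plugins:",
            "2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Georgian Analysis Plug-in (analysis-ge)",
            "2009-09-29 16:39:15,671 INFO plugin.PluginRepository - German Analysis Plug-in (analysis-de)");
        System.out.println(pluginIds(lines));
    }
}
```

Note that a plugin showing up here only means it was registered; it says nothing about whether any fetched page was ever handed to it, which is the open question in this thread.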