Hi Radim,

Please see the final line of the log output:

11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin:
org.apache.nutch.parse.html.HtmlParser mapped to contentType
application/xhtml+xml via parse-plugins.xml, but not enabled via
plugin.includes in nutch-default.xml

Please try adding parse-html to plugin.includes and re-running
indexchecker.
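
As a rough sketch, the override in conf/nutch-site.xml might look
something like the following. The value shown is only an assumption on my
part, built from the plugins registered in your log with parse-html added;
your actual plugin.includes may differ, so extend your existing value
rather than copying this one.

```xml
<!-- conf/nutch-site.xml: overrides the default from nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <!-- Illustrative value only: the plugins listed in the log, plus parse-html.
       Keep whatever your current value contains and add parse-html to it. -->
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|urlnormalizer-(basic|regex)</value>
</property>
```

The value is a regular expression matched against plugin ids, which is why
parse-(html|tika) enables both parsers.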


On Fri, Oct 14, 2011 at 5:18 AM, Radim Kolar <[email protected]> wrote:

>
> Hi,
>
> This is most likely a URL filter issue. Check all URL filters. There's
> also a test program for URL filtering. Try it out.
>
> This is the indexchecker output for one URL. Is this URL filtered or not?
> I don't know how to interpret the output.
>
> ponto:(crawler)runtime/deploy>bin/nutch indexchecker http://www.root.cz
>
> 11/10/14 06:01:00 INFO indexer.IndexingFiltersChecker: fetching:
> http://www.root.cz
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugins: looking in:
> /tmp/hadoop-crawler/hadoop-unjar3406850446948112163/plugins
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Plugins:
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         the nutch core
> extension points (nutch-extensionpoints)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL
> Normalizer (urlnormalizer-regex)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Basic URL
> Normalizer (urlnormalizer-basic)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Tika Parser Plug-in
> (parse-tika)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Domain URL Filter
> (urlfilter-domain)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         HTTP Framework
> (lib-http)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL Filter
> (urlfilter-regex)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL Filter
> Framework (lib-regex-filter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Http Protocol
> Plug-in (protocol-http)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered
> Extension-Points:
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Segment Merge
> Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 11/10/14 06:01:00 INFO http.Http: http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: parsing:
> http://www.root.cz
> 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: contentType:
> application/xhtml+xml
> 11/10/14 06:01:02 INFO conf.Configuration: found resource parse-plugins.xml
> at file:/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/parse-plugins.xml
> 11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType
> application/xhtml+xml via parse-plugins.xml, but not enabled via
> plugin.includes in nutch-default.xml
>
>


-- 
*Lewis*
