Hi Radim,

Please see the final log output:

11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml

Please try adding parse-html to plugin.includes and re-running the indexchecker.

On Fri, Oct 14, 2011 at 5:18 AM, Radim Kolar <[email protected]> wrote:
>
> Hi,
>
> This is most likely an URL filter issue. Check all URL filters. There's
> also a test program for URL filtering. Try it out.
>
> This is indexchecker output for one URL. Is this URL filtered or not? I
> don't know how to interpret output
>
> ponto:(crawler)runtime/deploy>bin/nutch indexchecker http://www.root.cz
>
> 11/10/14 06:01:00 INFO indexer.IndexingFiltersChecker: fetching: http://www.root.cz
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-crawler/hadoop-unjar3406850446948112163/plugins
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Plugins:
> 11/10/14 06:01:00 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Domain URL Filter (urlfilter-domain)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: HTTP Framework (lib-http)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Extension-Points:
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 11/10/14 06:01:00 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: parsing: http://www.root.cz
> 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: contentType: application/xhtml+xml
> 11/10/14 06:01:02 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/parse-plugins.xml
> 11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
>

--
*Lewis*
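
For reference, a minimal sketch of enabling parse-html, assuming the override is placed in conf/nutch-site.xml. The plugin list in the value is an assumption: it simply mirrors the plugins registered in the log above with parse-html added, so keep whatever else your own plugin.includes already lists.

```xml
<!-- conf/nutch-site.xml: overrides plugin.includes from nutch-default.xml -->
<!-- Assumed value: the plugins seen in the log, plus parse-html -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|urlnormalizer-(regex|basic)</value>
</property>
```

Since the command is run from runtime/deploy, the configuration is probably read from the packaged job file, so the job may need to be rebuilt after editing before the change takes effect.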

