Index and parse filter checkers do not use URL filtering.
> Hi Radim,
>
> Please see the final log output
>
> 11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
>
> Please try adding parse-html and re-running the indexerchecker
>
> On Fri, Oct 14, 2011 at 5:18 AM, Radim Kolar <[email protected]> wrote:
>
> > Hi,
> >
> > This is most likely an URL filter issue. Check all URL filters. There's also a
> > test program for URL filtering. Try it out.
> >
> > This is indexchecker output for one URL. Is this URL filtered or not? I
> > don't know how to interpret output
> >
> > ponto:(crawler)runtime/deploy>bin/nutch indexchecker http://www.root.cz
> >
> > 11/10/14 06:01:00 INFO indexer.IndexingFiltersChecker: fetching: http://www.root.cz
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-crawler/hadoop-unjar3406850446948112163/plugins
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Plugins:
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Domain URL Filter (urlfilter-domain)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: HTTP Framework (lib-http)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Extension-Points:
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> > 11/10/14 06:01:00 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> > 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: parsing: http://www.root.cz
> > 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: contentType: application/xhtml+xml
> > 11/10/14 06:01:02 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/parse-plugins.xml
> > 11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
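
For reference, the ParserFactory warning in the quoted log says that HtmlParser is mapped to application/xhtml+xml but is not enabled in plugin.includes. Enabling parse-html might look roughly like the fragment below. This is only a sketch: the plugin list is an assumption pieced together from the plugins the log shows as registered, with parse-html added, and the poster's real plugin.includes value is not shown in the thread. The property would normally go inside the <configuration> element of conf/nutch-site.xml, which overrides nutch-default.xml.

```xml
<!-- conf/nutch-site.xml (sketch only).
     The value is a regular expression matched against plugin ids.
     ASSUMPTION: this list is reconstructed from the "Registered Plugins"
     lines in the log, plus parse-html; adjust it to your real setup. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|urlnormalizer-(regex|basic)</value>
</property>
```

If the change is picked up, the "Registered Plugins" section of the indexchecker output should list the parse-html plugin next to parse-tika.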

