Hi,

This is most likely a URL filter issue. Check all of your URL filters. There's
also a test program for URL filtering. Try it out.
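
To make "check all URL filters" a bit more concrete, here is a rough sketch of how a regex-style URL filter file is usually evaluated: each rule is a `+` or `-` followed by a regular expression, the first rule that matches decides, and a URL that matches nothing is rejected. The rules below are hypothetical examples in the style of regex-urlfilter.txt, not the actual configuration from this thread, and the function names are made up for illustration. Other active filters (the log in this thread lists urlfilter-domain as well as urlfilter-regex) are not modelled here.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt.
# They are illustrative only, not the configuration used in this thread.
RULES = """
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# skip URLs containing characters that usually indicate a query
-[?*!@=]
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return True if the first matching rule is a '+' rule."""
    for accept, pattern in rules:
        if pattern.search(url):  # match anywhere in the URL
            return accept
    return False  # no rule matched: treat the URL as filtered out


rules = parse_rules(RULES)
for url in ("http://www.root.cz",
            "http://www.root.cz/?page=2",
            "ftp://www.root.cz/file.txt"):
    print(("+" if accepts(url, rules) else "-") + url)
```

With these example rules only the first URL is accepted. Running your own rule file through a sketch like this, or through the real test program, shows which rule rejects a given URL.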

This is the indexchecker output for one URL. Is this URL filtered or not? I
don't know how to interpret the output.

ponto:(crawler)runtime/deploy>bin/nutch indexchecker http://www.root.cz

11/10/14 06:01:00 INFO indexer.IndexingFiltersChecker: fetching: http://www.root.cz
11/10/14 06:01:00 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-crawler/hadoop-unjar3406850446948112163/plugins
11/10/14 06:01:00 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Plugins:
11/10/14 06:01:00 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
11/10/14 06:01:00 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
11/10/14 06:01:00 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
11/10/14 06:01:00 INFO plugin.PluginRepository: Domain URL Filter (urlfilter-domain)
11/10/14 06:01:00 INFO plugin.PluginRepository: HTTP Framework (lib-http)
11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Extension-Points:
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
11/10/14 06:01:00 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: parsing: http://www.root.cz
11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: contentType: application/xhtml+xml
11/10/14 06:01:02 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/parse-plugins.xml
11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
