Index and parse filter checkers do not use URL filtering.

> Hi Radim,
> 
> Please see the final line of the log output:
> 
> 11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType
> application/xhtml+xml via parse-plugins.xml, but not enabled via
> plugin.includes in nutch-default.xml
> 
> Please try adding parse-html to plugin.includes and re-running indexchecker.
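> 
> As a rough sketch of that change: the override would normally go in
> conf/nutch-site.xml rather than in nutch-default.xml itself. The value
> below is an assumption, reconstructed from the plugins your log lists as
> registered, with parse-html added. Keep whatever else your current
> plugin.includes already contains.
> 
> ```xml
> <property>
>   <name>plugin.includes</name>
>   <!-- Assumed value: the plugins from the "Registered Plugins" log lines,
>        plus parse-html. Adjust to match your existing setting. -->
>   <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|urlnormalizer-(regex|basic)</value>
> </property>
> ```
> 
> If parse-html is picked up, it should show up under "Registered Plugins"
> and the ParserFactory warning for application/xhtml+xml should no longer
> appear.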
> 
> On Fri, Oct 14, 2011 at 5:18 AM, Radim Kolar <[email protected]> wrote:
> > Hi,
> > 
> > This is most likely a URL filter issue. Check all URL filters. There's
> > also a test program for URL filtering. Try it out.
> > 
> > This is the indexchecker output for one URL. Is this URL filtered or not?
> > I don't know how to interpret the output.
> > 
> > ponto:(crawler)runtime/deploy>bin/nutch indexchecker http://www.root.cz
> > 
> > 11/10/14 06:01:00 INFO indexer.IndexingFiltersChecker: fetching:
> > http://www.root.cz
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugins: looking in:
> > /tmp/hadoop-crawler/hadoop-unjar3406850446948112163/plugins
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugin Auto-activation
> > mode: [true]
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Plugins:
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         the nutch core
> > extension points (nutch-extensionpoints)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL
> > Normalizer (urlnormalizer-regex)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Basic URL
> > Normalizer (urlnormalizer-basic)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Tika Parser
> > Plug-in (parse-tika)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Domain URL Filter
> > (urlfilter-domain)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         HTTP Framework
> > (lib-http)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL Filter
> > (urlfilter-regex)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL Filter
> > Framework (lib-regex-filter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Http Protocol
> > Plug-in (protocol-http)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered
> > Extension-Points:
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch URL
> > Normalizer (org.apache.nutch.net.URLNormalizer)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Protocol
> > (org.apache.nutch.protocol.Protocol)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Segment
> > Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch URL Filter
> > (org.apache.nutch.net.URLFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Indexing
> > Filter (org.apache.nutch.indexer.IndexingFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         HTML Parse Filter
> > (org.apache.nutch.parse.HtmlParseFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Content
> > Parser (org.apache.nutch.parse.Parser)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Scoring
> > (org.apache.nutch.scoring.ScoringFilter)
> > 11/10/14 06:01:00 INFO http.Http: http.accept.language =
> > en-us,en-gb,en;q=0.7,*;q=0.3
> > 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: parsing:
> > http://www.root.cz
> > 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: contentType:
> > application/xhtml+xml
> > 11/10/14 06:01:02 INFO conf.Configuration: found resource
> > parse-plugins.xml at
> > file:/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/parse-plugins.xml
> > 11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin:
> > org.apache.nutch.parse.html.HtmlParser mapped to contentType
> > application/xhtml+xml via parse-plugins.xml, but not enabled via
> > plugin.includes in nutch-default.xml
