Hello Markus,


> > > I cannot confirm this when parsing a local 404 page. What do you
> > > get when fetching that page with:
> > > bin/nutch org.apache.nutch.parse.ParserChecker
> > I get an error:
> > 
> > $ time bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
> > Exception in thread "main" java.lang.NullPointerException
> >     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)

> Strange! Can you confirm the parse checker with other 404 pages on
> the internet?
> 
> bin/nutch org.apache.nutch.parse.ParserChecker http://nutch.apache.org/404

This does work for me:
------------------
$ bin/nutch org.apache.nutch.parse.ParserChecker http://nutch.apache.org/404
---------
Url
---------------
http://nutch.apache.org/404
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: 404 Not Found
Outlinks: 0
Content Metadata: Date=Mon, 01 Aug 2011 11:29:46 GMT Content-Length=309 Content-Type=text/html; charset=iso-8859-1 Connection=close Server=Apache/2.3.8 (Unix) mod_ssl/2.3.8 OpenSSL/1.0.0c
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
------------------


> Perhaps your wiki returns some funny data that protocol plugin
> doesn't understand. What do you use? Protocol-http or
> protocol-httpclient?

I use the standard settings, apart from three custom ones in
conf/nutch-site.xml:
> http.agent.name, fetcher.server.delay and fetcher.threads.per.host

If I understand it correctly, conf/nutch-default.xml contains
>  <name>plugin.includes</name>
>  <value>protocol-http|urlfilter-regex|parse-(html|tika)
> |index-(basic|anchor)|scoring-opic
> |urlnormalizer-(pass|regex|basic)</value>
so it's "protocol-http".
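
To double-check which value is actually in effect, a rough sketch like the
following could help. It relies on the fact that conf/nutch-site.xml
overrides conf/nutch-default.xml. The helper names nutch_prop and
effective_prop are made up for illustration and are not part of Nutch, and
the sed-based matching is only an approximation of real XML parsing (it
assumes <value> follows directly after <name>, with nothing but whitespace
in between).

```shell
# Hypothetical helper (not part of Nutch): print the <value> that belongs
# to the given property <name> in a Hadoop-style configuration file.
nutch_prop() {
  file=$1
  name=$2
  # A missing file simply yields no output.
  [ -f "$file" ] || return 0
  # Join all lines first so that a <value> wrapped over several lines is
  # still matched. Dots in the name act as regex wildcards, which is good
  # enough for a quick check.
  tr -d '\n' < "$file" |
    sed -n "s:.*<name>$name</name>[^<]*<value>\([^<]*\)</value>.*:\1:p"
}

# Hypothetical helper: the setting from nutch-site.xml wins; otherwise fall
# back to nutch-default.xml.
effective_prop() {
  conf_dir=$1
  name=$2
  value=$(nutch_prop "$conf_dir/nutch-site.xml" "$name")
  [ -n "$value" ] || value=$(nutch_prop "$conf_dir/nutch-default.xml" "$name")
  printf '%s\n' "$value"
}
```

After sourcing these functions, running "effective_prop conf plugin.includes"
from the Nutch directory should print the plugin list that applies. In my
setup I would expect it to start with "protocol-http", since nutch-site.xml
does not set plugin.includes.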


-- 
Best regards
Christian Weiske
