I forgot about the parsechecker and indexchecker command line options.

When I run parsechecker with the default Nutch and the standard job
file, it works.

14/08/13 11:35:28 INFO http.Http: http.proxy.host = null
14/08/13 11:35:28 INFO http.Http: http.proxy.port = 8080
14/08/13 11:35:28 INFO http.Http: http.timeout = 10000
14/08/13 11:35:28 INFO http.Http: http.content.limit = 65536
14/08/13 11:35:28 INFO http.Http: http.agent = tralala/Nutch-1.5.1 (Lucene Random House Crawler; http://www.randomhouse.com/; [email protected])
14/08/13 11:35:28 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
14/08/13 11:35:28 INFO http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
14/08/13 11:35:28 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-nutch/hadoop-unjar7029442108299209520/parse-plugins.xml
14/08/13 11:35:29 INFO crawl.SignatureFactory: Using Signature impl: org.apache.nutch.crawl.MD5Signature
14/08/13 11:35:29 INFO parse.ParserChecker: parsing: http://www.my-ebenefits.com/PenguinRandomHouse/
14/08/13 11:35:29 INFO parse.ParserChecker: contentType: application/xhtml+xml
14/08/13 11:35:29 INFO parse.ParserChecker: signature: 6ac298a128080fcb51e4c3efa1c040df
---------
Url
---------------
http://www.my-ebenefits.com/PenguinRandomHouse/
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Penguin Random House


When I run it with the job file the developer built, it gives me this:


14/08/13 11:35:50 INFO httpclient.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
14/08/13 11:35:50 INFO conf.Configuration: found resource httpclient-auth.xml at file:/tmp/hadoop-nutch/hadoop-unjar8361088391392178235/httpclient-auth.xml
14/08/13 11:35:50 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-nutch/hadoop-unjar8361088391392178235/parse-plugins.xml
HtmlParser setConf - read rules now
in parseNeko now
14/08/13 11:35:51 ERROR parse.html: Error: java.lang.NullPointerException
    at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
    at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)


So it is something in the configuration. Does the default job file use
Neko or TagSoup? I assume Neko, since that is what is set in
nutch-default.xml. How do I tell which rules have been changed?
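
One way to see which settings differ between the two job files is to
compare their configuration properties directly. The following is only a
rough sketch in Python, not a Nutch tool. It assumes that a job file is a
zip archive with nutch-default.xml and nutch-site.xml at its top level
(the hadoop-unjar paths in the logs above suggest the configuration
resources sit there) and that these files use the usual Hadoop
<configuration>/<property> layout. The function names and the job file
names in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile


def load_properties(xml_data):
    """Return {name: value} from a Hadoop-style <configuration> document.

    Accepts str or bytes. Properties without a <name> element are skipped.
    """
    root = ET.fromstring(xml_data)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is None:
            continue
        props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(old, new):
    """Return sorted (name, old_value, new_value) tuples for differing properties.

    A value of None means the property is absent on that side.
    """
    changes = []
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes.append((name, old.get(name), new.get(name)))
    return changes


def load_from_job(job_path, member="nutch-site.xml"):
    """Read one configuration file out of a job archive (assumed to be a zip)."""
    with zipfile.ZipFile(job_path) as job:
        return load_properties(job.read(member))
```

For example, diff_properties(load_from_job("standard.job"),
load_from_job("custom.job")) would list every property in nutch-site.xml
that was added, removed, or changed; passing member="nutch-default.xml"
does the same for the defaults. Properties such as parser.html.impl and
plugin.includes are the first ones worth checking here.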

Thanks,
Steve


On Wed, Aug 13, 2014 at 4:16 AM, Julien Nioche <[email protected]> wrote:

> Hi Steve,
>
> I tried with Nutch 1.9 RC1 and am not getting this exception.
> =>  ./nutch parsechecker -D http.agent.name=tralala http://www.my-ebenefits.com/PenguinRandomHouse/
>
> Probably something that we fixed since 1.5.1 which is rather outdated. Why
> don't you give 1.9 a try instead?
>
> Julien
>
>
>
> On 12 August 2014 20:34, Steve Cohen <[email protected]> wrote:
>
> > Hello,
> >
> > I have been running Nutch 1.5.1 without a problem, but I have run across
> > a couple of web pages that give me a null pointer exception when I try
> > to crawl them.
> >
> > 2014-08-12 14:01:21,844 ERROR org.apache.nutch.parse.html: Error: java.lang.NullPointerException
> >         at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
> >         at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
> >         at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
> >         at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
> >         at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
> >         at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
> >         at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
> >         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> >         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> >         at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:347)
> >         at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:244)
> >         at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:160)
> >         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >         at java.lang.Thread.run(Thread.java:662)
> > 2014-08-12 14:01:21,844 WARN org.apache.nutch.parse.ParseSegment: Error parsing: http://www.my-ebenefits.com/PenguinRandomHouse/: failed(2,200): java.lang.NullPointerException
> >
> >
> > What information do I need to provide for you to help me debug the issue?
> >
> > Thanks,
> > Steve
> >
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
