Hi Steve, does the job file contain the original parse-html from Nutch 1.5.1? I cannot match the stack trace against http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=markup (nor against the current trunk / 1.9); e.g. parseNeko() should be at lines 228-266, but the trace shows:
    at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:347)
    at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:244)
    at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:160)

Sebastian

On 08/13/2014 05:43 PM, Steve Cohen wrote:
> I forgot about the parsechecker and indexchecker command line options.
>
> When I run parsechecker with the default nutch with the standard job
> file it works.
>
> 14/08/13 11:35:28 INFO http.Http: http.proxy.host = null
> 14/08/13 11:35:28 INFO http.Http: http.proxy.port = 8080
> 14/08/13 11:35:28 INFO http.Http: http.timeout = 10000
> 14/08/13 11:35:28 INFO http.Http: http.content.limit = 65536
> 14/08/13 11:35:28 INFO http.Http: http.agent = tralala/Nutch-1.5.1 (Lucene
> Random House Crawler; http://www.randomhouse.com/; [email protected])
> 14/08/13 11:35:28 INFO http.Http: http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 14/08/13 11:35:28 INFO http.Http: http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 14/08/13 11:35:28 INFO conf.Configuration: found resource parse-plugins.xml
> at file:/tmp/hadoop-nutch/hadoop-unjar7029442108299209520/parse-plugins.xml
> 14/08/13 11:35:29 INFO crawl.SignatureFactory: Using Signature impl:
> org.apache.nutch.crawl.MD5Signature
> 14/08/13 11:35:29 INFO parse.ParserChecker: parsing:
> http://www.my-ebenefits.com/PenguinRandomHouse/
> 14/08/13 11:35:29 INFO parse.ParserChecker: contentType:
> application/xhtml+xml
> 14/08/13 11:35:29 INFO parse.ParserChecker: signature:
> 6ac298a128080fcb51e4c3efa1c040df
> ---------
> Url
> ---------------
> http://www.my-ebenefits.com/PenguinRandomHouse/
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Penguin Random House
>
> When I run it with the job file the dev built, it gives me this:
>
> 14/08/13 11:35:50 INFO httpclient.Http: http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 14/08/13 11:35:50 INFO conf.Configuration: found resource
> httpclient-auth.xml at
> file:/tmp/hadoop-nutch/hadoop-unjar8361088391392178235/httpclient-auth.xml
> 14/08/13 11:35:50 INFO conf.Configuration: found resource parse-plugins.xml
> at file:/tmp/hadoop-nutch/hadoop-unjar8361088391392178235/parse-plugins.xml
> HtmlParser setConf - read rules now
> in parseNeko now
> 14/08/13 11:35:51 ERROR parse.html: Error:
> java.lang.NullPointerException
>     at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
>     at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
>
> So it is something with the configuration. Does the default job file use
> Neko or TagSoup? I assume Neko, since that is what is in nutch-default.xml.
> How do I tell what rules have been changed?
>
> Thanks,
> Steve
>
> On Wed, Aug 13, 2014 at 4:16 AM, Julien Nioche <[email protected]> wrote:
>
>> Hi Steve,
>>
>> I tried with Nutch 1.9 RC1 and am not getting this exception.
>> => ./nutch parsechecker -D http.agent.name=tralala
>> http://www.my-ebenefits.com/PenguinRandomHouse/
>>
>> Probably something that we fixed since 1.5.1, which is rather outdated. Why
>> don't you give 1.9 a try instead?
>>
>> Julien
>>
>> On 12 August 2014 20:34, Steve Cohen <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I have been running nutch 1.5.1 without a problem, but I have run across a
>>> couple of web pages that are giving me a null pointer exception when I try
>>> to crawl them.
>>>
>>> 2014-08-12 14:01:21,844 ERROR org.apache.nutch.parse.html: Error:
>>> java.lang.NullPointerException
>>>     at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
>>>     at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
>>>     at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
>>>     at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
>>>     at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
>>>     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
>>>     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
>>>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>>>     at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>>>     at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:347)
>>>     at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:244)
>>>     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:160)
>>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>     at java.lang.Thread.run(Thread.java:662)
>>> 2014-08-12 14:01:21,844 WARN org.apache.nutch.parse.ParseSegment: Error
>>> parsing: http://www.my-ebenefits.com/PenguinRandomHouse/: failed(2,200):
>>> java.lang.NullPointerException
>>>
>>> What information do I need to provide for you to help me debug the issue?
>>>
>>> Thanks,
>>> Steve
>>
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
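[Editor's note] Sebastian's line-number mismatch above can be checked directly: pull the compiled HtmlParser class out of the job file and dump its line-number table with javap, then compare against the 1.5.1 tag. A minimal sketch, assuming a standard job-file layout; the job-file name and the plugin jar path inside it are assumptions and may differ in your build:

```shell
# Sketch: extract HtmlParser.class from the parse-html plugin jar inside
# the job file and dump its LineNumberTable. If parseNeko() maps to lines
# far outside 228-266 (e.g. around 347, as in the trace above), the jar
# was built from modified sources rather than stock 1.5.1.
mkdir -p /tmp/jobcheck && cd /tmp/jobcheck
unzip -o ~/apache-nutch-1.5.1.job 'plugins/parse-html/parse-html.jar'
unzip -o plugins/parse-html/parse-html.jar \
  'org/apache/nutch/parse/html/HtmlParser.class'
javap -l org/apache/nutch/parse/html/HtmlParser.class | grep -A 6 parseNeko
```

This only works if the jar was compiled with debug info (the default for Nutch's Ant build); without it, javap prints no LineNumberTable.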
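[Editor's note] On Steve's Neko-vs-TagSoup question: the HTML parser implementation is selected by the `parser.html.impl` property in nutch-default.xml (`neko` by default, `tagsoup` as the alternative). A quick way to check whether the NPE is confined to the Neko code path is to override that property on the parsechecker command line, as a sketch:

```shell
# Re-run parsechecker with TagSoup instead of the default NekoHTML.
# If this parses cleanly while the default run throws the
# NullPointerException, the problem lies in the Neko code path
# (or in a patched Neko build inside the custom job file).
./nutch parsechecker -D http.agent.name=tralala \
  -D parser.html.impl=tagsoup \
  http://www.my-ebenefits.com/PenguinRandomHouse/
```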

