Hi Steve,

does the job file contain the original parse-html from Nutch 1.5.1?
I cannot match the stack trace line numbers against
 
http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=markup
(nor with the current trunk / 1.9), e.g. parseNeko() should be lines 228-266:

at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:347)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:244)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:160)

Sebastian

On 08/13/2014 05:43 PM, Steve Cohen wrote:
> I forgot about the parsechecker and indexchecker command line options.
> 
> When I run parsechecker with the default Nutch and the standard job
> file, it works.
> 
> 14/08/13 11:35:28 INFO http.Http: http.proxy.host = null
> 14/08/13 11:35:28 INFO http.Http: http.proxy.port = 8080
> 14/08/13 11:35:28 INFO http.Http: http.timeout = 10000
> 14/08/13 11:35:28 INFO http.Http: http.content.limit = 65536
> 14/08/13 11:35:28 INFO http.Http: http.agent = tralala/Nutch-1.5.1 (Lucene
> Random House Crawler; http://www.randomhouse.com/; [email protected]
> )
> 14/08/13 11:35:28 INFO http.Http: http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 14/08/13 11:35:28 INFO http.Http: http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 14/08/13 11:35:28 INFO conf.Configuration: found resource parse-plugins.xml
> at file:/tmp/hadoop-nutch/hadoop-unjar7029442108299209520/parse-plugins.xml
> 14/08/13 11:35:29 INFO crawl.SignatureFactory: Using Signature impl:
> org.apache.nutch.crawl.MD5Signature
> 14/08/13 11:35:29 INFO parse.ParserChecker: parsing:
> http://www.my-ebenefits.com/PenguinRandomHouse/
> 14/08/13 11:35:29 INFO parse.ParserChecker: contentType:
> application/xhtml+xml
> 14/08/13 11:35:29 INFO parse.ParserChecker: signature:
> 6ac298a128080fcb51e4c3efa1c040df
> ---------
> Url
> ---------------
> http://www.my-ebenefits.com/PenguinRandomHouse/
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Penguin Random House
> 
> 
> When I run it with the job file the dev built, it gives me this.
> 
> 
> 14/08/13 11:35:50 INFO httpclient.Http: http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 14/08/13 11:35:50 INFO conf.Configuration: found resource
> httpclient-auth.xml at
> file:/tmp/hadoop-nutch/hadoop-unjar8361088391392178235/httpclient-auth.xml
> 14/08/13 11:35:50 INFO conf.Configuration: found resource parse-plugins.xml
> at file:/tmp/hadoop-nutch/hadoop-unjar8361088391392178235/parse-plugins.xml
> HtmlParser setConf - read rules now
> in parseNeko now
> 14/08/13 11:35:51 ERROR parse.html: Error:
> java.lang.NullPointerException
>     at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown
> Source)
>     at
> org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
> 
> 
> So it is something with the configuration. Does the default job file use
> Neko or TagSoup? I assume Neko, since that is what is in nutch-default.xml.
> How do I tell what rules have been changed?
> 
> Thanks,
> Steve
> 
> 
> On Wed, Aug 13, 2014 at 4:16 AM, Julien Nioche <
> [email protected]> wrote:
> 
>> Hi Steve,
>>
>> I tried with Nutch 1.9 RC1 and am not getting this exception.
>> =>  ./nutch parsechecker -D http.agent.name=tralala
>> http://www.my-ebenefits.com/PenguinRandomHouse/
>>
>> Probably something that we fixed since 1.5.1, which is rather outdated. Why
>> don't you give 1.9 a try instead?
>>
>> Julien
>>
>>
>>
>> On 12 August 2014 20:34, Steve Cohen <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I have been running nutch 1.5.1 without a problem but I have run across a
>>> couple web pages that are giving me a null pointer exception when I try
>> to
>>> crawl them.
>>>
>>> 2014-08-12 14:01:21,844 ERROR org.apache.nutch.parse.html: Error:
>>> java.lang.NullPointerException
>>>         at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown
>>> Source)
>>>         at
>>>
>> org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
>>>         at
>>>
>>>
>> org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
>>>         at
>>> org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
>>>         at
>>> org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
>>>         at
>>> org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
>>>         at
>>> org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
>>>         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>>>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>>>         at
>>> org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:347)
>>>         at
>>> org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:244)
>>>         at
>>> org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:160)
>>>         at
>> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>         at
>> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>         at
>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>         at java.lang.Thread.run(Thread.java:662)
>>> 2014-08-12 14:01:21,844 WARN org.apache.nutch.parse.ParseSegment: Error
>>> parsing: http://www.my-ebenefits.com/PenguinRandomHouse/: failed(2,200):
>>> java.lang.NullPointerException
>>>
>>>
>>> What information do I need to provide for you to help me debug the issue?
>>>
>>> Thanks,
>>> Steve
>>>
>>
>>
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
> 
