Re: nutch 1.6, bin/crawl fails on recrawl with java.io.IOException

Urs Hofer Thu, 09 May 2013 23:09:09 -0700

Dear Feng Lu

I'm not sure, but I think the real problem is the last exception:


>>> java.io.IOException: Job failed!
>>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>>>        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
>>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
>>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)

The others might be annoying, but this one actually stops the execution of the 
script…

Best
urs





On 09.05.2013 at 17:14, feng lu <[email protected]> wrote:

> Hi Urs
> 
> Did you use Nutch 1.6?
> 
> <
> 2013-05-09 03:57:56,967 WARN  fetcher.Fetcher - Attempting to finish item
> from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
> >
> 
> This is caused by a call to the FetchItemQueues#finishFetchItem method when
> the current queues cannot find the queueID of the FetchItem. One reason is
> that the queue was deleted when empty queues are reaped in the
> FetchItemQueues#getFetchItem method. The FetchItem is unblocked when its
> fetch finishes, but if an exception is thrown after that, the FetchItem is
> unblocked a second time. By then the queue has already been emptied and
> removed, so this WARN is logged.
> 
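
The double "finish" described above can be illustrated with a toy model. This is only a sketch under assumptions: QueueReapSketch, HostQueue, add, getItem and finishItem are hypothetical names and not the real Nutch classes, and the real fetcher has many threads, whereas this model is single-threaded. It only shows how finishing the same item twice, with a reap of the empty queue in between, leads to the "unknown queue" warning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Toy model of per-host fetch queues; NOT the real Nutch classes. */
public class QueueReapSketch {
    /** Hypothetical stand-in for one queue: pending URLs plus an in-progress count. */
    static class HostQueue {
        final List<String> pending = new ArrayList<>();
        int inProgress = 0;
    }

    final Map<String, HostQueue> queues = new HashMap<>();
    final List<String> warnings = new ArrayList<>();

    /** Adds a URL to the queue with the given id, creating the queue if needed. */
    void add(String queueId, String url) {
        queues.computeIfAbsent(queueId, k -> new HostQueue()).pending.add(url);
    }

    /** Hands out the next URL and reaps queues that are empty and idle. */
    String getItem() {
        Iterator<Map.Entry<String, HostQueue>> it = queues.entrySet().iterator();
        while (it.hasNext()) {
            HostQueue q = it.next().getValue();
            if (q.pending.isEmpty() && q.inProgress == 0) {
                it.remove(); // reap the empty queue
                continue;
            }
            if (!q.pending.isEmpty()) {
                q.inProgress++;
                return q.pending.remove(0);
            }
        }
        return null;
    }

    /** Marks an item as finished; records a warning if its queue is already gone. */
    void finishItem(String queueId) {
        HostQueue q = queues.get(queueId);
        if (q == null) {
            warnings.add("Attempting to finish item from unknown queue: " + queueId);
            return;
        }
        q.inProgress--;
    }

    public static void main(String[] args) {
        QueueReapSketch s = new QueueReapSketch();
        s.add("example.org", "http://example.org/index.php?language=fr");
        s.getItem();                 // item handed to a fetcher thread
        s.finishItem("example.org"); // normal finish after the fetch
        s.getItem();                 // queue is empty and idle, so it is reaped
        s.finishItem("example.org"); // second finish, e.g. from an exception path
        System.out.println(s.warnings);
    }
}
```

In this model the warning itself is harmless: the second finish is simply ignored. That would fit the observation that the WARN lines are annoying but do not stop the script.
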
> <
> 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher
> caught:java.lang.NullPointerException
> >
> 
> There are two places that can log this exception: one is in the
> Fetcher#output method, the other is in the Fetcher#run method. Judging by
> the order of your log lines, it was probably logged by Fetcher#run. I cannot
> find any reason that would cause this NPE. Can you reproduce the exception
> and provide a more detailed log?
> 
> Thanks
> 
> 
> 
> On Thu, May 9, 2013 at 6:45 PM, Tejas Patil <[email protected]> wrote:
> 
>> Hey Urs,
>> Please see the logs/hadoop.log file and share the stack trace of the
>> exception.
>> 
>> 
>> On Thu, May 9, 2013 at 3:36 AM, Urs Hofer <[email protected]> wrote:
>> 
>>> Dear List
>>> 
>>> I'm currently running the nutch crawl script with solr4.
>>> additionally, I'm using the urlmeta plugin and I'm parsing keyword and
>>> description
>>> metadata as well. the crawl script runs in local mode. currently, I'm
>>> seeding about 500 domains.
>>> 
>>> the crawl script runs without problems the first time. on a recrawl, it
>>> dies with the error
>>> 
>>> 2013-05-09 03:57:56,967 WARN  fetcher.Fetcher - Attempting to finish item
>>> from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
>>> 2013-05-09 03:57:56,967 INFO  fetcher.Fetcher - -finishing thread
>>> FetcherThread, activeThreads=27
>>> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread
>>> FetcherThread, activeThreads=1
>>> 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher
>>> caught:java.lang.NullPointerException
>>> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread
>>> FetcherThread, activeThreads=2
>>> 2013-05-09 03:57:56,969 WARN  fetcher.Fetcher - Attempting to finish item
>>> from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@504e3c43
>>> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread
>>> FetcherThread, activeThreads=0
>>> 2013-05-09 03:57:57,942 ERROR fetcher.Fetcher - Fetcher:
>>> java.io.IOException: Job failed!
>>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>>>        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
>>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
>>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
>>> 
>>> 
>>> after the fetcher part.
>>> 
>>> I have a very liberal regex-urlfilter configuration:
>>> 
>>> -^(file|ftp|mailto):
>>> 
>>> 
>>> -\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>> +.
>>> 
>>> But I am restricting the crawl to db.ignore.external.links = true
>>> Can it be because I've removed the line
>>> 
>>> # skip URLs containing certain characters as probable queries, etc.
>>> -.*[?*!@=].*
>>> 
>>> in regex-urlfilter? I got several seed entries like index.php?language=fr
>>> 
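
To see what removing that line changes, here is a minimal sketch of how such a rule list is evaluated, assuming the behaviour described in the default regex-urlfilter file: rules are tried top to bottom, the first pattern found in the URL decides, and a URL that matches no rule is rejected. UrlFilterSketch, Rule, parse and accepts are hypothetical names, not Nutch's own filter class; the suffix rule is abbreviated from the list above and example.org is a placeholder host.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of first-match-wins rule evaluation; NOT Nutch's own filter class. */
public class UrlFilterSketch {
    /** One rule line: a leading '+' (accept) or '-' (reject), then a regex. */
    static class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    /** Parses rule lines, skipping blank lines and '#' comments. */
    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            rules.add(new Rule(trimmed));
        }
        return rules;
    }

    /** The first rule whose pattern is found in the URL decides; no match means reject. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "http://example.org/index.php?language=fr"; // placeholder host
        List<Rule> withoutQueryRule = parse(
            "-^(file|ftp|mailto):",
            "-\\.(gif|jpg|png|css|js)$",   // abbreviated suffix list
            "+.");
        List<Rule> withQueryRule = parse(
            "-^(file|ftp|mailto):",
            "-\\.(gif|jpg|png|css|js)$",
            "# skip URLs containing certain characters as probable queries, etc.",
            "-.*[?*!@=].*",
            "+.");
        System.out.println("without query rule: " + accepts(withoutQueryRule, url));
        System.out.println("with query rule:    " + accepts(withQueryRule, url));
    }
}
```

Under these assumptions, removing the rule only means that URLs with query strings such as index.php?language=fr pass the filter and get fetched. The sketch says nothing about whether such URLs are related to the NullPointerException.
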
>>> Since I'm new to nutch, I don't know where to search and how to continue.
>>> Thanks for any help
>>> Best regards
>>> Urs Hofer
>> 
> 
> 
> 
> -- 
> Don't Grow Old, Grow Up... :-)
