Re: nutch 1.6, bin/crawl fails on recrawl with java.io.IOException

AC Nutch Thu, 09 May 2013 23:31:36 -0700

If I'm not mistaken, 140 threads is way, way on the high side. Unless you
have some massive servers, I can't see them handling that. I can barely get
my servers to handle more than 15ish. Perhaps try decreasing that and see
if that fixes the issue.


Alex


On Thu, May 9, 2013 at 7:17 PM, Urs Hofer <[email protected]> wrote:

> Dear Feng Lu
>
> I'm not sure, but the real problem seems to be the last exception:
>
> >>> java.io.IOException: Job failed!
> >>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
> >>>        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
> >>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
> >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
>
> The others might be annoying, but this one actually stops the execution
> of the script.
>
> Best
> urs
>
>
>
>
>
> On 09.05.2013 at 17:14, feng lu <[email protected]> wrote:
>
> > Hi Urs
> >
> > Did you use Nutch 1.6?
> >
> > <
> > 2013-05-09 03:57:56,967 WARN  fetcher.Fetcher - Attempting to finish item
> > from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
> > >
> >
> > This is caused by a call to the FetchItemQueues#finishFetchItem method
> > when the current queues cannot find the queueID of the FetchItem. One
> > possible reason is that the queue was deleted by the "reap empty queues"
> > step in the FetchItemQueues#getFetchItem method once that queue became
> > empty. The FetchItem is unblocked when its fetch finishes, but if an
> > Exception is thrown after that, the FetchItem is unblocked again. By
> > then the queue has already been emptied and removed, so this WARN is
> > logged.
> >
> > <
> > 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher
> > caught:java.lang.NullPointerException
> > >
> >
> > There are two places that can log this exception: one is in the
> > Fetcher#output method, the other is in the Fetcher#run method. Judging by
> > the order of your log, it was probably logged by Fetcher#run. I could not
> > find any reason that would cause this NPE. Can you reproduce this
> > Exception and provide a more detailed log?
> >
> > Thanks
> >
> >
> >
> > On Thu, May 9, 2013 at 6:45 PM, Tejas Patil <[email protected]> wrote:
> >
> >> Hey Urs,
> >> Please see the logs/hadoop.log file and share the stack trace of the
> >> exception.
> >>
> >>
> >> On Thu, May 9, 2013 at 3:36 AM, Urs Hofer <[email protected]> wrote:
> >>
> >>> Dear List
> >>>
> >>> I'm currently running the Nutch crawl script with Solr 4.
> >>> Additionally, I'm using the urlmeta plugin and I'm parsing keyword and
> >>> description metadata as well. The crawl script runs in local mode.
> >>> Currently, I'm seeding about 500 domains.
> >>>
> >>> The crawl script runs without problems the first time. On a recrawl, it
> >>> dies with the error
> >>>
> >>> 2013-05-09 03:57:56,967 WARN  fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
> >>> 2013-05-09 03:57:56,967 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=27
> >>> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> >>> 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
> >>> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
> >>> 2013-05-09 03:57:56,969 WARN  fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@504e3c43
> >>> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> >>> 2013-05-09 03:57:57,942 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
> >>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
> >>>        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
> >>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
> >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
> >>>
> >>>
> >>> after the fetcher part.
> >>>
> >>> I have a very liberal regex-urlfilter configuration:
> >>>
> >>> -^(file|ftp|mailto):
> >>>
> >>>
> >>> -\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >>> +.
> >>>
> >>> But I am restricting the crawl with db.ignore.external.links = true.
> >>> Could the failure be because I've removed the lines
> >>>
> >>> # skip URLs containing certain characters as probable queries, etc.
> >>> -.*[?*!@=].*
> >>>
> >>> in regex-urlfilter? I have several seed entries like
> >>> index.php?language=fr
> >>>
> >>> Since I'm new to Nutch, I don't know where to look or how to
> >>> continue.
> >>> Thanks for any help
> >>> Best regards
> >>> Urs Hofer
> >>
> >
> >
> >
> > --
> > Don't Grow Old, Grow Up... :-)
>
>
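
To make feng lu's explanation of the "unknown queue" WARN easier to follow, here is a toy model in Python. It is only a sketch of the sequence described in the thread (finish, reap the empty queue, finish again), not Nutch's actual code: the class ToyQueues and its methods add, get_item and finish_item are hypothetical names invented for illustration.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("toy.fetcher")


class ToyQueues:
    """Hypothetical stand-in for per-host fetch queues; NOT Nutch code."""

    def __init__(self):
        # queue id -> {"pending": [urls], "in_progress": set(urls)}
        self.queues = {}

    def add(self, queue_id, url):
        q = self.queues.setdefault(
            queue_id, {"pending": [], "in_progress": set()}
        )
        q["pending"].append(url)

    def get_item(self):
        """Hand out one URL; drop queues with nothing pending or in progress."""
        for queue_id in list(self.queues):
            q = self.queues[queue_id]
            if not q["pending"] and not q["in_progress"]:
                del self.queues[queue_id]  # the "reap empty queues" step
                continue
            if q["pending"]:
                url = q["pending"].pop(0)
                q["in_progress"].add(url)
                return queue_id, url
        return None

    def finish_item(self, queue_id, url):
        """Unblock an item; warn and return False if its queue is gone."""
        q = self.queues.get(queue_id)
        if q is None:
            log.warning("Attempting to finish item from unknown queue: %s", url)
            return False
        q["in_progress"].discard(url)
        return True


def demo():
    """Replay the sequence from the thread and return the three results."""
    queues = ToyQueues()
    queues.add("example.com", "http://example.com/")
    item = queues.get_item()
    first = queues.finish_item(*item)   # normal finish after the fetch
    reaped = queues.get_item()          # queue is now empty, so it is reaped
    second = queues.finish_item(*item)  # second finish, e.g. from an error path
    return first, reaped, second
```

In this model the second finish_item call finds no queue and only logs a warning, which matches the reading that the WARN lines are a symptom rather than the cause of the failed job.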

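On the regex-urlfilter question in the original message: as far as I understand the filter, rules are tried top to bottom, the first pattern found anywhere in the URL decides ('+' accepts, '-' rejects), and a URL that matches no rule is rejected. The Python sketch below imitates that behaviour so the effect of removing the query-character rule can be checked offline. It is an approximation under those assumptions, not the Nutch implementation; load_rules and accepts are made-up helper names, and the suffix rule is shortened from the one in the thread.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL decides; no match rejects."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Rules as posted in the thread (suffix list shortened for readability).
RULES_FROM_THREAD = r"""
-^(file|ftp|mailto):
-\.(gif|jpg|png|css|js)$
+.
"""

# The rule that was removed, as quoted in the thread.
QUERY_RULE = r"-.*[?*!@=].*"

without_query_rule = load_rules(RULES_FROM_THREAD)
# Assumption: the removed rule sat just before the final catch-all '+.' rule.
with_query_rule = (
    without_query_rule[:-1] + load_rules(QUERY_RULE) + without_query_rule[-1:]
)

SEED = "http://example.com/index.php?language=fr"  # example.com is a placeholder
```

With these definitions, accepts(without_query_rule, SEED) is True and accepts(with_query_rule, SEED) is False. So removing the rule only lets seeds such as index.php?language=fr through the filter; the sketch says nothing about whether those URLs are related to the "Job failed!" exception.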