Re: nutch 1.6, bin/crawl fails on recrawl with java.io.IOException

Tejas Patil Thu, 09 May 2013 23:57:25 -0700

Hey Urs,
Please check the logs/hadoop.log file and share all of the stack traces for
the exception.
The stack trace you shared doesn't show the actual problem. It only tells us
that the fetch job failed.
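
If it helps, here is a rough sketch of how the ERROR entries and the stack
trace lines that follow them could be pulled out of logs/hadoop.log. It
assumes the log uses the timestamped layout seen in the lines quoted in this
thread ("YYYY-MM-DD HH:MM:SS,mmm LEVEL ..."). The names error_entries and
ENTRY_START are made up for this example and are not part of Nutch.

```python
import re
from pathlib import Path

# Assumption: each log entry starts with "YYYY-MM-DD HH:MM:SS,mmm LEVEL",
# as in the lines quoted in this thread. Stack trace lines carry no timestamp.
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} +(\w+)")


def error_entries(lines):
    """Return each ERROR entry together with the untimestamped lines after it."""
    entries, current = [], None
    for line in lines:
        line = line.rstrip("\n")
        match = ENTRY_START.match(line)
        if match:
            # A new timestamped entry closes whatever ERROR entry was open.
            if current is not None:
                entries.append("\n".join(current))
            current = [line] if match.group(1) == "ERROR" else None
        elif current is not None:
            # Continuation line (for example "at org.apache...") of an ERROR entry.
            current.append(line)
    if current is not None:
        entries.append("\n".join(current))
    return entries


if __name__ == "__main__":
    log_path = Path("logs/hadoop.log")  # relative to the directory Nutch runs from
    if log_path.exists():
        with log_path.open(errors="replace") as handle:
            for entry in error_entries(handle):
                print(entry, end="\n\n")
```

Run it from the directory that contains logs/ and paste the output into your
reply. Entries logged at WARN or INFO level are skipped on purpose, so widen
the level check if those turn out to matter.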



On Thu, May 9, 2013 at 11:31 PM, AC Nutch <[email protected]> wrote:

> If I'm not mistaken 140 threads is way way way on the high side. Unless you
> have some massive servers, I can't see them handling that. I can barely get
> my servers to handle more than 15ish. Perhaps try decreasing that and see
> if that fixes the issue.
>
> Alex
>
>
> On Thu, May 9, 2013 at 7:17 PM, Urs Hofer <[email protected]> wrote:
>
> > Dear Feng Lu
> >
> > I'm not sure, but the problem is more the last exception:
> >
> > >>> java.io.IOException: Job failed!
> > >>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
> > >>>        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
> > >>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
> > >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
> >
> > The others might be annoying, but this one does stop the execution of the
> > script…
> >
> > Best
> > urs
> >
> >
> >
> >
> >
> > On 09.05.2013 at 17:14, feng lu <[email protected]> wrote:
> >
> > > Hi Urs
> > >
> > > Did you use Nutch 1.6?
> > >
> > > <
> > > 2013-05-09 03:57:56,967 WARN  fetcher.Fetcher - Attempting to finish item
> > > from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
> > >>
> > >
> > > This is caused by a call to the FetchItemQueues#finishFetchItem method
> > > when the current queues cannot find the queueID of the FetchItem. One
> > > possible reason is that the queue was deleted by the empty-queue reaping
> > > in the FetchItemQueues#getFetchItem method once that queue became empty.
> > > The FetchItem is unblocked when its fetch finishes, but if an exception
> > > is thrown after that, the FetchItem is unblocked a second time. By then
> > > the queue has already been removed, so this WARN is logged.
> > >
> > > <
> > > 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher
> > > caught:java.lang.NullPointerException
> > >>
> > >
> > > There are two places that can log this exception: one is in the
> > > Fetcher#output method, the other is in the Fetcher#run method. Judging by
> > > the order of your log entries, it was probably logged by Fetcher#run. I
> > > could not find anything that would cause this NPE. Can you reproduce the
> > > exception and provide a more detailed log?
> > >
> > > Thanks
> > >
> > >
> > >
> > > On Thu, May 9, 2013 at 6:45 PM, Tejas Patil <[email protected]> wrote:
> > >
> > >> Hey Urs,
> > >> Please see the logs/hadoop.log file and share the stack trace of the
> > >> exception.
> > >>
> > >>
> > >> On Thu, May 9, 2013 at 3:36 AM, Urs Hofer <[email protected]> wrote:
> > >>
> > >>> Dear List
> > >>>
> > > >>> I'm currently running the nutch crawl script with solr4.
> > > >>> Additionally, I'm using the urlmeta plugin and I'm parsing keyword and
> > > >>> description metadata as well. The crawl script runs in local mode.
> > > >>> Currently, I'm seeding about 500 domains.
> > > >>>
> > > >>> The crawl script runs without problems the first time. On a recrawl, it
> > > >>> dies with the error
> > >>>
> > > >>> 2013-05-09 03:57:56,967 WARN  fetcher.Fetcher - Attempting to finish item
> > > >>> from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
> > >>> 2013-05-09 03:57:56,967 INFO  fetcher.Fetcher - -finishing thread
> > >>> FetcherThread, activeThreads=27
> > >>> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread
> > >>> FetcherThread, activeThreads=1
> > >>> 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher
> > >>> caught:java.lang.NullPointerException
> > >>> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread
> > >>> FetcherThread, activeThreads=2
> > > >>> 2013-05-09 03:57:56,969 WARN  fetcher.Fetcher - Attempting to finish item
> > > >>> from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@504e3c43
> > >>> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread
> > >>> FetcherThread, activeThreads=0
> > >>> 2013-05-09 03:57:57,942 ERROR fetcher.Fetcher - Fetcher:
> > >>> java.io.IOException: Job failed!
> > > >>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
> > >>>        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
> > >>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
> > >>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
> > >>>
> > >>>
> > >>> after the fetcher part.
> > >>>
> > >>> I have a very liberal regex-urlfilter configuration:
> > >>>
> > >>> -^(file|ftp|mailto):
> > >>>
> > >>>
> > >>
> >
> -\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> > >>> +.
> > >>>
> > >>> But I am restricting the crawl to db.ignore.external.links = true
> > >>> Can it be because I've removed the line
> > >>>
> > >>> # skip URLs containing certain characters as probable queries, etc.
> > >>> -.*[?*!@=].*
> > >>>
> > > >>> in regex-urlfilter? I have several seed entries like
> > > >>> index.php?language=fr
> > > >>>
> > > >>> Since I'm new to nutch, I don't know where to search and how to continue.
> > >>> Thanks for any help
> > >>> Best regards
> > >>> Urs Hofer
> > >>
> > >
> > >
> > >
> > > --
> > > Don't Grow Old, Grow Up... :-)
> >
> >
>
