Maybe your problem is caused by this issue [0], or you can refer to this
thread: http://permalink.gmane.org/gmane.comp.search.nutch.devel/36673
Hope that helps.

[0] https://issues.apache.org/jira/browse/NUTCH-1182

On Fri, May 10, 2013 at 2:56 PM, Tejas Patil <[email protected]> wrote:

> Hey Urs,
> Please see the logs/hadoop.log file and share all of the stack traces of
> the exception. The stack trace you shared so far doesn't highlight the
> actual problem; it just hints that the fetch job failed.
>
>
> On Thu, May 9, 2013 at 11:31 PM, AC Nutch <[email protected]> wrote:
>
> > If I'm not mistaken, 140 threads is way, way on the high side. Unless
> > you have some massive servers, I can't see them handling that. I can
> > barely get my servers to handle more than 15 or so. Perhaps try
> > decreasing that and see if that fixes the issue, for example with a
> > setting like the one below.
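> >
> > Assuming you're setting the thread count via conf/nutch-site.xml rather
> > than a -threads flag on the crawl command (the flag would take
> > precedence), something like this should dial it down. The shipped
> > default is 10, and the 15 here is only a guess at what a modest server
> > sustains:
> >
> >   <property>
> >     <name>fetcher.threads.fetch</name>
> >     <value>15</value>
> >     <description>Number of fetcher threads per fetch task.</description>
> >   </property>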
> >
> > Alex
> >
> >
> > On Thu, May 9, 2013 at 7:17 PM, Urs Hofer <[email protected]> wrote:
> >
> > > Dear Feng Lu
> > >
> > > I'm not sure, but the problem is more the last exception:
> > >
> > > >>> java.io.IOException: Job failed!
> > > >>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
> > > >>>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
> > > >>>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
> > > >>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > >>>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
> > >
> > > The others might be annoying, but this one actually stops the
> > > execution of the script…
> > >
> > > Best
> > > urs
> > >
> > >
> > > On 09.05.2013 at 17:14, feng lu <[email protected]> wrote:
> > >
> > > > Hi Urs
> > > >
> > > > Are you using Nutch 1.6?
> > > >
> > > > < 2013-05-09 03:57:56,967 WARN fetcher.Fetcher - Attempting to finish
> > > > item from unknown queue:
> > > > org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4 >
> > > >
> > > > This is logged by the FetchItemQueues#finishFetchItem method when the
> > > > queues can no longer find the queueID of the FetchItem. One way this
> > > > happens is that the queue has already been deleted by the empty-queue
> > > > reaping in the FetchItemQueues#getFetchItem method: the FetchItem is
> > > > unblocked once its fetch finishes, but if an exception is thrown
> > > > after that, the same FetchItem is unblocked a second time. By then
> > > > the now-empty queue has been reaped, so the second call logs this
> > > > WARN; the sketch below walks through the sequence.
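> > > >
> > > > A minimal, self-contained sketch of that sequence (a simplified
> > > > model, not the actual Fetcher source; the class and field names are
> > > > invented for illustration):
> > > >
> > > >   import java.util.HashMap;
> > > >   import java.util.Iterator;
> > > >   import java.util.Map;
> > > >
> > > >   // Simplified model of the race, not the Nutch source.
> > > >   public class QueueRaceSketch {
> > > >       // queueID -> number of FetchItems currently in progress
> > > >       static Map<String, Integer> inProgress = new HashMap<String, Integer>();
> > > >
> > > >       static void finishFetchItem(String queueID) {
> > > >           Integer n = inProgress.get(queueID);
> > > >           if (n == null) {
> > > >               // The second finish arrives after the queue was reaped.
> > > >               System.out.println("WARN: Attempting to finish item"
> > > >                       + " from unknown queue: " + queueID);
> > > >               return;
> > > >           }
> > > >           inProgress.put(queueID, n - 1);
> > > >       }
> > > >
> > > >       // What getFetchItem() does with queues that have nothing
> > > >       // queued and nothing in progress.
> > > >       static void reapEmptyQueues() {
> > > >           for (Iterator<Map.Entry<String, Integer>> it =
> > > >                   inProgress.entrySet().iterator(); it.hasNext();) {
> > > >               if (it.next().getValue() == 0) it.remove();
> > > >           }
> > > >       }
> > > >
> > > >       public static void main(String[] args) {
> > > >           inProgress.put("queue:example.com", 1); // one item being fetched
> > > >           finishFetchItem("queue:example.com");   // fetch completes normally
> > > >           reapEmptyQueues();                      // queue now empty, so reaped
> > > >           finishFetchItem("queue:example.com");   // exception path finishes
> > > >                                                   // the same item again -> WARN
> > > >       }
> > > >   }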
> > > >
> > > > < 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher
> > > > caught:java.lang.NullPointerException >
> > > >
> > > > There are two places that can log this exception: one is the
> > > > Fetcher#output method, the other is the Fetcher#run method. Judging
> > > > by the order of your log, it is probably logged by Fetcher#run. I
> > > > cannot find anything that would cause this NPE. Can you reproduce the
> > > > exception and provide a more detailed log?
> > > >
> > > > Thanks
> > > >
> > > >
> > > > On Thu, May 9, 2013 at 6:45 PM, Tejas Patil <[email protected]> wrote:
> > > >
> > > >> Hey Urs,
> > > >> Please see the logs/hadoop.log file and share the stack trace of the
> > > >> exception.
> > > >>
> > > >>
> > > >> On Thu, May 9, 2013 at 3:36 AM, Urs Hofer <[email protected]> wrote:
> > > >>
> > > >>> Dear List
> > > >>>
> > > >>> I'm currently running the Nutch crawl script with Solr 4.
> > > >>> Additionally, I'm using the urlmeta plugin, and I'm parsing keyword
> > > >>> and description metadata as well. The crawl script runs in local
> > > >>> mode. Currently, I'm seeding about 500 domains.
> > > >>>
> > > >>> The crawl script runs without problems the first time. On a
> > > >>> recrawl, it dies after the fetcher part with this error:
> > > >>>
> > > >>> 2013-05-09 03:57:56,967 WARN fetcher.Fetcher - Attempting to finish item
> > > >>> from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
> > > >>> 2013-05-09 03:57:56,967 INFO fetcher.Fetcher - -finishing thread
> > > >>> FetcherThread, activeThreads=27
> > > >>> 2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread
> > > >>> FetcherThread, activeThreads=1
> > > >>> 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher
> > > >>> caught:java.lang.NullPointerException
> > > >>> 2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread
> > > >>> FetcherThread, activeThreads=2
> > > >>> 2013-05-09 03:57:56,969 WARN fetcher.Fetcher - Attempting to finish item
> > > >>> from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@504e3c43
> > > >>> 2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread
> > > >>> FetcherThread, activeThreads=0
> > > >>> 2013-05-09 03:57:57,942 ERROR fetcher.Fetcher - Fetcher:
> > > >>> java.io.IOException: Job failed!
> > > >>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
> > > >>>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
> > > >>>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
> > > >>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > >>>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
> > > >>>
> > > >>> I have a very liberal regex-urlfilter configuration:
> > > >>>
> > > >>> -^(file|ftp|mailto):
> > > >>> -\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> > > >>> +.
> > > >>>
> > > >>> But I am restricting the crawl with db.ignore.external.links = true.
> > > >>> Could it be because I've removed the lines
> > > >>>
> > > >>> # skip URLs containing certain characters as probable queries, etc.
> > > >>> -.*[?*!@=].*
> > > >>>
> > > >>> from regex-urlfilter.txt? I have several seed entries like
> > > >>> index.php?language=fr
> > > >>>
> > > >>> Since I'm new to Nutch, I don't know where to look or how to
> > > >>> continue. Thanks for any help.
> > > >>>
> > > >>> Best regards
> > > >>> Urs Hofer
> > > >>
> > > >
> > > >
> > > > --
> > > > Don't Grow Old, Grow Up... :-)
> > >


--
Don't Grow Old, Grow Up... :-)

