Dear Feng Lu,

I'm not sure, but I think the main problem is the last exception:
>>> java.io.IOException: Job failed!
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
>>>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)

The others might be annoying, but this one actually stops the execution of the script…

Best
urs

On 09.05.2013 at 17:14, feng lu <[email protected]> wrote:

> Hi Urs
>
> Did you use Nutch 1.6?
>
> <
> 2013-05-09 03:57:56,967 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
>>
>
> This is caused by a call to the FetchItemQueues#finishFetchItem method when
> the current queues cannot find the queue ID of the FetchItem. One possible
> reason is that the queue was deleted when FetchItemQueues#getFetchItem
> reaped empty queues. The FetchItem is unblocked when its fetch finishes; if
> an exception is thrown after that, the FetchItem is unblocked a second time.
> By then the queue has already been emptied and removed, so this WARN is
> logged.
>
> <
> 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
>>
>
> There are two places that can log this exception: one is in the
> Fetcher#output method, the other is in the Fetcher#run method. Judging by
> the order of your log entries, it was probably logged by Fetcher#run. I
> cannot find any reason for this NPE. Can you reproduce the exception and
> provide a more detailed log?
>
> Thanks
>
>
> On Thu, May 9, 2013 at 6:45 PM, Tejas Patil <[email protected]> wrote:
>
>> Hey Urs,
>> Please see the logs/hadoop.log file and share the stack trace of the
>> exception.
>>
>>
>> On Thu, May 9, 2013 at 3:36 AM, Urs Hofer <[email protected]> wrote:
>>
>>> Dear List
>>>
>>> i'm currently running the nutch crawl script with solr4.
>>> additionally, I'm using the urlmeta plugin and I'm parsing keyword and
>>> description metadata as well. the crawl script runs in local mode.
>>> currently, I'm seeding about 500 domains.
>>>
>>> the crawl script runs without problems the first time. on a recrawl, it
>>> dies with the error
>>>
>>> 2013-05-09 03:57:56,967 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
>>> 2013-05-09 03:57:56,967 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=27
>>> 2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>>> 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
>>> 2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
>>> 2013-05-09 03:57:56,969 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@504e3c43
>>> 2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
>>> 2013-05-09 03:57:57,942 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
>>>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
>>>
>>>
>>> after the fetcher part.
>>>
>>> I have a very liberal regex-urlfilter configuration:
>>>
>>> -^(file|ftp|mailto):
>>> -\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>> +.
>>>
>>> But I am restricting the crawl to db.ignore.external.links = true
>>> Can it be because I've removed the line
>>>
>>> # skip URLs containing certain characters as probable queries, etc.
>>> -.*[?*!@=].*
>>>
>>> in regex-urlfilter? I got several seed entries like index.php?language=fr
>>>
>>> Since I'm new to nutch, I don't know where to search and how to continue.
>>> Thanks for any help
>>> Best regards
>>> Urs Hofer
>>
>
>
> --
> Don't Grow Old, Grow Up... :-)
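
On the regex-urlfilter question in the original message: the following is only a rough sketch of how the quoted rules would treat a seed such as index.php?language=fr, with and without the removed query-character line. It assumes the usual regex-urlfilter semantics (rules checked top to bottom, first matching rule decides, patterns matched anywhere in the URL). The class UrlFilterSketch, its helper methods, the host example.com, and the position of the removed rule (just before "+.") are all made up for illustration; this is not the Nutch implementation, and it says nothing about whether the filter is related to the NPE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a first-match-wins URL filter; not Nutch code.
class UrlFilterSketch {
    // One filter line: leading '+' accepts, leading '-' rejects, rest is a regex.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            accept = line.charAt(0) == '+';
            pattern = Pattern.compile(line.substring(1));
        }
    }

    // Rules copied from the thread.
    static final String SCHEMES = "-^(file|ftp|mailto):";
    static final String SUFFIXES =
        "-\\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS"
        + "|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ"
        + "|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$";
    static final String QUERY_CHARS = "-.*[?*!@=].*"; // the line that was removed
    static final String ACCEPT_REST = "+.";

    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            rules.add(new Rule(line));
        }
        return rules;
    }

    // The configuration as quoted in the thread (query rule removed).
    static List<Rule> currentRules() {
        return parse(SCHEMES, SUFFIXES, ACCEPT_REST);
    }

    // Assumption: the removed rule sat just before the final "+." line.
    static List<Rule> rulesWithQueryFilter() {
        return parse(SCHEMES, SUFFIXES, QUERY_CHARS, ACCEPT_REST);
    }

    // First rule whose pattern is found anywhere in the URL decides.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // no rule matched: treated here as rejected
    }

    public static void main(String[] args) {
        String url = "http://example.com/index.php?language=fr";
        System.out.println("current rules accept it:   " + accepts(currentRules(), url));
        System.out.println("with query rule accept it: " + accepts(rulesWithQueryFilter(), url));
    }
}
```

Under these assumptions, removing the query-character rule is what lets seeds with a query string through the filter at all; the remaining rules only reject by scheme and by file suffix.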

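Feng Lu's explanation of the "Attempting to finish item from unknown queue" warning can be illustrated with a small model. This is a simplified sketch under assumptions, not the FetchItemQueues source: the class QueueReapSketch, the ItemQueue holder, and the method names add, getItem, and finishItem are hypothetical. It only shows the sequence he describes: an item is finished, its now empty and idle queue is reaped on the next lookup, and a second finish for the same item no longer finds the queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of per-host fetch queues; not Nutch code.
class QueueReapSketch {
    static final class ItemQueue {
        final Deque<String> pending = new ArrayDeque<>();
        int inProgress = 0;
    }

    private final Map<String, ItemQueue> queues = new LinkedHashMap<>();

    void add(String queueId, String url) {
        queues.computeIfAbsent(queueId, id -> new ItemQueue()).pending.add(url);
    }

    // Hands out the next pending item; queues that are empty and idle are removed.
    String getItem() {
        Iterator<ItemQueue> it = queues.values().iterator();
        while (it.hasNext()) {
            ItemQueue queue = it.next();
            if (queue.pending.isEmpty() && queue.inProgress == 0) {
                it.remove(); // reap the empty queue
                continue;
            }
            if (!queue.pending.isEmpty()) {
                queue.inProgress++;
                return queue.pending.poll();
            }
        }
        return null;
    }

    // Marks an item as done; returns false if its queue no longer exists.
    boolean finishItem(String queueId) {
        ItemQueue queue = queues.get(queueId);
        if (queue == null) {
            System.err.println("Attempting to finish item from unknown queue: " + queueId);
            return false;
        }
        queue.inProgress--;
        return true;
    }

    public static void main(String[] args) {
        QueueReapSketch sketch = new QueueReapSketch();
        sketch.add("example.com", "http://example.com/");
        sketch.getItem();                 // item handed to a fetcher thread
        sketch.finishItem("example.com"); // normal finish after the fetch
        sketch.getItem();                 // queue is empty and idle, so it is reaped
        sketch.finishItem("example.com"); // second finish, e.g. from an exception handler
    }
}
```

In this model the warning by itself is harmless, which is consistent with the point made in the reply that the job failure is the exception that actually stops the script.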
