Hey Urs,

Please check the logs/hadoop.log file and share the full stack trace of the exception.

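In case it helps to pull the relevant part out of a large log, here is a minimal Python sketch. It assumes the log uses the same layout as the lines quoted in this thread (timestamp, level, logger, message) and that stack trace lines follow their ERROR line without a timestamp of their own. The names `error_entries` and `print_errors` are made up for this example and are not part of Nutch.

```python
import re
from typing import Iterable, List

# Assumed layout: "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger - message"
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+)")


def error_entries(lines: Iterable[str]) -> List[str]:
    """Return each ERROR/FATAL entry together with its continuation lines."""
    entries: List[List[str]] = []
    current = None  # lines of the ERROR entry being collected, if any
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY_START.match(line)
        if match:
            # A timestamped line starts a new entry and closes the previous one.
            if current is not None:
                entries.append(current)
            current = [line] if match.group(1) in ("ERROR", "FATAL") else None
        elif current is not None and line.strip():
            # Lines without a timestamp (e.g. "at ...") belong to the open entry.
            current.append(line)
    if current is not None:
        entries.append(current)
    return ["\n".join(entry) for entry in entries]


def print_errors(path: str) -> None:
    """Print every ERROR/FATAL entry found in the given log file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for entry in error_entries(handle):
            print(entry)
            print("-" * 40)
```

Calling `print_errors("logs/hadoop.log")` from the directory that contains `logs/` should print each error with whatever trace lines follow it.
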
On Thu, May 9, 2013 at 3:36 AM, Urs Hofer <[email protected]> wrote:
> Dear List
>
> I'm currently running the nutch crawl script with solr4.
> Additionally, I'm using the urlmeta plugin and I'm parsing keyword and
> description metadata as well. The crawl script runs in local mode.
> Currently, I'm seeding about 500 domains.
>
> The crawl script runs without problems the first time. On a recrawl, it
> dies with the error
>
> 2013-05-09 03:57:56,967 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
> 2013-05-09 03:57:56,967 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=27
> 2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
> 2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
> 2013-05-09 03:57:56,969 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@504e3c43
> 2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2013-05-09 03:57:57,942 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
>
> after the fetcher part.
>
> I have a very liberal regex-urlfilter configuration:
>
> -^(file|ftp|mailto):
>
> -\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> +.
>
> But I am restricting the crawl to db.ignore.external.links = true
>
> Can it be because I've removed the line
>
> # skip URLs containing certain characters as probable queries, etc.
> -.*[?*!@=].*
>
> in regex-urlfilter? I got several seed entries like index.php?language=fr
>
> Since I'm new to nutch, I don't know where to search and how to continue.
>
> Thanks for any help
> Best regards
> Urs Hofer
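
On the regex-urlfilter question: the rough Python sketch below shows what removing that rule changes at the filter level. It assumes the filter applies rules top to bottom, lets the first matching rule decide, matches anywhere in the URL, and rejects a URL that no rule matches. Python's `re` stands in for Java's regex engine here, the helper names are invented for this example, and `example.com` is a placeholder host.

```python
import re
from typing import List, Tuple

Rule = Tuple[bool, "re.Pattern[str]"]

# Rules as posted in the mail (the query-character rule has been removed).
POSTED_RULES = r"""
-^(file|ftp|mailto):
-\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
+.
"""

# Same rules with the removed line put back ahead of the catch-all "+.".
WITH_QUERY_RULE = POSTED_RULES.replace("+.", "-.*[?*!@=].*\n+.")


def parse_rules(text: str) -> List[Rule]:
    """Turn '+regex' / '-regex' lines into (accept, compiled pattern) pairs."""
    rules: List[Rule] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules: List[Rule], url: str) -> bool:
    """First matching rule decides; no match means the URL is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    url = "http://example.com/index.php?language=fr"
    print(accepts(parse_rules(POSTED_RULES), url))     # accepted
    print(accepts(parse_rules(WITH_QUERY_RULE), url))  # rejected
```

Under these assumptions, removing the rule only means that URLs with query strings now pass the filter and get fetched. That alone says nothing about where the NullPointerException comes from, which is why the stack trace is still needed.
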

