Dear List,
I'm currently running the Nutch crawl script with Solr 4.
Additionally, I'm using the urlmeta plugin, and I'm parsing keyword and
description metadata as well. The crawl script runs in local mode.
Currently, I'm seeding about 500 domains.
The crawl script runs without problems the first time. On a recrawl, it dies
with the following error:
2013-05-09 03:57:56,967 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
2013-05-09 03:57:56,967 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=27
2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2013-05-09 03:57:56,969 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@504e3c43
2013-05-09 03:57:56,969 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2013-05-09 03:57:57,942 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
The failure occurs after the fetcher step.
I have a very liberal regex-urlfilter configuration:
-^(file|ftp|mailto):
-\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
+.
However, I am restricting the crawl by setting db.ignore.external.links = true.
Could the error be caused by my having removed the line
# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*
from regex-urlfilter? I have several seed entries such as index.php?language=fr.
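
To make the question concrete, here is a minimal Python sketch of how the rules are presumably evaluated: they are tried top to bottom, the first pattern found anywhere in the URL decides (+ accepts, - rejects), and a URL that matches no rule is dropped. These semantics are an assumption for illustration, not Nutch's actual code. The function name, the shortened suffix list, and the example.com URLs are hypothetical.

```python
import re

# Rules as in the current regex-urlfilter (suffix list shortened for brevity).
CURRENT_RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$"),
    ("+", r"."),
]

# The same rules with the removed query-character rule put back in place.
WITH_QUERY_RULE = CURRENT_RULES[:2] + [("-", r".*[?*!@=].*")] + CURRENT_RULES[2:]


def accepts(rules, url):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL matching no rule is rejected


if __name__ == "__main__":
    for url in ["http://example.com/index.php?language=fr",
                "http://example.com/logo.png"]:
        print(url, accepts(CURRENT_RULES, url), accepts(WITH_QUERY_RULE, url))
```

Under these assumptions, a seed such as index.php?language=fr passes the current rules and would be rejected if the query-character rule were restored. The sketch only covers URL filtering and says nothing about the cause of the NullPointerException.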
Since I'm new to Nutch, I don't know where to look or how to proceed.
Thanks for any help.
Best regards,
Urs Hofer