Dear List

I'm currently running the Nutch crawl script with Solr 4. Additionally, I'm
using the urlmeta plugin, and I'm parsing keyword and description metadata as
well. The crawl script runs in local mode. Currently, I'm seeding about 500
domains.

The crawl script runs without problems the first time. On a recrawl, it dies
with the following error:

2013-05-09 03:57:56,967 WARN  fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
2013-05-09 03:57:56,967 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=27
2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2013-05-09 03:57:56,969 WARN  fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@504e3c43
2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2013-05-09 03:57:57,942 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)


The error occurs after the fetcher step.

I have a very liberal regex-urlfilter configuration:

-^(file|ftp|mailto):
-\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
+.

However, I restrict the crawl by setting db.ignore.external.links = true.
Could the error be because I've removed the line

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*

in regex-urlfilter? I have several seed entries like index.php?language=fr
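
To illustrate what the removed rule changes, here is a minimal Python sketch
of how I understand the filter to work: rules are tried top to bottom, the
first rule whose pattern is found in the URL decides, and a URL matching no
rule is rejected. It uses Python's re module rather than Nutch's Java code, so
it is only an approximation. The accepts helper, the example.com host, and the
position of the removed rule (just before the final "+.") are my assumptions.

```python
import re

# Rules as quoted in this post: "+" accepts, "-" rejects.
SUFFIX_RULE = (
    r"\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS"
    r"|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ"
    r"|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)

CURRENT_RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", SUFFIX_RULE),
    ("+", r"."),
]

# Same rules with the removed line put back before the catch-all "+."
# (its exact position in the file is an assumption).
WITH_REMOVED_RULE = CURRENT_RULES[:2] + [("-", r".*[?*!@=].*")] + CURRENT_RULES[2:]


def accepts(url, rules):
    """Return True if the first rule matching the URL is an accept rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumed behaviour: no matching rule means the URL is dropped


if __name__ == "__main__":
    # Hypothetical host; the path mirrors the seed entry mentioned above.
    seed = "http://example.com/index.php?language=fr"
    print(accepts(seed, CURRENT_RULES))
    print(accepts(seed, WITH_REMOVED_RULE))
```

If this model is right, removing the line only means that query URLs such as
the seed above now pass the filter, whereas with the line in place they would
be dropped before fetching.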

Since I'm new to Nutch, I don't know where to look or how to continue.
Thanks for any help.
Best regards
Urs Hofer
