Re: nutch 1.6, bin/crawl fails on recrawl with java.io.IOException

Tejas Patil Thu, 09 May 2013 03:45:54 -0700

Hey Urs,
Please see the logs/hadoop.log file and share the stack trace of the
exception.
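
In case it helps to pull just the relevant part out of a large log, here is a minimal Python sketch. It assumes the log4j-style layout seen in the quoted output below (a timestamp at the start of every record, stack trace lines without one); the function name extract_error_blocks is only illustrative and not part of Nutch.

```python
import re

# Start of a fresh log record, e.g. "2013-05-09 03:57:56,969 " (assumed layout).
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")


def extract_error_blocks(lines):
    """Return each ERROR record together with its continuation lines (the stack trace)."""
    blocks = []
    current = None  # lines of the ERROR record being collected, if any
    for line in lines:
        line = line.rstrip("\n")
        if TIMESTAMP.match(line):
            # A new record starts, so close any ERROR block that is still open.
            if current is not None:
                blocks.append(current)
            current = [line] if " ERROR " in line else None
        elif current is not None:
            # No timestamp: this line continues the current ERROR record.
            current.append(line)
    if current is not None:
        blocks.append(current)
    return blocks
```

Something like extract_error_blocks(open("logs/hadoop.log")) should then return the ERROR records with their full traces, which can be pasted into a reply.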



On Thu, May 9, 2013 at 3:36 AM, Urs Hofer <[email protected]> wrote:

> Dear List
>
> I'm currently running the Nutch crawl script with Solr 4.
> Additionally, I'm using the urlmeta plugin and parsing keyword and
> description metadata as well. The crawl script runs in local mode.
> Currently, I'm seeding about 500 domains.
>
> The crawl script runs without problems the first time. On a recrawl, it
> dies with the error
>
> 2013-05-09 03:57:56,967 WARN  fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@307322f4
> 2013-05-09 03:57:56,967 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=27
> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2013-05-09 03:57:56,969 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
> 2013-05-09 03:57:56,969 WARN  fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.Fetcher$FetchItem@504e3c43
> 2013-05-09 03:57:56,969 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2013-05-09 03:57:57,942 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
>
>
> after the fetcher part.
>
> I have a very liberal regex-urlfilter configuration:
>
> -^(file|ftp|mailto):
>
> -\.(mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> +.
>
> However, I am restricting the crawl by setting db.ignore.external.links = true.
> Could the failure be because I've removed the line
>
> # skip URLs containing certain characters as probable queries, etc.
> -.*[?*!@=].*
>
> from regex-urlfilter? I have several seed entries like index.php?language=fr.
>
> Since I'm new to Nutch, I don't know where to look or how to continue.
> Thanks for any help.
> Best regards
> Urs Hofer
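
To see what removing the query-character rule changes for a seed such as index.php?language=fr, the filter rules quoted above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: that the rules are tried top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The suffix list is abbreviated, the position of the query rule in the second rule set is assumed, the host example.com is made up, and Python's re module stands in for Java regular expressions. It shows only which URLs pass the filter; it says nothing about the cause of the NullPointerException.

```python
import re

# Rule set as described in the mail (suffix list abbreviated for readability).
CURRENT_RULES = r"""
-^(file|ftp|mailto):
-\.(mp3|gif|jpg|png|css|zip|js)$
+.
"""

# Same rules with the removed line put back before the catch-all (assumed position).
WITH_QUERY_RULE = r"""
-^(file|ftp|mailto):
-\.(mp3|gif|jpg|png|css|zip|js)$
# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*
+.
"""


def parse_rules(text):
    """Parse filter lines: '+' accepts, '-' rejects, '#' starts a comment."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    url = "http://example.com/index.php?language=fr"  # hypothetical host
    print(accepts(parse_rules(CURRENT_RULES), url))
    print(accepts(parse_rules(WITH_QUERY_RULE), url))
```

Under these assumptions the seed passes the current rule set and is dropped once the query rule is restored, so removing that line is what lets URLs with query strings into the crawl at all.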
