Many thanks, Remi.
Finally, after a reboot of the computer (I sent my question just before
leaving my desk), Nutch started to crawl (amazing :))) )
But now, during the crawl process, I get this:
-----
LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
-----
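All the missing paths are <segment>/parse_data directories. To check which segments actually lack parse_data, I suppose one can run something like this from runtime/local (an untested sketch based on the paths above):
-----
# print every segment that has no parse_data directory
for seg in crawl/segments/*; do
  [ -d "$seg/parse_data" ] || echo "missing parse_data: $seg"
done
-----
If I read the error right, that should print exactly the segments the exception complains about.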
Otherwise there's nothing special in the logs; the last lines are:
2012-02-23 09:46:42,524 INFO crawl.CrawlDb - CrawlDb update: finished at 2012-02-23 09:46:42, elapsed: 00:00:01
2012-02-23 09:46:42,590 INFO crawl.LinkDb - LinkDb: starting at 2012-02-23 09:46:42
2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: URL normalize: true
2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: URL filter: true
2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093220
2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234
2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093302
2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609
2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805
2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532
2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094427
2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093618
2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094552
2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094500
2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132
2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093649
2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093210
2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
2012-02-23 09:46:42,597 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
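If I understand correctly, those segments were fetched but never parsed (the parse step is what failed with "Job failed!" in my first message below), so the LinkDb step can't find their parse_data. I suppose a workaround would be to remove the incomplete segments and rebuild the linkdb from the complete ones, something like this (untested sketch; I'm assuming invertlinks accepts -dir for a whole segments directory):
-----
# drop segments that were never parsed (no parse_data directory)
for seg in crawl/segments/*; do
  [ -d "$seg/parse_data" ] || rm -r "$seg"
done
# rebuild the linkdb from the remaining, complete segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
-----
Does that sound like the right way to recover, or is there something cleaner?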
On 22/02/2012 16:36, remi tassing wrote:
Hey Daniel,
You can find more log output in the logs/hadoop.log file.
Remi
On Wednesday, February 22, 2012, Daniel Bourrion <[email protected]> wrote:
Hi.
I'm a French librarian (that explains the bad English coming now... :) )
I'm a newbie on Nutch, which looks like exactly what I'm searching for (an open-source solution that can crawl our specific domain and push its crawl results into Solr).
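In the end, I suppose the command would look something like this (just a sketch; I'm assuming a local Solr at the default port and the -solr option the crawl command mentions when solrUrl is not set):
-----
bin/nutch crawl urls -dir crawl -depth 3 -topN 5 -solr http://localhost:8983/solr/
-----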
I've installed a test Nutch using http://wiki.apache.org/nutch/NutchTutorial
I got an error, but I don't really understand it or where to start correcting whatever causes it.
Here's a copy of the error messages; any help is welcome.
Best
--------------------------------------------------
daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-02-22 16:06:04
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
Generator: starting at 2012-02-22 16:06:06
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120222160609
Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-22 16:06:10
Fetcher: segment: crawl/segments/20120222160609
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://bu.univ-angers.fr/
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.face-ecran.fr/
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
ParseSegment: starting at 2012-02-22 16:06:13
ParseSegment: segment: crawl/segments/20120222160609
Parsing: http://bu.univ-angers.fr/
Parsing: http://www.face-ecran.fr/
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
--------------------------------------------------
--
With my most cordial greetings.
__
Daniel Bourrion, library curator
Head of the digital library
Direct line: 02.44.68.80.50
SCD Université d'Angers - http://bu.univ-angers.fr
BU Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
***********************************
"And by the power of a word
I begin my life again"
Paul Eluard
***********************************
personal blog: http://archives.face-ecran.fr/