Unfetched, unparsed, or just a bad, corrupt segment. Remove the offending segments and try again; a sketch of one way to do that is below, and there is a note about the http.agent.name warning after your quoted message.
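
The segments your exception lists have no parse_data directory, which usually means the parse step never ran or died part-way; an interrupted crawl, like the one before your reboot, will leave such half-finished segments behind. Something along these lines (a rough sketch, untested, assuming the runtime/local layout shown in your output; the segments.broken directory name is just a suggestion) moves the incomplete segments aside and rebuilds the linkdb from the rest:

-----
cd ~/Bureau/apache-nutch-1.4-bin/runtime/local

# park every segment that was never parsed (no parse_data) out of the way
mkdir -p crawl/segments.broken
for seg in crawl/segments/*; do
    if [ ! -d "$seg/parse_data" ]; then
        echo "moving aside incomplete segment: $seg"
        mv "$seg" crawl/segments.broken/
    fi
done

# rebuild the link database from the remaining, complete segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
-----

Once those segments are gone (or re-fetched and re-parsed), the LinkDb step should stop tripping over the missing parse_data inputs.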
> Many thanks Remi.
>
> Finally, after a reboot of the computer (I sent my question just before
> leaving my desk), Nutch started to crawl (amazing :))) )
>
> But now, during the crawl process, I got this:
>
> -----
>
> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459/parse_data
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>
> -----
>
> and nothing special in the logs; the last lines are:
>
> 2012-02-23 09:46:42,524 INFO crawl.CrawlDb - CrawlDb update: finished at 2012-02-23 09:46:42, elapsed: 00:00:01
> 2012-02-23 09:46:42,590 INFO crawl.LinkDb - LinkDb: starting at 2012-02-23 09:46:42
> 2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
> 2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: URL normalize: true
> 2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: URL filter: true
> 2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093220
> 2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234
> 2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093302
> 2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609
> 2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805
> 2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532
> 2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094427
> 2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093618
> 2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094552
> 2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094500
> 2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132
> 2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093649
> 2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093210
> 2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
> 2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
> 2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
> 2012-02-23 09:46:42,597 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
>
> On 22/02/2012 16:36, remi tassing wrote:
> > Hey Daniel,
> >
> > You can find more log output in the logs/hadoop.log file.
> >
> > Remi
> >
> > On Wednesday, February 22, 2012, Daniel Bourrion <[email protected]> wrote:
> >> Hi.
> >> I'm a French librarian (that explains the bad English coming now... :) )
> >>
> >> Newbie on Nutch, which looks exactly like what I'm searching for (an
> >> open-source solution that can crawl our specific domain and push its
> >> crawl results into Solr).
> >>
> >> I've installed a test Nutch using
> >> http://wiki.apache.org/nutch/NutchTutorial
> >>
> >> I got an error, but I don't really understand it or where to look to
> >> correct what causes it.
> >>
> >> Here's a copy of the error messages - any help welcome.
> >> Best
> >>
> >> --------------------------------------------------
> >> daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
> >> solrUrl is not set, indexing will be skipped...
> >> crawl started in: crawl
> >> rootUrlDir = urls
> >> threads = 10
> >> depth = 3
> >> solrUrl=null
> >> topN = 5
> >> Injector: starting at 2012-02-22 16:06:04
> >> Injector: crawlDb: crawl/crawldb
> >> Injector: urlDir: urls
> >> Injector: Converting injected urls to crawl db entries.
> >> Injector: Merging injected urls into crawl db.
> >> Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
> >> Generator: starting at 2012-02-22 16:06:06
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: filtering: true
> >> Generator: normalizing: true
> >> Generator: topN: 5
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: Partitioning selected urls for politeness.
> >> Generator: segment: crawl/segments/20120222160609
> >> Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
> >> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> >> Fetcher: starting at 2012-02-22 16:06:10
> >> Fetcher: segment: crawl/segments/20120222160609
> >> Using queue mode : byHost
> >> Fetcher: threads: 10
> >> Fetcher: time-out divisor: 2
> >> QueueFeeder finished: total 2 records + hit by time limit :0
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> fetching http://bu.univ-angers.fr/
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> fetching http://www.face-ecran.fr/
> >> -finishing thread FetcherThread, activeThreads=2
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> -finishing thread FetcherThread, activeThreads=2
> >> Fetcher: throughput threshold: -1
> >> Fetcher: throughput threshold retries: 5
> >> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=0
> >> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> -activeThreads=0
> >> Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
> >> ParseSegment: starting at 2012-02-22 16:06:13
> >> ParseSegment: segment: crawl/segments/20120222160609
> >> Parsing: http://bu.univ-angers.fr/
> >> Parsing: http://www.face-ecran.fr/
> >> Exception in thread "main" java.io.IOException: Job failed!
> >>
> >>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> >>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> >>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
> >>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> >>
> >> --------------------------------------------------
> >>
> >> --
> >> With my most cordial regards.
> >> __
> >>
> >> Daniel Bourrion, library curator
> >> Head of the digital library
> >> Direct line: 02.44.68.80.50
> >> SCD Université d'Angers - http://bu.univ-angers.fr
> >> BU Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
> >>
> >> ***********************************
> >> "And by the power of a word
> >> I begin my life again"
> >>
> >> Paul Eluard
> >>
> >> ***********************************
> >> personal blog: http://archives.face-ecran.fr/
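
PS, unrelated to the broken segments: your first run warned "Your 'http.agent.name' value should be listed first in 'http.robots.agents' property". It is only a warning, but easy to silence: the robots.txt handling wants your agent name listed first among the robots agents. A sketch of the config (assuming a fresh 1.4 install whose conf/nutch-site.xml is still the empty template; merge the properties by hand if you have already edited it, and "MyNutchTestCrawler" is only a placeholder for whatever agent name you actually configured):

-----
cd ~/Bureau/apache-nutch-1.4-bin/runtime/local

# identify the crawler, and list the same name first among robots agents;
# note this replaces the existing conf/nutch-site.xml
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchTestCrawler</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>MyNutchTestCrawler,*</value>
  </property>
</configuration>
EOF
-----

And as Remi said, whatever made the original ParseSegment job fail will be spelled out in logs/hadoop.log.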

