That segment is unfetched, unparsed, or simply corrupt. Remove that segment
and try again.
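
For example, a quick way to spot the incomplete segments before removing them
(a minimal sketch; the paths assume the local install shown in your trace):

-----
# List segments that have no parse_data (hypothetical cleanup helper).
cd ~/Bureau/apache-nutch-1.4-bin/runtime/local
for seg in crawl/segments/*; do
  if [ ! -d "$seg/parse_data" ]; then
    echo "incomplete: $seg"
    # rm -r "$seg"   # uncomment once the list looks right
  fi
done
-----

If a segment was fetched but never parsed, you may also be able to re-run the
parse step on it instead of deleting it, e.g.
bin/nutch parse crawl/segments/20120222160234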

> Many thanks, Remi.
> 
> Finally, after a reboot of the computer (I sent my question just before
> leaving my desk), Nutch started to crawl (amazing :))) )
> 
> But now, during the crawl process, I got this:
> 
> -----
> 
> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642/parse_data
> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459/parse_data
>      at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>      at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>      at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>      at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>      at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>      at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> 
> -----
> 
> and nothing special in the logs; the last lines are:
> 
> 2012-02-23 09:46:42,524 INFO  crawl.CrawlDb - CrawlDb update: finished at 2012-02-23 09:46:42, elapsed: 00:00:01
> 2012-02-23 09:46:42,590 INFO  crawl.LinkDb - LinkDb: starting at 2012-02-23 09:46:42
> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL normalize: true
> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL filter: true
> 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093220
> 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234
> 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093302
> 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609
> 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805
> 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532
> 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094427
> 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093618
> 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094552
> 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094500
> 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132
> 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093649
> 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093210
> 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
> 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
> 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
> 2012-02-23 09:46:42,597 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
> 
> On 22/02/2012 16:36, Remi Tassing wrote:
> > Hey Daniel,
> > 
> > You can find more log output in the logs/hadoop.log file
> > 
> > Remi
> > 
> > On Wednesday, February 22, 2012, Daniel Bourrion <[email protected]> wrote:
> >> Hi.
> >> I'm a French librarian (that explains the bad English coming now... :) )
> >> 
> >> I'm a newbie with Nutch, which looks like exactly what I'm searching for
> >> (an open-source solution that can crawl our specific domain and push its
> >> crawl results into Solr).
> >> 
> >> I've installed a test Nutch using
> >> http://wiki.apache.org/nutch/NutchTutorial
> >> 
> >> I got an error, but I don't really understand it, nor where to look to
> >> correct what causes it.
> >> 
> >> Here's a copy of the error messages - any help welcome.
> >> Best
> >> 
> >> --------------------------------------------------
> >> daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
> >> solrUrl is not set, indexing will be skipped...
> >> crawl started in: crawl
> >> rootUrlDir = urls
> >> threads = 10
> >> depth = 3
> >> solrUrl=null
> >> topN = 5
> >> Injector: starting at 2012-02-22 16:06:04
> >> Injector: crawlDb: crawl/crawldb
> >> Injector: urlDir: urls
> >> Injector: Converting injected urls to crawl db entries.
> >> Injector: Merging injected urls into crawl db.
> >> Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
> >> Generator: starting at 2012-02-22 16:06:06
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: filtering: true
> >> Generator: normalizing: true
> >> Generator: topN: 5
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: Partitioning selected urls for politeness.
> >> Generator: segment: crawl/segments/20120222160609
> >> Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
> >> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> >> Fetcher: starting at 2012-02-22 16:06:10
> >> Fetcher: segment: crawl/segments/20120222160609
> >> Using queue mode : byHost
> >> Fetcher: threads: 10
> >> Fetcher: time-out divisor: 2
> >> QueueFeeder finished: total 2 records + hit by time limit :0
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> fetching http://bu.univ-angers.fr/
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> fetching http://www.face-ecran.fr/
> >> -finishing thread FetcherThread, activeThreads=2
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> -finishing thread FetcherThread, activeThreads=2
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=2
> >> -finishing thread FetcherThread, activeThreads=2
> >> Fetcher: throughput threshold: -1
> >> Fetcher: throughput threshold retries: 5
> >> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=0
> >> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> -activeThreads=0
> >> Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
> >> ParseSegment: starting at 2012-02-22 16:06:13
> >> ParseSegment: segment: crawl/segments/20120222160609
> >> Parsing: http://bu.univ-angers.fr/
> >> Parsing: http://www.face-ecran.fr/
> >> Exception in thread "main" java.io.IOException: Job failed!
> >> 
> >>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> >>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> >>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
> >>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> >> 
> >> --------------------------------------------------
> >> 
> >> --
> >> With my most cordial greetings.
> >> __
> >> 
> >> Daniel Bourrion, library curator
> >> Head of the digital library
> >> Direct line: 02.44.68.80.50
> >> SCD Université d'Angers - http://bu.univ-angers.fr
> >> BU Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
> >> 
> >> ***********************************
> >> "And by the power of a word
> >> I begin my life again"
> >> 
> >>                        Paul Eluard
> >> 
> >> ***********************************
> >> personal blog: http://archives.face-ecran.fr/
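
Following up on Remi's pointer to the logs: the underlying cause of the
original "Job failed!" in ParseSegment should show up as an ERROR in
runtime/local/logs/hadoop.log, and that exception is the reason those
segments have no parse_data. A quick way to pull it out (again a sketch;
adjust the path to your install):

-----
# Hypothetical log check: show the most recent ERROR lines with context.
cd ~/Bureau/apache-nutch-1.4-bin/runtime/local
grep -n -A 10 "ERROR" logs/hadoop.log | tail -n 40
-----

Removing the broken segments clears the symptom; the hadoop.log entry tells
you whether the parse failure will just happen again.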
