Re: Crawl fails - Input path does not exist

lewis john mcgibbney Tue, 12 Jul 2011 10:05:41 -0700

Hi Robertito,

Please refer to the following for the current Nutch 1.3 tutorial.


It should all be pretty straightforward; however, if anything causes
confusion, please post back.
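
In case it helps while you work through it, here is a rough sketch of a
check you could run on a segment before the parse step. It only tests
whether the subdirectories I would expect after a successful fetch are
present. The function name check_segment is made up for illustration, and
the list of subdirectories is my assumption about the usual segment layout,
not something taken from the tutorial.

```shell
#!/bin/sh
# Hypothetical helper: report which expected subdirectories of a Nutch
# segment are missing. The list (crawl_generate, crawl_fetch, content) is an
# assumption about the usual layout after generate and fetch have run.
check_segment() {
    segment="$1"
    missing=""
    for sub in crawl_generate crawl_fetch content; do
        # Collect the name of every subdirectory that is not there.
        if [ ! -d "$segment/$sub" ]; then
            missing="$missing $sub"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}
```

With the segment from your trace you would call it as
check_segment crawl/segments/20110712143422. If content is reported as
missing, the fetch step stored nothing, which would match the exception
ParseSegment throws.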

On Tue, Jul 12, 2011 at 1:46 PM, robertito <[email protected]> wrote:

> Hi,
>
> I'm a beginner using Nutch 1.3 on Windows 7 with Cygwin and followed the
> tutorial:
>
> http://wiki.apache.org/nutch/NutchTutorial
>
> I'm trying to crawl wikipedia.org as a start, and I'm having a similar
> problem with the segments/content path that does not exist. The path does
> indeed not exist (nothing got fetched).
>
> Where do I have to adjust the disk space of my temporary directory?
>
> Just something else: There are two conf directories in Nutch's distribution.
> Which one is used? I'm updating the configuration files in both of them.
>
> Thank you!
> Regards,
> Robert
>
> Crawl Trace:
>
> $ runtime/local/bin/nutch crawl urls -solr http://127.0.0.1:8983/solr -dir
> crawl -depth 8 -topN 50000 -threads 16
> crawl started in: crawl
> rootUrlDir = urls
> threads = 16
> depth = 8
> solrUrl=http://127.0.0.1:8983/solr
> topN = 50000
> Injector: starting at 2011-07-12 14:34:16
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-07-12 14:34:19, elapsed: 00:00:03
> Generator: starting at 2011-07-12 14:34:19
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50000
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20110712143422
> Generator: finished at 2011-07-12 14:34:23, elapsed: 00:00:04
> Fetcher: starting at 2011-07-12 14:34:23
> Fetcher: segment: crawl/segments/20110712143422
> Fetcher: threads: 16
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching http://www.wikipedia.org/
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-12 14:34:26, elapsed: 00:00:02
> ParseSegment: starting at 2011-07-12 14:34:26
> ParseSegment: segment: crawl/segments/20110712143422
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> file:/C:/tools/nutch-1.3/crawl/segments/20110712143422/content
>        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
>        at org.apache.nutch.crawl.Crawl.run(Crawl.java:137)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Crawl-fails-Input-path-does-not-exist-tp996823p3162299.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*
