Re: Crawl fails - Input path does not exist

robertito Tue, 12 Jul 2011 06:02:08 -0700

Hi,

I'm a beginner using Nutch 1.3 on Windows 7 with Cygwin, and I followed the
tutorial:

http://wiki.apache.org/nutch/NutchTutorial

I'm trying to crawl wikipedia.org as a start, and I'm running into a similar
problem: the segments/content path does not exist. The path really is missing
(nothing got fetched).

Where can I adjust the disk space available to my temporary directory?

Just something else: There are two conf directories in Nutch's distribution.
Which one is used? I'm updating the configuration files in both of them.

Thank you!
Regards,
Robert

Crawl Trace:

$ runtime/local/bin/nutch crawl urls -solr http://127.0.0.1:8983/solr -dir crawl -depth 8 -topN 50000 -threads 16
crawl started in: crawl
rootUrlDir = urls
threads = 16
depth = 8
solrUrl=http://127.0.0.1:8983/solr
topN = 50000
Injector: starting at 2011-07-12 14:34:16
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-12 14:34:19, elapsed: 00:00:03
Generator: starting at 2011-07-12 14:34:19
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20110712143422
Generator: finished at 2011-07-12 14:34:23, elapsed: 00:00:04
Fetcher: starting at 2011-07-12 14:34:23
Fetcher: segment: crawl/segments/20110712143422
Fetcher: threads: 16
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.wikipedia.org/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-12 14:34:26, elapsed: 00:00:02
ParseSegment: starting at 2011-07-12 14:34:26
ParseSegment: segment: crawl/segments/20110712143422
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/tools/nutch-1.3/crawl/segments/20110712143422/content
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:137)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
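
The exception names a missing content directory inside the segment. As a rough
sketch of how one might confirm which parts of a segment were actually written,
the helper below checks for the subdirectories that the generate and fetch
steps normally produce. The function name check_segment is made up for
illustration and is not part of Nutch, and the list of parts is an assumption
that leaves out the parse output.

```shell
#!/bin/sh
# Report which of the usual pre-parse segment parts exist in a segment directory.
# check_segment is a hypothetical helper, not a Nutch command.
check_segment() {
    segment="$1"
    # crawl_generate is written by the generator; crawl_fetch and content
    # are normally written by the fetcher (assumed list, parse output omitted).
    for part in crawl_generate crawl_fetch content; do
        if [ -d "$segment/$part" ]; then
            echo "present: $part"
        else
            echo "missing: $part"
        fi
    done
}

# Example, using the segment from the trace (relative to the Nutch directory):
# check_segment crawl/segments/20110712143422
```

If content shows up as missing while crawl_generate is present, the fetch step
probably produced no stored pages, which would match the ParseSegment failure.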


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Crawl-fails-Input-path-does-not-exist-tp996823p3162299.html
Sent from the Nutch - User mailing list archive at Nabble.com.
