Hi,

I'm a beginner using Nutch 1.3 on Windows 7 with Cygwin, and I followed the tutorial:
http://wiki.apache.org/nutch/NutchTutorial

I'm trying to crawl wikipedia.org as a start, and I'm running into a similar problem: the segments/content input path does not exist. The path really does not exist on disk (nothing got fetched).

Where do I have to adjust the disk space of my temporary directory?

One more thing: there are two conf directories in Nutch's distribution. Which one is used? For now I'm updating the configuration files in both of them.

Thank you!

Regards,
Robert

Crawl Trace:

$ runtime/local/bin/nutch crawl urls -solr http://127.0.0.1:8983/solr -dir crawl -depth 8 -topN 50000 -threads 16
crawl started in: crawl
rootUrlDir = urls
threads = 16
depth = 8
solrUrl=http://127.0.0.1:8983/solr
topN = 50000
Injector: starting at 2011-07-12 14:34:16
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-12 14:34:19, elapsed: 00:00:03
Generator: starting at 2011-07-12 14:34:19
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20110712143422
Generator: finished at 2011-07-12 14:34:23, elapsed: 00:00:04
Fetcher: starting at 2011-07-12 14:34:23
Fetcher: segment: crawl/segments/20110712143422
Fetcher: threads: 16
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.wikipedia.org/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-12 14:34:26, elapsed: 00:00:02
ParseSegment: starting at 2011-07-12 14:34:26
ParseSegment: segment: crawl/segments/20110712143422
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/tools/nutch-1.3/crawl/segments/20110712143422/content
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:137)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)

--
View this message in context: http://lucene.472066.n3.nabble.com/Crawl-fails-Input-path-does-not-exist-tp996823p3162299.html
Sent from the Nutch - User mailing list archive at Nabble.com.
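
For anyone reproducing this, a small check of the segment directory can show whether the fetcher wrote anything before ParseSegment runs. The following is only a rough sketch: the function name `missing_segment_parts` and the constant `FETCH_OUTPUTS` are made up for illustration, and it assumes the usual Nutch 1.x layout in which a generated and fetched segment contains `crawl_generate`, `crawl_fetch` and `content` subdirectories.

```python
from pathlib import Path

# Assumption: subdirectories a Nutch 1.x segment normally holds after the
# generate and fetch steps (before parsing). Names are not taken from the post.
FETCH_OUTPUTS = ("crawl_generate", "crawl_fetch", "content")


def missing_segment_parts(segment_dir, expected=FETCH_OUTPUTS):
    """Return the expected subdirectory names that are absent from segment_dir."""
    segment = Path(segment_dir)
    return [name for name in expected if not (segment / name).is_dir()]


if __name__ == "__main__":
    # Segment path taken from the crawl trace above; adjust for your own run.
    segment = "crawl/segments/20110712143422"
    missing = missing_segment_parts(segment)
    if missing:
        print(f"{segment}: missing {', '.join(missing)}")
    else:
        print(f"{segment}: all fetch outputs present")
```

If `content` is reported missing while `crawl_generate` is present, the fetch step produced no output, which would match the InvalidInputException in the trace; logs/hadoop.log is then the place to look for the underlying fetch error.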

