Hi. I am having the same problem (I am new to Nutch too). I am using Nutch 1.4 on Windows 7 with Cygwin. If I understand correctly, the crawling process creates segments, and each segment corresponds to a folder under NUTCH_HOME/runtime/local/crawl/segments/<segment_number>. Under each segment folder a parse_data directory should then be created, but in my case it apparently is not. My linkdb folder (NUTCH_HOME/runtime/local/linkdb) is empty.
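In case it helps anyone else debugging the same thing, this is the quick check I ran from the Cygwin shell to see which segments are missing their parse_data directory. The loop is just my own sketch and assumes the default runtime/local layout shown in the output below:

    # from NUTCH_HOME/runtime/local: flag any segment without a parse_data dir
    for seg in crawl/segments/*; do
      if [ -d "$seg/parse_data" ]; then
        echo "ok      $seg"
      else
        echo "MISSING $seg/parse_data"
      fi
    done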
Output follows:

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
cygpath: can't convert empty path
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-03-12 13:38:06
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-03-12 13:38:09, elapsed: 00:00:02
Generator: starting at 2012-03-12 13:38:09
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120312133811
Generator: finished at 2012-03-12 13:38:12, elapsed: 00:00:03
Fetcher: starting at 2012-03-12 13:38:12
Fetcher: segment: crawl/segments/20120312133811
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-03-12 13:38:17, elapsed: 00:00:04
ParseSegment: starting at 2012-03-12 13:38:17
ParseSegment: segment: crawl/segments/20120312133811
Parsing: http://nutch.apache.org/
ParseSegment: finished at 2012-03-12 13:38:18, elapsed: 00:00:01
CrawlDb update: starting at 2012-03-12 13:38:18
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120312133811]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-03-12 13:38:19, elapsed: 00:00:01
Generator: starting at 2012-03-12 13:38:19
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120312133822
Generator: finished at 2012-03-12 13:38:23, elapsed: 00:00:03
Fetcher: starting at 2012-03-12 13:38:23
Fetcher: segment: crawl/segments/20120312133822
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 5 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/wiki.html
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.apache.org/
fetching http://www.eu.apachecon.com/c/aceu2009/
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetch of http://www.eu.apachecon.com/c/aceu2009/ failed with: java.net.UnknownHostException: www.eu.apachecon.com
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552304945
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552303927
  now           = 1331552304949
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552305950
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552305953
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552306955
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552306957
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552307958
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552307959
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1331552309462
  now           = 1331552308961
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552309251
  now           = 1331552308963
  0. http://www.apache.org/dyn/closer.cgi/nutch/
fetching http://www.apache.org/dyn/closer.cgi/nutch/
fetching http://nutch.apache.org/mailing_lists.html
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-03-12 13:38:31, elapsed: 00:00:08
ParseSegment: starting at 2012-03-12 13:38:31
ParseSegment: segment: crawl/segments/20120312133822
Parsing: http://nutch.apache.org/mailing_lists.html
Parsing: http://nutch.apache.org/wiki.html
Parsing: http://www.apache.org/
Parsing: http://www.apache.org/dyn/closer.cgi/nutch/
ParseSegment: finished at 2012-03-12 13:38:33, elapsed: 00:00:01
CrawlDb update: starting at 2012-03-12 13:38:33
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120312133822]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-03-12 13:38:34, elapsed: 00:00:01
Generator: starting at 2012-03-12 13:38:34
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120312133836
Generator: finished at 2012-03-12 13:38:38, elapsed: 00:00:03
Fetcher: starting at 2012-03-12 13:38:38
Fetcher: segment: crawl/segments/20120312133836
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 5 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://hadoop.apache.org/
Using queue mode : byHost
fetching http://nutch.apache.org/index.html
Using queue mode : byHost
fetching http://www.apache.org/licenses/
Using queue mode : byHost
Using queue mode : byHost
fetching http://tika.apache.org/
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552319434
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552320435
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552321436
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1331552323207
  now           = 1331552322438
  0. http://www.apache.org/foundation/sponsorship.html
fetching http://www.apache.org/foundation/sponsorship.html
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=1, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-03-12 13:38:45, elapsed: 00:00:07
ParseSegment: starting at 2012-03-12 13:38:45
ParseSegment: segment: crawl/segments/20120312133836
Parsing: http://hadoop.apache.org/
Parsing: http://nutch.apache.org/index.html
Parsing: http://tika.apache.org/
Parsing: http://www.apache.org/foundation/sponsorship.html
Parsing: http://www.apache.org/licenses/
ParseSegment: finished at 2012-03-12 13:38:46, elapsed: 00:00:01
CrawlDb update: starting at 2012-03-12 13:38:46
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120312133836]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-03-12 13:38:48, elapsed: 00:00:01
LinkDb: starting at 2012-03-12 13:38:48
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312131223
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132729
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132952
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133110
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133255
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133409
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133811
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133822
LinkDb: adding segment: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133836
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312131223/parse_data
Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132729/parse_data
Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312132952/parse_data
Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133110/parse_data
Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133255/parse_data
Input path does not exist: file:/C:/cygwin/home/Pantelis/nutch1_4/runtime/local/crawl/segments/20120312133409/parse_data
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
	at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
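Reading the output again, the LinkDb step seems to pick up every folder under crawl/segments, including six leftover segments (20120312131223 through 20120312133409) from my earlier aborted runs that were fetched but never parsed, which would explain why exactly those parse_data input paths do not exist. I have not confirmed this is the right fix, but what I plan to try is to either re-parse the stale segments by hand or move them aside, and then rebuild the link database; roughly like this (the segment name below is just the first stale one from the list above):

    # from NUTCH_HOME/runtime/local
    # option 1: re-parse a leftover segment so its parse_data gets created
    bin/nutch parse crawl/segments/20120312131223

    # option 2: move the stale segment aside, then rebuild the linkdb
    # from the segments that remain
    mv crawl/segments/20120312131223 /tmp/
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments

If someone knows whether re-parsing old segments is safe here, or why the crawl command left them unparsed in the first place, I would appreciate a pointer.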

