Hi, I am facing a problem with Apache Nutch 1.3. The output is given below. Please help. Thanks in advance.

manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2011-11-24 15:45:15
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-11-24 15:45:17, elapsed: 00:00:02
Generator: starting at 2011-11-24 15:45:17
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20111124154519
Generator: finished at 2011-11-24 15:45:21, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-11-24 15:45:21
Fetcher: segment: crawl/segments/20111124154519
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://nutch.apache.org/
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-11-24 15:45:43, elapsed: 00:00:22
ParseSegment: starting at 2011-11-24 15:45:43
ParseSegment: segment: crawl/segments/20111124154519
ParseSegment: finished at 2011-11-24 15:45:44, elapsed: 00:00:01
CrawlDb update: starting at 2011-11-24 15:45:44
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20111124154519]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-11-24 15:45:46, elapsed: 00:00:01
Generator: starting at 2011-11-24 15:45:46
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20111124154548
Generator: finished at 2011-11-24 15:45:49, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-11-24 15:45:49
Fetcher: segment: crawl/segments/20111124154548
Fetcher: threads: 10
QueueFeeder finished: total 5 records + hit by time limit :0
fetching http://nutch.apache.org/wiki.html
fetching http://www.apache.org/
fetching http://www.eu.apachecon.com/c/aceu2009/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
* queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129755709 now = 1322129751077 0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129750061 now = 1322129751078 0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
* queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129755709 now = 1322129752078 0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129750061 now = 1322129752079 0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
* queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129755709 now = 1322129753080 0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129750061 now = 1322129753080 0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
* queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129755709 now = 1322129754081 0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129750061 now = 1322129754081 0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
* queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129755709 now = 1322129755083 0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129750061 now = 1322129755083 0. http://www.apache.org/dyn/closer.cgi/nutch/
fetching http://nutch.apache.org/mailing_lists.html
-activeThreads=10, spinWaiting=7, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129750061 now = 1322129756083 0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129750061 now = 1322129757084 0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1322129750061 now = 1322129758085 0. http://www.apache.org/dyn/closer.cgi/nutch/
fetch of http://www.eu.apachecon.com/c/aceu2009/ failed with: java.net.UnknownHostException: www.eu.apachecon.com
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 1 crawlDelay = 4000 minCrawlDelay = 0 nextFetchTime = 1322129750061 now = 1322129759085 0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 4000 minCrawlDelay = 0 nextFetchTime = 1322129764028 now = 1322129760086 0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 4000 minCrawlDelay = 0 nextFetchTime = 1322129764028 now = 1322129761086 0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 4000 minCrawlDelay = 0 nextFetchTime = 1322129764028 now = 1322129762087 0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 4000 minCrawlDelay = 0 nextFetchTime = 1322129764028 now = 1322129763088 0. http://www.apache.org/dyn/closer.cgi/nutch/
fetching http://www.apache.org/dyn/closer.cgi/nutch/
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-activeThreads=3, spinWaiting=2, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-11-24 15:46:06, elapsed: 00:00:17
ParseSegment: starting at 2011-11-24 15:46:06
ParseSegment: segment: crawl/segments/20111124154548
ParseSegment: finished at 2011-11-24 15:46:08, elapsed: 00:00:01
CrawlDb update: starting at 2011-11-24 15:46:08
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20111124154548]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-11-24 15:46:09, elapsed: 00:00:01
Generator: starting at 2011-11-24 15:46:09
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20111124154611
Generator: finished at 2011-11-24 15:46:12, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-11-24 15:46:12
Fetcher: segment: crawl/segments/20111124154611
Fetcher: threads: 10
fetching http://hadoop.apache.org/
fetching http://nutch.apache.org/index.html
fetching http://www.apache.org/licenses/
fetching http://forrest.apache.org/
QueueFeeder finished: total 5 records + hit by time limit :0
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 4000 minCrawlDelay = 0 nextFetchTime = 1322129777960 now = 1322129774091 0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 4000 minCrawlDelay = 0 nextFetchTime = 1322129777960 now = 1322129775092 0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 4000 minCrawlDelay = 0 nextFetchTime = 1322129777960 now = 1322129776092 0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 4000 minCrawlDelay = 0 nextFetchTime = 1322129777960 now = 1322129777093 0. http://www.apache.org/foundation/sponsorship.html
fetching http://www.apache.org/foundation/sponsorship.html
-finishing thread FetcherThread, activeThreads=9
-activeThreads=9, spinWaiting=6, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-11-24 15:46:23, elapsed: 00:00:11
ParseSegment: starting at 2011-11-24 15:46:23
ParseSegment: segment: crawl/segments/20111124154611
ParseSegment: finished at 2011-11-24 15:46:25, elapsed: 00:00:01
CrawlDb update: starting at 2011-11-24 15:46:25
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20111124154611]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-11-24 15:46:26, elapsed: 00:00:01
LinkDb: starting at 2011-11-24 15:46:26
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154548
LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154611
LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124152057
LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154415
LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154519
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124152057/parse_data
Input path does not exist: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154415/parse_data
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
	at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$ vi urls/nutch
manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$
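One thing that stands out in the log is the repeated warning about 'http.agent.name'. From what I understand, the agent name configured in conf/nutch-site.xml should also appear first in the 'http.robots.agents' list. A sketch of the relevant properties, where "mycrawler" is only a placeholder for whatever agent name is actually configured:

```xml
<!-- conf/nutch-site.xml (sketch; "mycrawler" is a placeholder agent name) -->
<property>
  <name>http.agent.name</name>
  <value>mycrawler</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>mycrawler,*</value>
</property>
```

That should silence the warning, though it would not by itself explain the LinkDb failure at the end.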
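The final exception says parse_data does not exist under segments 20111124152057 and 20111124154415, which look like leftovers from earlier interrupted runs; LinkDb apparently tries to invert every directory it finds under crawl/segments. One way to recover (a sketch, not a Nutch-provided tool) is to delete any segment directory that is missing parse_data before re-running the crawl:

```shell
# Sketch: remove segment directories left over from interrupted runs,
# i.e. those missing the parse_data subdirectory that LinkDb expects.
clean_stale_segments() {
  for seg in "$1"/*; do
    [ -d "$seg" ] || continue            # skip if the glob matched nothing
    if [ ! -d "$seg/parse_data" ]; then
      echo "removing stale segment: $seg"
      rm -rf "$seg"
    fi
  done
}

# Run from the Nutch runtime/local directory, e.g.:
# clean_stale_segments crawl/segments
```

Alternatively, deleting the whole crawl directory and starting a fresh crawl would also clear the stale segments.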
--
Thanks & Regards,
Manoj
India Office: 022 27565303/4/5 Ext: 313
Mobile: +919323582145
http://twitter.com/aapkamanoj , http://aapkamanoj.blogspot.com/

