Hi Manoj,

There should be plenty of replies to this in the user@ archives.
Please have a look at your nutch-site.xml properties, and specifically at parse_data: the exception at the end of your output reports that segments 20111124152057 and 20111124154415 have no parse_data directory. Neither of them was generated by this run, which created 20111124154519, 20111124154548 and 20111124154611.

On Thu, Nov 24, 2011 at 10:24 AM, मनोज <Manoj> wrote:

> Hi
> I am facing a problem with Apache Nutch 1.3. Output is as given below.
> Please help. Thanks in advance.
>
> manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
> solrUrl is not set, indexing will be skipped...
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> solrUrl=null
> topN = 5
> Injector: starting at 2011-11-24 15:45:15
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-11-24 15:45:17, elapsed: 00:00:02
> Generator: starting at 2011-11-24 15:45:17
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20111124154519
> Generator: finished at 2011-11-24 15:45:21, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2011-11-24 15:45:21
> Fetcher: segment: crawl/segments/20111124154519
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching http://nutch.apache.org/
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-11-24 15:45:43, elapsed: 00:00:22
> ParseSegment: starting at 2011-11-24 15:45:43
> ParseSegment: segment: crawl/segments/20111124154519
> ParseSegment: finished at 2011-11-24 15:45:44, elapsed: 00:00:01
> CrawlDb update: starting at 2011-11-24 15:45:44
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20111124154519]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-11-24 15:45:46, elapsed: 00:00:01
> Generator: starting at 2011-11-24 15:45:46
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20111124154548
> Generator: finished at 2011-11-24 15:45:49, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2011-11-24 15:45:49
> Fetcher: segment: crawl/segments/20111124154548
> Fetcher: threads: 10
> QueueFeeder finished: total 5 records + hit by time limit :0
> fetching http://nutch.apache.org/wiki.html
> fetching http://www.apache.org/
> fetching http://www.eu.apachecon.com/c/aceu2009/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129755709
> now = 1322129751077
> 0. http://nutch.apache.org/mailing_lists.html
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129750061
> now = 1322129751078
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129755709
> now = 1322129752078
> 0. http://nutch.apache.org/mailing_lists.html
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129750061
> now = 1322129752079
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129755709
> now = 1322129753080
> 0. http://nutch.apache.org/mailing_lists.html
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129750061
> now = 1322129753080
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129755709
> now = 1322129754081
> 0. http://nutch.apache.org/mailing_lists.html
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129750061
> now = 1322129754081
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129755709
> now = 1322129755083
> 0. http://nutch.apache.org/mailing_lists.html
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129750061
> now = 1322129755083
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> fetching http://nutch.apache.org/mailing_lists.html
> -activeThreads=10, spinWaiting=7, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129750061
> now = 1322129756083
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129750061
> now = 1322129757084
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1322129750061
> now = 1322129758085
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> fetch of http://www.eu.apachecon.com/c/aceu2009/ failed with: java.net.UnknownHostException: www.eu.apachecon.com
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 1
> crawlDelay = 4000
> minCrawlDelay = 0
> nextFetchTime = 1322129750061
> now = 1322129759085
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 4000
> minCrawlDelay = 0
> nextFetchTime = 1322129764028
> now = 1322129760086
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 4000
> minCrawlDelay = 0
> nextFetchTime = 1322129764028
> now = 1322129761086
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 4000
> minCrawlDelay = 0
> nextFetchTime = 1322129764028
> now = 1322129762087
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 4000
> minCrawlDelay = 0
> nextFetchTime = 1322129764028
> now = 1322129763088
> 0. http://www.apache.org/dyn/closer.cgi/nutch/
> fetching http://www.apache.org/dyn/closer.cgi/nutch/
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -activeThreads=3, spinWaiting=2, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-11-24 15:46:06, elapsed: 00:00:17
> ParseSegment: starting at 2011-11-24 15:46:06
> ParseSegment: segment: crawl/segments/20111124154548
> ParseSegment: finished at 2011-11-24 15:46:08, elapsed: 00:00:01
> CrawlDb update: starting at 2011-11-24 15:46:08
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20111124154548]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-11-24 15:46:09, elapsed: 00:00:01
> Generator: starting at 2011-11-24 15:46:09
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20111124154611
> Generator: finished at 2011-11-24 15:46:12, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2011-11-24 15:46:12
> Fetcher: segment: crawl/segments/20111124154611
> Fetcher: threads: 10
> fetching http://hadoop.apache.org/
> fetching http://nutch.apache.org/index.html
> fetching http://www.apache.org/licenses/
> fetching http://forrest.apache.org/
> QueueFeeder finished: total 5 records + hit by time limit :0
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 4000
> minCrawlDelay = 0
> nextFetchTime = 1322129777960
> now = 1322129774091
> 0. http://www.apache.org/foundation/sponsorship.html
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 4000
> minCrawlDelay = 0
> nextFetchTime = 1322129777960
> now = 1322129775092
> 0. http://www.apache.org/foundation/sponsorship.html
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 4000
> minCrawlDelay = 0
> nextFetchTime = 1322129777960
> now = 1322129776092
> 0. http://www.apache.org/foundation/sponsorship.html
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 4000
> minCrawlDelay = 0
> nextFetchTime = 1322129777960
> now = 1322129777093
> 0. http://www.apache.org/foundation/sponsorship.html
> fetching http://www.apache.org/foundation/sponsorship.html
> -finishing thread FetcherThread, activeThreads=9
> -activeThreads=9, spinWaiting=6, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-11-24 15:46:23, elapsed: 00:00:11
> ParseSegment: starting at 2011-11-24 15:46:23
> ParseSegment: segment: crawl/segments/20111124154611
> ParseSegment: finished at 2011-11-24 15:46:25, elapsed: 00:00:01
> CrawlDb update: starting at 2011-11-24 15:46:25
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20111124154611]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-11-24 15:46:26, elapsed: 00:00:01
> LinkDb: starting at 2011-11-24 15:46:26
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154548
> LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154611
> LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124152057
> LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154415
> LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154519
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124152057/parse_data
> Input path does not exist: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154415/parse_data
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
> at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$ vi urls/nutch
> manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$
>
> --
> Thanks & Regards
>
> Manoj
>
> India
>
> Office : 022 27565303/4/5 Ext: 313
>
> Mobile : +919323582145
> http://twitter.com/aapkamanoj , http://aapkamanoj.blogspot.com/

--
*Lewis*
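
P.S. In case it helps, the InvalidInputException in the quoted output names two segment directories that have no parse_data subdirectory. Below is a minimal Python sketch of how one might list such segments before the LinkDb step runs. It assumes the crawl/segments layout visible in the log. The function name segments_missing_parse_data is hypothetical and not part of Nutch, and checking only for parse_data is my assumption, based on the path the exception complains about.

```python
from pathlib import Path


def segments_missing_parse_data(segments_dir):
    """Return the sorted names of segment directories without a parse_data subdirectory.

    Hypothetical helper, not part of Nutch. It only looks for the parse_data
    directory named in the InvalidInputException; it does not validate anything
    else about a segment.
    """
    root = Path(segments_dir)
    return sorted(
        seg.name
        for seg in root.iterdir()
        if seg.is_dir() and not (seg / "parse_data").is_dir()
    )
```

Calling segments_missing_parse_data("crawl/segments") from runtime/local should list any segment that LinkDb would fail on for this reason. What to do with those segments is a separate decision that the sketch does not make.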

