Hi Manoj,

There should be piles of replies to this on the user@ archives.

Please have a look at your nutch-site.xml properties, in particular
'http.agent.name' (the fetcher warning in your log suggests it is not set
up correctly). The LinkDb failure itself comes from two older segments
(20111124152057 and 20111124154415) left behind by earlier runs that never
reached the parse step, so they have no parse_data directory. Remove those
stale segments from crawl/segments (or crawl into a fresh -dir) and re-run.
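A minimal conf/nutch-site.xml sketch covering the two properties flagged in the log; the agent name "MyNutchSpider" is a placeholder, not a value from your setup:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Sketch of conf/nutch-site.xml; "MyNutchSpider" is a placeholder. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchSpider</value>
  </property>
  <property>
    <!-- Per the fetcher warning, the agent name should be listed first. -->
    <name>http.robots.agents</name>
    <value>MyNutchSpider,*</value>
  </property>
</configuration>
```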

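To spot the leftovers that trip up LinkDb, something like this quick sketch (run from runtime/local; the crawl/segments path is assumed from your log) lists segments that never produced parse_data:

```shell
# List segments that never produced parse_data (stale from an aborted run).
# Run from the Nutch runtime/local directory; adjust SEGMENTS if needed.
SEGMENTS="${SEGMENTS:-crawl/segments}"
for seg in "$SEGMENTS"/*/; do
  if [ ! -d "${seg}parse_data" ]; then
    echo "stale segment: ${seg%/}"
  fi
done
```

Delete the segments it prints (or start over with a fresh -dir), then re-run bin/nutch crawl.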
On Thu, Nov 24, 2011 at 10:24 AM, Manoj <[email protected]> wrote:

> Hi
> I am facing a problem with Apache Nutch 1.3. The output is given below.
> Please help. Thanks in advance.
>
> manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$ bin/nutch crawl
> urls -dir crawl -depth 3 -topN 5
> solrUrl is not set, indexing will be skipped...
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> solrUrl=null
> topN = 5
> Injector: starting at 2011-11-24 15:45:15
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-11-24 15:45:17, elapsed: 00:00:02
> Generator: starting at 2011-11-24 15:45:17
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20111124154519
> Generator: finished at 2011-11-24 15:45:21, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-11-24 15:45:21
> Fetcher: segment: crawl/segments/20111124154519
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching http://nutch.apache.org/
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-11-24 15:45:43, elapsed: 00:00:22
> ParseSegment: starting at 2011-11-24 15:45:43
> ParseSegment: segment: crawl/segments/20111124154519
> ParseSegment: finished at 2011-11-24 15:45:44, elapsed: 00:00:01
> CrawlDb update: starting at 2011-11-24 15:45:44
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20111124154519]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-11-24 15:45:46, elapsed: 00:00:01
> Generator: starting at 2011-11-24 15:45:46
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20111124154548
> Generator: finished at 2011-11-24 15:45:49, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-11-24 15:45:49
> Fetcher: segment: crawl/segments/20111124154548
> Fetcher: threads: 10
> QueueFeeder finished: total 5 records + hit by time limit :0
> fetching http://nutch.apache.org/wiki.html
> fetching http://www.apache.org/
> fetching http://www.eu.apachecon.com/c/aceu2009/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129755709
>  now           = 1322129751077
>  0. http://nutch.apache.org/mailing_lists.html
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 1
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129750061
>  now           = 1322129751078
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129755709
>  now           = 1322129752078
>  0. http://nutch.apache.org/mailing_lists.html
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 1
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129750061
>  now           = 1322129752079
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129755709
>  now           = 1322129753080
>  0. http://nutch.apache.org/mailing_lists.html
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 1
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129750061
>  now           = 1322129753080
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129755709
>  now           = 1322129754081
>  0. http://nutch.apache.org/mailing_lists.html
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 1
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129750061
>  now           = 1322129754081
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129755709
>  now           = 1322129755083
>  0. http://nutch.apache.org/mailing_lists.html
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 1
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129750061
>  now           = 1322129755083
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> fetching http://nutch.apache.org/mailing_lists.html
> -activeThreads=10, spinWaiting=7, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 1
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129750061
>  now           = 1322129756083
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 1
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129750061
>  now           = 1322129757084
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 1
>  crawlDelay    = 5000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129750061
>  now           = 1322129758085
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> fetch of http://www.eu.apachecon.com/c/aceu2009/ failed with:
> java.net.UnknownHostException: www.eu.apachecon.com
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 1
>  crawlDelay    = 4000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129750061
>  now           = 1322129759085
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 4000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129764028
>  now           = 1322129760086
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 4000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129764028
>  now           = 1322129761086
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 4000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129764028
>  now           = 1322129762087
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 4000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129764028
>  now           = 1322129763088
>  0. http://www.apache.org/dyn/closer.cgi/nutch/
> fetching http://www.apache.org/dyn/closer.cgi/nutch/
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -activeThreads=3, spinWaiting=2, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-11-24 15:46:06, elapsed: 00:00:17
> ParseSegment: starting at 2011-11-24 15:46:06
> ParseSegment: segment: crawl/segments/20111124154548
> ParseSegment: finished at 2011-11-24 15:46:08, elapsed: 00:00:01
> CrawlDb update: starting at 2011-11-24 15:46:08
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20111124154548]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-11-24 15:46:09, elapsed: 00:00:01
> Generator: starting at 2011-11-24 15:46:09
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20111124154611
> Generator: finished at 2011-11-24 15:46:12, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-11-24 15:46:12
> Fetcher: segment: crawl/segments/20111124154611
> Fetcher: threads: 10
> fetching http://hadoop.apache.org/
> fetching http://nutch.apache.org/index.html
> fetching http://www.apache.org/licenses/
> fetching http://forrest.apache.org/
> QueueFeeder finished: total 5 records + hit by time limit :0
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 4000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129777960
>  now           = 1322129774091
>  0. http://www.apache.org/foundation/sponsorship.html
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 4000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129777960
>  now           = 1322129775092
>  0. http://www.apache.org/foundation/sponsorship.html
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 4000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129777960
>  now           = 1322129776092
>  0. http://www.apache.org/foundation/sponsorship.html
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
> * queue: http://www.apache.org
>  maxThreads    = 1
>  inProgress    = 0
>  crawlDelay    = 4000
>  minCrawlDelay = 0
>  nextFetchTime = 1322129777960
>  now           = 1322129777093
>  0. http://www.apache.org/foundation/sponsorship.html
> fetching http://www.apache.org/foundation/sponsorship.html
> -finishing thread FetcherThread, activeThreads=9
> -activeThreads=9, spinWaiting=6, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-11-24 15:46:23, elapsed: 00:00:11
> ParseSegment: starting at 2011-11-24 15:46:23
> ParseSegment: segment: crawl/segments/20111124154611
> ParseSegment: finished at 2011-11-24 15:46:25, elapsed: 00:00:01
> CrawlDb update: starting at 2011-11-24 15:46:25
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20111124154611]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-11-24 15:46:26, elapsed: 00:00:01
> LinkDb: starting at 2011-11-24 15:46:26
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
>
> file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154548
> LinkDb: adding segment:
>
> file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154611
> LinkDb: adding segment:
>
> file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124152057
> LinkDb: adding segment:
>
> file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154415
> LinkDb: adding segment:
>
> file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154519
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
>
> file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124152057/parse_data
> Input path does not exist:
>
> file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154415/parse_data
>    at
>
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>    at
>
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>    at
>
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>    at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>    at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$ vi urls/nutch
> manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$
>
>
>
> --
> Thanks & Regards
>
> Manoj
>
>
> India
>
> Office :  022 27565303/4/5  Ext: 313
>
> Mobile : +919323582145
> http://twitter.com/aapkamanoj ,  http://aapkamanoj.blogspot.com/
>



-- 
Lewis