Hi Mick,

You should check the logs for more information; they are in 'logs/hadoop.log'.
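The console output only tells you that the MapReduce job failed; the actual stack trace from the indexer usually ends up in that log. Assuming you are running from your Nutch runtime directory, something like this should surface it:

    # show the most recent entries in the Nutch/Hadoop log
    tail -n 200 logs/hadoop.log

    # or pull out the exception that failed the job, with some context
    grep -A 20 "Exception" logs/hadoop.log | less

Since Solr is running inside Tomcat here, it is also worth checking Tomcat's log (typically logs/catalina.out under the Tomcat installation) for the Solr-side error that matches the failed request.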
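Since the same crawl works without -solr, the failure is on the indexing side. I would run the indexing and dedup steps by hand against the segments you already have, so you can see which of the two jobs actually fails. A sketch, assuming the crawl/ layout from your output:

    # index the already-crawled segments into Solr on their own
    bin/nutch solrindex http://localhost:8080/solr/ crawl/crawldb \
        -linkdb crawl/linkdb crawl/segments/*

    # then run the dedup job separately
    bin/nutch solrdedup http://localhost:8080/solr/

Also compare your modified schema.xml against the schema.xml shipped in Nutch's conf/ directory: SolrDeleteDuplicates reads the 'id', 'digest', 'boost' and 'tstamp' fields back from Solr, and a missing or unstored field there is a common cause of exactly this kind of "Job failed!".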
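One side note: the warning about 'http.agent.name' not being listed first in 'http.robots.agents' is unrelated to the failure, but easy to silence. In conf/nutch-site.xml, list your agent name first (using 'mycrawler' below as a placeholder for whatever you set as http.agent.name):

    <property>
      <name>http.agent.name</name>
      <value>mycrawler</value>
    </property>
    <property>
      <name>http.robots.agents</name>
      <value>mycrawler,*</value>
    </property>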
On Sat, Apr 20, 2013 at 3:00 PM, micklai <[email protected]> wrote:

> Hi,
>
> Env:
> System: Ubuntu 12.04
> Tomcat: 7.0.39
> Solr: 3.6.2
> Nutch: 1.6
>
> I tried to deploy Nutch 1.6 with Solr 3.6.2, but it failed when running
> the command below:
>
> bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 5
>
> The details:
> =============================================
> crawl started in: crawl
> rootUrlDir = urls
> threads = 5
> depth = 2
> solrUrl=http://localhost:8080/solr/
> topN = 5
> Injector: starting at 2013-04-21 02:21:12
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 1
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2013-04-21 02:21:27, elapsed: 00:00:14
> Generator: starting at 2013-04-21 02:21:27
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20130421022135
> Generator: finished at 2013-04-21 02:21:42, elapsed: 00:00:15
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2013-04-21 02:21:42
> Fetcher: segment: crawl/segments/20130421022135
> Using queue mode : byHost
> Fetcher: threads: 5
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> fetching http://www.163.com/
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2013-04-21 02:21:49, elapsed: 00:00:07
> ParseSegment: starting at 2013-04-21 02:21:49
> ParseSegment: segment: crawl/segments/20130421022135
> Parsed (24ms):http://www.163.com/
> ParseSegment: finished at 2013-04-21 02:21:56, elapsed: 00:00:07
> CrawlDb update: starting at 2013-04-21 02:21:56
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20130421022135]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2013-04-21 02:22:09, elapsed: 00:00:13
> Generator: starting at 2013-04-21 02:22:09
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20130421022217
> Generator: finished at 2013-04-21 02:22:25, elapsed: 00:00:15
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2013-04-21 02:22:25
> Fetcher: segment: crawl/segments/20130421022217
> Using queue mode : byHost
> Fetcher: threads: 5
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 5 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://m.163.com/
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://3g.163.com/links/4145
> fetching http://music.163.com/
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> fetching http://caipiao.163.com/mobile/client_cp.jsp
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
> * queue: http://m.163.com
>   maxThreads = 1
>   inProgress = 0
>   crawlDelay = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1366482150482
>   now = 1366482146346
>   0. http://m.163.com/newsapp/
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
> * queue: http://m.163.com
>   maxThreads = 1
>   inProgress = 0
>   crawlDelay = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1366482150482
>   now = 1366482147348
>   0. http://m.163.com/newsapp/
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
> * queue: http://m.163.com
>   maxThreads = 1
>   inProgress = 0
>   crawlDelay = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1366482150482
>   now = 1366482148350
>   0. http://m.163.com/newsapp/
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
> * queue: http://m.163.com
>   maxThreads = 1
>   inProgress = 0
>   crawlDelay = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1366482150482
>   now = 1366482149352
>   0. http://m.163.com/newsapp/
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
> * queue: http://m.163.com
>   maxThreads = 1
>   inProgress = 0
>   crawlDelay = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1366482150482
>   now = 1366482150354
>   0. http://m.163.com/newsapp/
> fetching http://m.163.com/newsapp/
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2013-04-21 02:22:38, elapsed: 00:00:13
> ParseSegment: starting at 2013-04-21 02:22:38
> ParseSegment: segment: crawl/segments/20130421022217
> Parsed (4ms):http://caipiao.163.com/mobile/client_cp.jsp
> Parsed (3ms):http://m.163.com/newsapp/
> Parsed (1ms):http://music.163.com/
> ParseSegment: finished at 2013-04-21 02:22:45, elapsed: 00:00:07
> CrawlDb update: starting at 2013-04-21 02:22:45
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20130421022217]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2013-04-21 02:22:58, elapsed: 00:00:13
> LinkDb: starting at 2013-04-21 02:22:58
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: internal links will be ignored.
> LinkDb: adding segment: file:/home/lailx/search_engine/nutch/crawl/segments/20130421022135
> LinkDb: adding segment: file:/home/lailx/search_engine/nutch/crawl/segments/20130421022217
> LinkDb: finished at 2013-04-21 02:23:08, elapsed: 00:00:10
> SolrIndexer: starting at 2013-04-21 02:23:08
> SolrIndexer: deleting gone documents: false
> SolrIndexer: URL filtering: false
> SolrIndexer: URL normalizing: false
> Indexing 4 documents
> java.io.IOException: Job failed!
> SolrDeleteDuplicates: starting at 2013-04-21 02:23:39
> SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>         at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> =============================================
>
> But Nutch itself works fine; the crawl succeeds with the command below:
>
> bin/nutch crawl urls -dir crawl -depth 2 -topN 5
>
> I have also modified [Solr_Home]/conf/schema.xml to integrate Nutch with
> Solr, and Solr itself works well when accessed at "localhost:8080/solr".
>
> I hope you can help with this problem; waiting for your reply.
> Thanks.
>
> Br,
> Mick
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Exception-in-thread-main-java-io-IOException-Job-failed-tp4057620.html
> Sent from the Nutch - User mailing list archive at Nabble.com.


--
Kiran Chitturi
<http://www.linkedin.com/in/kiranchitturi>

