Hi,

Environment:
  System: Ubuntu 12.04
  Tomcat: 7.0.39
  Solr:   3.6.2
  Nutch:  1.6
I am trying to deploy Nutch 1.6 with Solr 3.6.2, but the crawl fails when I run the command below:

bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 5

Here are the details:
=============================================
crawl started in: crawl
rootUrlDir = urls
threads = 5
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 5
Injector: starting at 2013-04-21 02:21:12
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-04-21 02:21:27, elapsed: 00:00:14
Generator: starting at 2013-04-21 02:21:27
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130421022135
Generator: finished at 2013-04-21 02:21:42, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-04-21 02:21:42
Fetcher: segment: crawl/segments/20130421022135
Using queue mode : byHost
Fetcher: threads: 5
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.163.com/
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-04-21 02:21:49, elapsed: 00:00:07
ParseSegment: starting at 2013-04-21 02:21:49
ParseSegment: segment: crawl/segments/20130421022135
Parsed (24ms):http://www.163.com/
ParseSegment: finished at 2013-04-21 02:21:56, elapsed: 00:00:07
CrawlDb update: starting at 2013-04-21 02:21:56
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130421022135]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-04-21 02:22:09, elapsed: 00:00:13
Generator: starting at 2013-04-21 02:22:09
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130421022217
Generator: finished at 2013-04-21 02:22:25, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-04-21 02:22:25
Fetcher: segment: crawl/segments/20130421022217
Using queue mode : byHost
Fetcher: threads: 5
Fetcher: time-out divisor: 2
QueueFeeder finished: total 5 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://m.163.com/
Using queue mode : byHost
Using queue mode : byHost
fetching http://3g.163.com/links/4145
fetching http://music.163.com/
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://caipiao.163.com/mobile/client_cp.jsp
-activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
* queue: http://m.163.com
  maxThreads = 1
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1366482150482
  now = 1366482146346
  0. http://m.163.com/newsapp/
-activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
* queue: http://m.163.com
  maxThreads = 1
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1366482150482
  now = 1366482147348
  0. http://m.163.com/newsapp/
-activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
* queue: http://m.163.com
  maxThreads = 1
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1366482150482
  now = 1366482148350
  0. http://m.163.com/newsapp/
-activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
* queue: http://m.163.com
  maxThreads = 1
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1366482150482
  now = 1366482149352
  0. http://m.163.com/newsapp/
-activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
* queue: http://m.163.com
  maxThreads = 1
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1366482150482
  now = 1366482150354
  0. http://m.163.com/newsapp/
fetching http://m.163.com/newsapp/
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-04-21 02:22:38, elapsed: 00:00:13
ParseSegment: starting at 2013-04-21 02:22:38
ParseSegment: segment: crawl/segments/20130421022217
Parsed (4ms):http://caipiao.163.com/mobile/client_cp.jsp
Parsed (3ms):http://m.163.com/newsapp/
Parsed (1ms):http://music.163.com/
ParseSegment: finished at 2013-04-21 02:22:45, elapsed: 00:00:07
CrawlDb update: starting at 2013-04-21 02:22:45
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130421022217]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-04-21 02:22:58, elapsed: 00:00:13
LinkDb: starting at 2013-04-21 02:22:58
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/lailx/search_engine/nutch/crawl/segments/20130421022135
LinkDb: adding segment: file:/home/lailx/search_engine/nutch/crawl/segments/20130421022217
LinkDb: finished at 2013-04-21 02:23:08, elapsed: 00:00:10
SolrIndexer: starting at 2013-04-21 02:23:08
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
Indexing 4 documents
java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2013-04-21 02:23:39
SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
=============================================

Nutch itself works well: the same crawl succeeds when I leave out the Solr step and run the command below:

bin/nutch crawl urls -dir crawl -depth 2 -topN 5

I have also modified [Solr_Home]/conf/schema.xml to integrate Nutch with Solr (see the sketches in the P.S. below), and Solr itself works fine when I access "localhost:8080/solr".

I hope you can help with this problem; I look forward to your reply. Thanks.

Br,
Mick
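P.S. In case the fetcher warning above is relevant: 'http.agent.name' and 'http.robots.agents' are typically overridden in conf/nutch-site.xml. A minimal sketch of how they can be set so the agent name is listed first, as the warning asks (the agent name "mynutchcrawler" here is just a placeholder, not my real value):

  <?xml version="1.0"?>
  <configuration>
    <!-- Identifies the crawler in HTTP requests; must be non-empty. -->
    <property>
      <name>http.agent.name</name>
      <value>mynutchcrawler</value>
    </property>
    <!-- The warning asks for the agent name to appear first in this list. -->
    <property>
      <name>http.robots.agents</name>
      <value>mynutchcrawler,*</value>
    </property>
  </configuration>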

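P.P.S. The Nutch-related fields I merged into [Solr_Home]/conf/schema.xml follow the schema.xml that ships in Nutch's conf directory. This is only a shortened, illustrative sketch (types and stored/indexed flags abbreviated; the schema.xml bundled with Nutch 1.6 is the authoritative version):

  <fields>
    <!-- Fields Nutch's SolrIndexer writes; see Nutch's conf/schema.xml for the full list. -->
    <field name="url" type="string" stored="true" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="anchor" type="text" stored="true" indexed="true" multiValued="true"/>
    <field name="segment" type="string" stored="true" indexed="false"/>
    <field name="digest" type="string" stored="true" indexed="false"/>
    <field name="boost" type="float" stored="true" indexed="false"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>
  </fields>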
