Hi,

Environment:
  System: Ubuntu 12.04
  Tomcat: 7.0.39
  Solr:   3.6.2
  Nutch:  1.6
I am trying to deploy Nutch 1.6 with Solr 3.6.2, but the crawl fails when I run the command below:

bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 5

Here are the details:
=============================================
crawl started in: crawl
rootUrlDir = urls
threads = 5
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 5
Injector: starting at 2013-04-21 02:21:12
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-04-21 02:21:27, elapsed: 00:00:14
Generator: starting at 2013-04-21 02:21:27
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130421022135
Generator: finished at 2013-04-21 02:21:42, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-04-21 02:21:42
Fetcher: segment: crawl/segments/20130421022135
Using queue mode : byHost
Fetcher: threads: 5
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.163.com/
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-04-21 02:21:49, elapsed: 00:00:07
ParseSegment: starting at 2013-04-21 02:21:49
ParseSegment: segment: crawl/segments/20130421022135
Parsed (24ms):http://www.163.com/
ParseSegment: finished at 2013-04-21 02:21:56, elapsed: 00:00:07
CrawlDb update: starting at 2013-04-21 02:21:56
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130421022135]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-04-21 02:22:09, elapsed: 00:00:13
Generator: starting at 2013-04-21 02:22:09
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130421022217
Generator: finished at 2013-04-21 02:22:25, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-04-21 02:22:25
Fetcher: segment: crawl/segments/20130421022217
Using queue mode : byHost
Fetcher: threads: 5
Fetcher: time-out divisor: 2
QueueFeeder finished: total 5 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://m.163.com/
Using queue mode : byHost
Using queue mode : byHost
fetching http://3g.163.com/links/4145
fetching http://music.163.com/
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://caipiao.163.com/mobile/client_cp.jsp
-activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
* queue: http://m.163.com
  maxThreads = 1
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1366482150482
  now = 1366482146346
  0. http://m.163.com/newsapp/
-activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
* queue: http://m.163.com
  maxThreads = 1
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1366482150482
  now = 1366482147348
  0. http://m.163.com/newsapp/
-activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
* queue: http://m.163.com
  maxThreads = 1
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1366482150482
  now = 1366482148350
  0. http://m.163.com/newsapp/
-activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
* queue: http://m.163.com
  maxThreads = 1
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1366482150482
  now = 1366482149352
  0. http://m.163.com/newsapp/
-activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
* queue: http://m.163.com
  maxThreads = 1
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1366482150482
  now = 1366482150354
  0. http://m.163.com/newsapp/
fetching http://m.163.com/newsapp/
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-04-21 02:22:38, elapsed: 00:00:13
ParseSegment: starting at 2013-04-21 02:22:38
ParseSegment: segment: crawl/segments/20130421022217
Parsed (4ms):http://caipiao.163.com/mobile/client_cp.jsp
Parsed (3ms):http://m.163.com/newsapp/
Parsed (1ms):http://music.163.com/
ParseSegment: finished at 2013-04-21 02:22:45, elapsed: 00:00:07
CrawlDb update: starting at 2013-04-21 02:22:45
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130421022217]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-04-21 02:22:58, elapsed: 00:00:13
LinkDb: starting at 2013-04-21 02:22:58
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/lailx/search_engine/nutch/crawl/segments/20130421022135
LinkDb: adding segment: file:/home/lailx/search_engine/nutch/crawl/segments/20130421022217
LinkDb: finished at 2013-04-21 02:23:08, elapsed: 00:00:10
SolrIndexer: starting at 2013-04-21 02:23:08
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
Indexing 4 documents
java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2013-04-21 02:23:39
SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
=============================================

Nutch itself works well: the same crawl succeeds when I leave out the Solr step and run the command below:

bin/nutch crawl urls -dir crawl -depth 2 -topN 5

I have also modified [Solr_Home]/conf/schema.xml to integrate Nutch with Solr (see the sketches in the P.S. below), and Solr itself works fine when I access "localhost:8080/solr".

I hope you can help with this problem; I look forward to your reply. Thanks.

Br,
Mick
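P.S. In case the fetcher warning above is relevant: 'http.agent.name' and 'http.robots.agents' are typically overridden in conf/nutch-site.xml. A minimal sketch of how they can be set so the agent name is listed first, as the warning asks (the agent name "mynutchcrawler" here is just a placeholder, not my real value):

  <?xml version="1.0"?>
  <configuration>
    <!-- Identifies the crawler in HTTP requests; must be non-empty. -->
    <property>
      <name>http.agent.name</name>
      <value>mynutchcrawler</value>
    </property>
    <!-- The warning asks for the agent name to appear first in this list. -->
    <property>
      <name>http.robots.agents</name>
      <value>mynutchcrawler,*</value>
    </property>
  </configuration>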

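P.P.S. The Nutch-related fields I merged into [Solr_Home]/conf/schema.xml follow the schema.xml that ships in Nutch's conf directory. This is only a shortened, illustrative sketch (types and stored/indexed flags abbreviated; the schema.xml bundled with Nutch 1.6 is the authoritative version):

  <fields>
    <!-- Fields Nutch's SolrIndexer writes; see Nutch's conf/schema.xml for the full list. -->
    <field name="url" type="string" stored="true" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="anchor" type="text" stored="true" indexed="true" multiValued="true"/>
    <field name="segment" type="string" stored="true" indexed="false"/>
    <field name="digest" type="string" stored="true" indexed="false"/>
    <field name="boost" type="float" stored="true" indexed="false"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>
  </fields>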
