Hi Mick,

You should check the logs for more information. The console only prints
"java.io.IOException: Job failed!"; the underlying exception from the
SolrIndexer/SolrDeleteDuplicates jobs is written to 'logs/hadoop.log',
relative to the directory you run Nutch from.
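
For example, something like this should surface the real error (a minimal
sketch, assuming you run it from your Nutch runtime directory; adjust the
path if your layout differs):

  # show the tail of the Hadoop log, where the failed job's stack trace lands
  tail -n 200 logs/hadoop.log

  # or jump straight to the exceptions around the SolrIndexer step
  grep -n -B 2 -A 20 'Exception' logs/hadoop.log | less

If the root cause is a SolrException about an unknown or missing field, that
usually points back at the schema.xml changes on the Solr side.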


On Sat, Apr 20, 2013 at 3:00 PM, micklai <[email protected]> wrote:

> Hi,
>
> Env:
> System: Ubuntu 12.04
> Tomcat: 7.0.39
> Solr: 3.6.2
> Nutch: 1.6
>
> I tried to deploy Nutch 1.6 with Solr 3.6.2, but it failed when I ran the
> command below:
> bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 5
>
> For the details:
> =============================================
> crawl started in: crawl
> rootUrlDir = urls
> threads = 5
> depth = 2
> solrUrl=http://localhost:8080/solr/
> topN = 5
> Injector: starting at 2013-04-21 02:21:12
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 1
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2013-04-21 02:21:27, elapsed: 00:00:14
> Generator: starting at 2013-04-21 02:21:27
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20130421022135
> Generator: finished at 2013-04-21 02:21:42, elapsed: 00:00:15
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2013-04-21 02:21:42
> Fetcher: segment: crawl/segments/20130421022135
> Using queue mode : byHost
> Fetcher: threads: 5
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> fetching http://www.163.com/
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2013-04-21 02:21:49, elapsed: 00:00:07
> ParseSegment: starting at 2013-04-21 02:21:49
> ParseSegment: segment: crawl/segments/20130421022135
> Parsed (24ms):http://www.163.com/
> ParseSegment: finished at 2013-04-21 02:21:56, elapsed: 00:00:07
> CrawlDb update: starting at 2013-04-21 02:21:56
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20130421022135]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2013-04-21 02:22:09, elapsed: 00:00:13
> Generator: starting at 2013-04-21 02:22:09
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20130421022217
> Generator: finished at 2013-04-21 02:22:25, elapsed: 00:00:15
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2013-04-21 02:22:25
> Fetcher: segment: crawl/segments/20130421022217
> Using queue mode : byHost
> Fetcher: threads: 5
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 5 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://m.163.com/
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://3g.163.com/links/4145
> fetching http://music.163.com/
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> fetching http://caipiao.163.com/mobile/client_cp.jsp
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
> * queue: http://m.163.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1366482150482
>   now           = 1366482146346
>   0. http://m.163.com/newsapp/
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
> * queue: http://m.163.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1366482150482
>   now           = 1366482147348
>   0. http://m.163.com/newsapp/
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
> * queue: http://m.163.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1366482150482
>   now           = 1366482148350
>   0. http://m.163.com/newsapp/
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
> * queue: http://m.163.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1366482150482
>   now           = 1366482149352
>   0. http://m.163.com/newsapp/
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
> * queue: http://m.163.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1366482150482
>   now           = 1366482150354
>   0. http://m.163.com/newsapp/
> fetching http://m.163.com/newsapp/
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2013-04-21 02:22:38, elapsed: 00:00:13
> ParseSegment: starting at 2013-04-21 02:22:38
> ParseSegment: segment: crawl/segments/20130421022217
> Parsed (4ms):http://caipiao.163.com/mobile/client_cp.jsp
> Parsed (3ms):http://m.163.com/newsapp/
> Parsed (1ms):http://music.163.com/
> ParseSegment: finished at 2013-04-21 02:22:45, elapsed: 00:00:07
> CrawlDb update: starting at 2013-04-21 02:22:45
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20130421022217]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2013-04-21 02:22:58, elapsed: 00:00:13
> LinkDb: starting at 2013-04-21 02:22:58
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: internal links will be ignored.
> LinkDb: adding segment: file:/home/lailx/search_engine/nutch/crawl/segments/20130421022135
> LinkDb: adding segment: file:/home/lailx/search_engine/nutch/crawl/segments/20130421022217
> LinkDb: finished at 2013-04-21 02:23:08, elapsed: 00:00:10
> SolrIndexer: starting at 2013-04-21 02:23:08
> SolrIndexer: deleting gone documents: false
> SolrIndexer: URL filtering: false
> SolrIndexer: URL normalizing: false
> Indexing 4 documents
> java.io.IOException: Job failed!
> SolrDeleteDuplicates: starting at 2013-04-21 02:23:39
> SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>         at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> =============================================
> However, Nutch itself works fine without Solr, using the command below:
> bin/nutch crawl urls -dir crawl -depth 2 -topN 5
>
> I have also modified [Solr_Home]/conf/schema.xml to integrate Nutch with
> Solr, and Solr itself works fine when accessed at "localhost:8080/solr".
>
> I hope you can help with this problem; I look forward to your reply.
> Thanks.
>
> Br,
> Mick
>



-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>
