Sorry, here is the link: http://wiki.apache.org/nutch/NutchTutorial
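For reference, the step-by-step sequence from that tutorial looks roughly like
the sketch below, assuming the same urls/ and crawl/ directories, topN and Solr
URL as in your run (the segment name is just whatever the generate step
produces). Running the indexing step on its own this way usually surfaces the
underlying Solr error in logs/hadoop.log, which the generic "Job failed!"
message hides:

  # inject the seed URLs into the crawldb
  bin/nutch inject crawl/crawldb urls

  # one generate / fetch / parse / updatedb round
  bin/nutch generate crawl/crawldb crawl/segments -topN 5
  s1=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s1
  bin/nutch parse $s1
  bin/nutch updatedb crawl/crawldb $s1

  # build the link database and send the documents to Solr
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://localhost:8080/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*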


On Sat, Apr 20, 2013 at 4:11 PM, kiran chitturi
<[email protected]> wrote:

> Also, please go through the tutorial here [1]. We have updated it with more
> details on the commands and the overall workflow.
>
>
> [1]
>
>
> On Sat, Apr 20, 2013 at 3:00 PM, micklai <[email protected]> wrote:
>
>> Hi,
>>
>> Env:
>> System ubuntu 12.04
>> Tomcat: 7.0.39
>> Solr: 3.6.2
>> Nutch: 1.6
>>
>> I am trying to deploy Nutch 1.6 with Solr 3.6.2, but it fails when running
>> the command below:
>> bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 5
>>
>> For the details:
>> =============================================
>> crawl started in: crawl
>> rootUrlDir = urls
>> threads = 5
>> depth = 2
>> solrUrl=http://localhost:8080/solr/
>> topN = 5
>> Injector: starting at 2013-04-21 02:21:12
>> Injector: crawlDb: crawl/crawldb
>> Injector: urlDir: urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: total number of urls rejected by filters: 0
>> Injector: total number of urls injected after normalization and
>> filtering: 1
>> Injector: Merging injected urls into crawl db.
>> Injector: finished at 2013-04-21 02:21:27, elapsed: 00:00:14
>> Generator: starting at 2013-04-21 02:21:27
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: filtering: true
>> Generator: normalizing: true
>> Generator: topN: 5
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls for politeness.
>> Generator: segment: crawl/segments/20130421022135
>> Generator: finished at 2013-04-21 02:21:42, elapsed: 00:00:15
>> Fetcher: Your 'http.agent.name' value should be listed first in
>> 'http.robots.agents' property.
>> Fetcher: starting at 2013-04-21 02:21:42
>> Fetcher: segment: crawl/segments/20130421022135
>> Using queue mode : byHost
>> Fetcher: threads: 5
>> Fetcher: time-out divisor: 2
>> QueueFeeder finished: total 1 records + hit by time limit :0
>> Using queue mode : byHost
>> Using queue mode : byHost
>> Using queue mode : byHost
>> Using queue mode : byHost
>> Using queue mode : byHost
>> Fetcher: throughput threshold: -1
>> Fetcher: throughput threshold retries: 5
>> fetching http://www.163.com/
>> -finishing thread FetcherThread, activeThreads=4
>> -finishing thread FetcherThread, activeThreads=3
>> -finishing thread FetcherThread, activeThreads=2
>> -finishing thread FetcherThread, activeThreads=1
>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> -finishing thread FetcherThread, activeThreads=0
>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> -activeThreads=0
>> Fetcher: finished at 2013-04-21 02:21:49, elapsed: 00:00:07
>> ParseSegment: starting at 2013-04-21 02:21:49
>> ParseSegment: segment: crawl/segments/20130421022135
>> Parsed (24ms):http://www.163.com/
>> ParseSegment: finished at 2013-04-21 02:21:56, elapsed: 00:00:07
>> CrawlDb update: starting at 2013-04-21 02:21:56
>> CrawlDb update: db: crawl/crawldb
>> CrawlDb update: segments: [crawl/segments/20130421022135]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: 404 purging: false
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: finished at 2013-04-21 02:22:09, elapsed: 00:00:13
>> Generator: starting at 2013-04-21 02:22:09
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: filtering: true
>> Generator: normalizing: true
>> Generator: topN: 5
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls for politeness.
>> Generator: segment: crawl/segments/20130421022217
>> Generator: finished at 2013-04-21 02:22:25, elapsed: 00:00:15
>> Fetcher: Your 'http.agent.name' value should be listed first in
>> 'http.robots.agents' property.
>> Fetcher: starting at 2013-04-21 02:22:25
>> Fetcher: segment: crawl/segments/20130421022217
>> Using queue mode : byHost
>> Fetcher: threads: 5
>> Fetcher: time-out divisor: 2
>> QueueFeeder finished: total 5 records + hit by time limit :0
>> Using queue mode : byHost
>> Using queue mode : byHost
>> fetching http://m.163.com/
>> Using queue mode : byHost
>> Using queue mode : byHost
>> fetching http://3g.163.com/links/4145
>> fetching http://music.163.com/
>> Using queue mode : byHost
>> Fetcher: throughput threshold: -1
>> Fetcher: throughput threshold retries: 5
>> fetching http://caipiao.163.com/mobile/client_cp.jsp
>> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
>> * queue: http://m.163.com
>>   maxThreads    = 1
>>   inProgress    = 0
>>   crawlDelay    = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1366482150482
>>   now           = 1366482146346
>>   0. http://m.163.com/newsapp/
>> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
>> * queue: http://m.163.com
>>   maxThreads    = 1
>>   inProgress    = 0
>>   crawlDelay    = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1366482150482
>>   now           = 1366482147348
>>   0. http://m.163.com/newsapp/
>> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
>> * queue: http://m.163.com
>>   maxThreads    = 1
>>   inProgress    = 0
>>   crawlDelay    = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1366482150482
>>   now           = 1366482148350
>>   0. http://m.163.com/newsapp/
>> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
>> * queue: http://m.163.com
>>   maxThreads    = 1
>>   inProgress    = 0
>>   crawlDelay    = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1366482150482
>>   now           = 1366482149352
>>   0. http://m.163.com/newsapp/
>> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
>> * queue: http://m.163.com
>>   maxThreads    = 1
>>   inProgress    = 0
>>   crawlDelay    = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1366482150482
>>   now           = 1366482150354
>>   0. http://m.163.com/newsapp/
>> fetching http://m.163.com/newsapp/
>> -finishing thread FetcherThread, activeThreads=4
>> -finishing thread FetcherThread, activeThreads=3
>> -finishing thread FetcherThread, activeThreads=2
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=0
>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> -activeThreads=0
>> Fetcher: finished at 2013-04-21 02:22:38, elapsed: 00:00:13
>> ParseSegment: starting at 2013-04-21 02:22:38
>> ParseSegment: segment: crawl/segments/20130421022217
>> Parsed (4ms):http://caipiao.163.com/mobile/client_cp.jsp
>> Parsed (3ms):http://m.163.com/newsapp/
>> Parsed (1ms):http://music.163.com/
>> ParseSegment: finished at 2013-04-21 02:22:45, elapsed: 00:00:07
>> CrawlDb update: starting at 2013-04-21 02:22:45
>> CrawlDb update: db: crawl/crawldb
>> CrawlDb update: segments: [crawl/segments/20130421022217]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: 404 purging: false
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: finished at 2013-04-21 02:22:58, elapsed: 00:00:13
>> LinkDb: starting at 2013-04-21 02:22:58
>> LinkDb: linkdb: crawl/linkdb
>> LinkDb: URL normalize: true
>> LinkDb: URL filter: true
>> LinkDb: internal links will be ignored.
>> LinkDb: adding segment:
>> file:/home/lailx/search_engine/nutch/crawl/segments/20130421022135
>> LinkDb: adding segment:
>> file:/home/lailx/search_engine/nutch/crawl/segments/20130421022217
>> LinkDb: finished at 2013-04-21 02:23:08, elapsed: 00:00:10
>> SolrIndexer: starting at 2013-04-21 02:23:08
>> *SolrIndexer: deleting gone documents: false
>> SolrIndexer: URL filtering: false
>> SolrIndexer: URL normalizing: false*
>> Indexing 4 documents
>> *java.io.IOException: Job failed!*
>> SolrDeleteDuplicates: starting at 2013-04-21 02:23:39
>> SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/
>> *Exception in thread "main" java.io.IOException: Job failed!*
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>>         at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>> =============================================
>> However, Nutch itself works fine when I run it without Solr indexing, using the command below:
>> bin/nutch crawl urls -dir crawl -depth 2 -topN 5
>>
>> I have also modified [Solr_Home]/conf/schema.xml to integrate Nutch with
>> Solr, and Solr itself works fine when I access "localhost:8080/solr".
>>
>> I hope you can help with this problem; I look forward to your reply.
>> Thanks.
>>
>> Br,
>> Mick
>>
>>
>>
>>
>
>
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>
>
>
>


-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>
