Sorry, here is the link: http://wiki.apache.org/nutch/NutchTutorial
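The tutorial now walks through the individual steps instead of the single crawl command. Roughly, the flow looks like this (the paths match the -dir crawl layout and Solr URL from your command, and -topN 5 is just your example value; if the usage differs in your version, running bin/nutch solrindex with no arguments prints it):

  # inject the seed list, then generate/fetch/parse one segment
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 5
  s1=`ls -d crawl/segments/2* | tail -1`   # pick up the newest segment
  bin/nutch fetch $s1
  bin/nutch parse $s1
  bin/nutch updatedb crawl/crawldb $s1
  # build the linkdb and push everything to Solr
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://localhost:8080/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

Running the steps one at a time makes it much easier to see which job is actually failing.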
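Also, your log below warns: "Your 'http.agent.name' value should be listed first in 'http.robots.agents' property." As far as I can tell that is unrelated to the Solr failure, but it is easy to fix in conf/nutch-site.xml. A minimal sketch, assuming your agent name is "mycrawler" (a placeholder; use whatever you actually set):

  <property>
    <name>http.agent.name</name>
    <value>mycrawler</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <!-- your agent name must come first, before the catch-all * -->
    <value>mycrawler,*</value>
  </property>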
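For the "Job failed!" itself, the console output hides the real exception; logs/hadoop.log should have the underlying stack trace. It may also help to re-run just the last two stages in isolation:

  # re-run indexing alone, then check logs/hadoop.log for the real error
  bin/nutch solrindex http://localhost:8080/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
  # dedup reads id/boost/tstamp/digest back from Solr, so (as far as I know)
  # those fields must exist and be stored in your Solr schema
  bin/nutch solrdedup http://localhost:8080/solr/

Since you edited [Solr_Home]/conf/schema.xml by hand, a mismatch between the fields Nutch expects (from its own conf/schema.xml) and what Solr actually has is often the cause of this kind of failure.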
On Sat, Apr 20, 2013 at 4:11 PM, kiran chitturi <[email protected]> wrote:

> Also, please go through the tutorial here [1]. We updated it with more
> info on the commands and everything.
>
> [1]
>
> On Sat, Apr 20, 2013 at 3:00 PM, micklai <[email protected]> wrote:
>
>> Hi,
>>
>> Env:
>> System: Ubuntu 12.04
>> Tomcat: 7.0.39
>> Solr: 3.6.2
>> Nutch: 1.6
>>
>> I am trying to deploy Nutch 1.6 with Solr 3.6.2, but it fails when
>> running the command below:
>>
>> bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 5
>>
>> The details:
>> =============================================
>> crawl started in: crawl
>> rootUrlDir = urls
>> threads = 5
>> depth = 2
>> solrUrl=http://localhost:8080/solr/
>> topN = 5
>> Injector: starting at 2013-04-21 02:21:12
>> Injector: crawlDb: crawl/crawldb
>> Injector: urlDir: urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: total number of urls rejected by filters: 0
>> Injector: total number of urls injected after normalization and filtering: 1
>> Injector: Merging injected urls into crawl db.
>> Injector: finished at 2013-04-21 02:21:27, elapsed: 00:00:14
>> Generator: starting at 2013-04-21 02:21:27
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: filtering: true
>> Generator: normalizing: true
>> Generator: topN: 5
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls for politeness.
>> Generator: segment: crawl/segments/20130421022135
>> Generator: finished at 2013-04-21 02:21:42, elapsed: 00:00:15
>> Fetcher: Your 'http.agent.name' value should be listed first in
>> 'http.robots.agents' property.
>> Fetcher: starting at 2013-04-21 02:21:42
>> Fetcher: segment: crawl/segments/20130421022135
>> Using queue mode : byHost
>> Fetcher: threads: 5
>> Fetcher: time-out divisor: 2
>> QueueFeeder finished: total 1 records + hit by time limit :0
>> Using queue mode : byHost
>> Using queue mode : byHost
>> Using queue mode : byHost
>> Using queue mode : byHost
>> Using queue mode : byHost
>> Fetcher: throughput threshold: -1
>> Fetcher: throughput threshold retries: 5
>> fetching http://www.163.com/
>> -finishing thread FetcherThread, activeThreads=4
>> -finishing thread FetcherThread, activeThreads=3
>> -finishing thread FetcherThread, activeThreads=2
>> -finishing thread FetcherThread, activeThreads=1
>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> -finishing thread FetcherThread, activeThreads=0
>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> -activeThreads=0
>> Fetcher: finished at 2013-04-21 02:21:49, elapsed: 00:00:07
>> ParseSegment: starting at 2013-04-21 02:21:49
>> ParseSegment: segment: crawl/segments/20130421022135
>> Parsed (24ms):http://www.163.com/
>> ParseSegment: finished at 2013-04-21 02:21:56, elapsed: 00:00:07
>> CrawlDb update: starting at 2013-04-21 02:21:56
>> CrawlDb update: db: crawl/crawldb
>> CrawlDb update: segments: [crawl/segments/20130421022135]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: 404 purging: false
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: finished at 2013-04-21 02:22:09, elapsed: 00:00:13
>> Generator: starting at 2013-04-21 02:22:09
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: filtering: true
>> Generator: normalizing: true
>> Generator: topN: 5
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls for politeness.
>> Generator: segment: crawl/segments/20130421022217
>> Generator: finished at 2013-04-21 02:22:25, elapsed: 00:00:15
>> Fetcher: Your 'http.agent.name' value should be listed first in
>> 'http.robots.agents' property.
>> Fetcher: starting at 2013-04-21 02:22:25
>> Fetcher: segment: crawl/segments/20130421022217
>> Using queue mode : byHost
>> Fetcher: threads: 5
>> Fetcher: time-out divisor: 2
>> QueueFeeder finished: total 5 records + hit by time limit :0
>> Using queue mode : byHost
>> Using queue mode : byHost
>> fetching http://m.163.com/
>> Using queue mode : byHost
>> Using queue mode : byHost
>> fetching http://3g.163.com/links/4145
>> fetching http://music.163.com/
>> Using queue mode : byHost
>> Fetcher: throughput threshold: -1
>> Fetcher: throughput threshold retries: 5
>> fetching http://caipiao.163.com/mobile/client_cp.jsp
>> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
>> * queue: http://m.163.com
>>   maxThreads = 1
>>   inProgress = 0
>>   crawlDelay = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1366482150482
>>   now = 1366482146346
>>   0. http://m.163.com/newsapp/
>> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
>> * queue: http://m.163.com
>>   maxThreads = 1
>>   inProgress = 0
>>   crawlDelay = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1366482150482
>>   now = 1366482147348
>>   0. http://m.163.com/newsapp/
>> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
>> * queue: http://m.163.com
>>   maxThreads = 1
>>   inProgress = 0
>>   crawlDelay = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1366482150482
>>   now = 1366482148350
>>   0. http://m.163.com/newsapp/
>> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
>> * queue: http://m.163.com
>>   maxThreads = 1
>>   inProgress = 0
>>   crawlDelay = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1366482150482
>>   now = 1366482149352
>>   0. http://m.163.com/newsapp/
>> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1
>> * queue: http://m.163.com
>>   maxThreads = 1
>>   inProgress = 0
>>   crawlDelay = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1366482150482
>>   now = 1366482150354
>>   0. http://m.163.com/newsapp/
>> fetching http://m.163.com/newsapp/
>> -finishing thread FetcherThread, activeThreads=4
>> -finishing thread FetcherThread, activeThreads=3
>> -finishing thread FetcherThread, activeThreads=2
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=0
>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> -activeThreads=0
>> Fetcher: finished at 2013-04-21 02:22:38, elapsed: 00:00:13
>> ParseSegment: starting at 2013-04-21 02:22:38
>> ParseSegment: segment: crawl/segments/20130421022217
>> Parsed (4ms):http://caipiao.163.com/mobile/client_cp.jsp
>> Parsed (3ms):http://m.163.com/newsapp/
>> Parsed (1ms):http://music.163.com/
>> ParseSegment: finished at 2013-04-21 02:22:45, elapsed: 00:00:07
>> CrawlDb update: starting at 2013-04-21 02:22:45
>> CrawlDb update: db: crawl/crawldb
>> CrawlDb update: segments: [crawl/segments/20130421022217]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: 404 purging: false
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: finished at 2013-04-21 02:22:58, elapsed: 00:00:13
>> LinkDb: starting at 2013-04-21 02:22:58
>> LinkDb: linkdb: crawl/linkdb
>> LinkDb: URL normalize: true
>> LinkDb: URL filter: true
>> LinkDb: internal links will be ignored.
>> LinkDb: adding segment:
>> file:/home/lailx/search_engine/nutch/crawl/segments/20130421022135
>> LinkDb: adding segment:
>> file:/home/lailx/search_engine/nutch/crawl/segments/20130421022217
>> LinkDb: finished at 2013-04-21 02:23:08, elapsed: 00:00:10
>> SolrIndexer: starting at 2013-04-21 02:23:08
>> SolrIndexer: deleting gone documents: false
>> SolrIndexer: URL filtering: false
>> SolrIndexer: URL normalizing: false
>> Indexing 4 documents
>> java.io.IOException: Job failed!
>> SolrDeleteDuplicates: starting at 2013-04-21 02:23:39
>> SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/
>> Exception in thread "main" java.io.IOException: Job failed!
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>>         at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>> =============================================
>>
>> Nutch itself works fine, though: the crawl succeeds with the command
>> below (i.e. without the Solr indexing step):
>>
>> bin/nutch crawl urls -dir crawl -depth 2 -topN 5
>>
>> I have also modified [Solr_Home]/conf/schema.xml to integrate Nutch
>> with Solr, and Solr itself works well when accessed at
>> "localhost:8080/solr".
>>
>> I hope you can help with this problem; I look forward to your reply.
>> Thanks.
>>
>> Br,
>> Mick
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Exception-in-thread-main-java-io-IOException-Job-failed-tp4057620.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>


--
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>

