Markus, I was finally able to set up Nutch 1.7 on Hadoop 2.2. I am using the following command to start it:
bin/hadoop jar /home/general/workspace/nutch/runtime/deploy/apache-nutch-1.8-SNAPSHOT.job org.apache.nutch.crawl.Crawl urls -dir crawldirectory -depth 1000 -topN 30000

The initial log output I get from Hadoop is below. However, I am not sure whether the crawl is happening at all or is just running at a very slow pace. When I crawl the same single URL locally from Eclipse as a test, I get the proper crawling speed, but the Hadoop job, as I mentioned, is either not running at all or running very slowly.

You will see a WARN message that the Solr URL is not set. This is because I log the data into Solr internally and do not have a Solr URL to pass in the arguments, so that warning can be ignored.

----------------------------------------------------------------------------------------------------------
13/12/01 14:15:45 WARN crawl.Crawl: solrUrl is not set, indexing will be skipped...
13/12/01 14:15:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/12/01 14:15:46 INFO crawl.Crawl: crawl started in: crawldirectory
13/12/01 14:15:46 INFO crawl.Crawl: rootUrlDir = urls
13/12/01 14:15:46 INFO crawl.Crawl: threads = 30
13/12/01 14:15:46 INFO crawl.Crawl: depth = 1000
13/12/01 14:15:46 INFO crawl.Crawl: solrUrl=null
13/12/01 14:15:46 INFO crawl.Crawl: topN = 30000
13/12/01 14:15:46 INFO crawl.Injector: Injector: starting at 2013-12-01 14:15:46
13/12/01 14:15:46 INFO crawl.Injector: Injector: crawlDb: crawldirectory/crawldb
13/12/01 14:15:46 INFO crawl.Injector: Injector: urlDir: urls
13/12/01 14:15:46 INFO Configuration.deprecation: mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir
13/12/01 14:15:46 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
13/12/01 14:15:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/12/01 14:15:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/12/01 14:15:47 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/01 14:15:47 INFO mapreduce.JobSubmitter: number of splits:2
13/12/01 14:15:47 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
13/12/01 14:15:47 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/12/01 14:15:47 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/12/01 14:15:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1385868843066_1747
13/12/01 14:15:48 INFO impl.YarnClientImpl: Submitted application application_1385868843066_1747 to ResourceManager at /0.0.0.0:8032
13/12/01 14:15:48 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1385868843066_1747/
13/12/01 14:15:48 INFO mapreduce.Job: Running job: job_1385868843066_1747

On Mon, Nov 25, 2013 at 6:51 AM, Markus Jelsma <[email protected]> wrote:

> I'm not sure it works, i think i've seen some issues with it. You can try
> though
>
>
> -----Original message-----
> > From:S.L <[email protected]>
> > Sent: Monday 25th November 2013 2:38
> > To: [email protected]
> > Subject: Nutch 1.7 and Hadoop Release 2.2.0
> >
> > Hi All,
> >
> > I am trying to set up a single node Hadoop cluster and noticed that the
> > Hadoop 2.2 release has been made available as a GA release candidate.
> >
> > As I am using Nutch 1.7 , I was wondering if it was compatible with Hadoop
> > 2.2 , please let me know .
> >
> > Thanks in advance!
>

