Markus,

I was finally able to set up Nutch 1.7 on Hadoop 2.2. I am using the
following command to start it:

bin/hadoop jar \
  /home/general/workspace/nutch/runtime/deploy/apache-nutch-1.8-SNAPSHOT.job \
  org.apache.nutch.crawl.Crawl urls -dir crawldirectory -depth 1000 -topN 30000

The initial log output I get from Hadoop is shown below. However, I am not
sure whether the crawl is happening at all or is just running at an
extremely slow pace.

If I crawl the same single URL in my local test crawl from Eclipse, I get
the proper crawling speed. The Hadoop job, however, is as I mentioned
either not running at all or running at a very slow pace.

In the log you will see a WARN message that the Solr URL is not set. This is
because I log the data into Solr internally and do not have a Solr URL to
pass as an argument, so that warning message can be ignored.

Please see the log below.

----------------------------------------------------------------------------------------------------------


13/12/01 14:15:45 WARN crawl.Crawl: solrUrl is not set, indexing will be skipped...
13/12/01 14:15:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/12/01 14:15:46 INFO crawl.Crawl: crawl started in: crawldirectory
13/12/01 14:15:46 INFO crawl.Crawl: rootUrlDir = urls
13/12/01 14:15:46 INFO crawl.Crawl: threads = 30
13/12/01 14:15:46 INFO crawl.Crawl: depth = 1000
13/12/01 14:15:46 INFO crawl.Crawl: solrUrl=null
13/12/01 14:15:46 INFO crawl.Crawl: topN = 30000
13/12/01 14:15:46 INFO crawl.Injector: Injector: starting at 2013-12-01 14:15:46
13/12/01 14:15:46 INFO crawl.Injector: Injector: crawlDb: crawldirectory/crawldb
13/12/01 14:15:46 INFO crawl.Injector: Injector: urlDir: urls
13/12/01 14:15:46 INFO Configuration.deprecation: mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir
13/12/01 14:15:46 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
13/12/01 14:15:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/12/01 14:15:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/12/01 14:15:47 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/01 14:15:47 INFO mapreduce.JobSubmitter: number of splits:2
13/12/01 14:15:47 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
13/12/01 14:15:47 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/12/01 14:15:47 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/12/01 14:15:48 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/12/01 14:15:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1385868843066_1747
13/12/01 14:15:48 INFO impl.YarnClientImpl: Submitted application application_1385868843066_1747 to ResourceManager at /0.0.0.0:8032
13/12/01 14:15:48 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1385868843066_1747/
13/12/01 14:15:48 INFO mapreduce.Job: Running job: job_1385868843066_1747
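
One way to tell "not running at all" apart from "running slowly" might be to ask the ResourceManager about the application named in the output above. The following is only a rough sketch: `extract_app_id` is a hypothetical helper written for illustration, and it assumes that `grep -o` is available and that the standard Hadoop 2.x `yarn` command-line tool is on the PATH. If `yarn` is not found, the script only prints the application id.

```shell
#!/bin/sh
# Hypothetical helper: pull the first YARN application id out of a log line.
extract_app_id() {
    printf '%s\n' "$1" | grep -o 'application_[0-9][0-9]*_[0-9][0-9]*' | head -n 1
}

# The "Submitted application" line, copied from the output above.
line='13/12/01 14:15:48 INFO impl.YarnClientImpl: Submitted application application_1385868843066_1747 to ResourceManager at /0.0.0.0:8032'

app_id=$(extract_app_id "$line")
echo "$app_id"

# Ask the ResourceManager for the application report, but only when the
# yarn CLI is actually installed on this machine.
if command -v yarn >/dev/null 2>&1; then
    yarn application -status "$app_id"
fi
```

The report printed by `yarn application -status` includes the application state, which should show whether the job has actually been given resources to run. The tracking URL from the output above offers the same information in a browser.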



On Mon, Nov 25, 2013 at 6:51 AM, Markus Jelsma <[email protected]> wrote:

> I'm not sure it works; I think I've seen some issues with it. You can try,
> though.
>
>
> -----Original message-----
> > From: S.L <[email protected]>
> > Sent: Monday 25th November 2013 2:38
> > To: [email protected]
> > Subject: Nutch 1.7 and Hadoop Release 2.2.0
> >
> > Hi All,
> >
> > I am trying to set up a single node Hadoop cluster and noticed that the
> > Hadoop 2.2 release has been made available as a GA release candidate.
> >
> > As I am using Nutch 1.7, I was wondering whether it is compatible with
> > Hadoop 2.2. Please let me know.
> >
> > Thanks in advance!
>
