Hmm, so it seems to work after all! I see a lot of deprecation warnings, but
those are easy to fix; they are mostly about job setup. The logs are in your
Hadoop directory under the same name as the job name.
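As a sketch of the kind of fix involved: each deprecation warning in the log below names the old mapred.* key and its mapreduce.* replacement, so the cleanup is just renaming keys in the job configuration. A hypothetical excerpt (the property names come from the warnings; the values here are placeholders, not recommendations):

```xml
<!-- Hypothetical excerpt of a Hadoop 2.x job configuration
     (e.g. conf/nutch-site.xml or mapred-site.xml).
     Deprecated mapred.* keys renamed to their mapreduce.* equivalents,
     as suggested by the deprecation warnings; values are placeholders. -->
<configuration>
  <!-- was: mapred.temp.dir -->
  <property>
    <name>mapreduce.cluster.temp.dir</name>
    <value>/tmp/hadoop/mapred/temp</value>
  </property>
  <!-- was: mapred.map.tasks -->
  <property>
    <name>mapreduce.job.maps</name>
    <value>2</value>
  </property>
</configuration>
```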
-----Original message-----
> From: S.L <[email protected]>
> Sent: Tuesday 3rd December 2013 0:34
> To: [email protected]; [email protected]
> Subject: Re: Nutch 1.7 and Hadoop Release 2.2.0
>
> I am running it on a single machine. I have no idea how to get to the nutch
> logs and see what's going on.
>
> Sent from my HTC Inspire™ 4G on AT&T
>
> ----- Reply message -----
> From: "Paul Inventado" <[email protected]>
> To: <[email protected]>
> Subject: Nutch 1.7 and Hadoop Release 2.2.0
> Date: Mon, Dec 2, 2013 2:21 am
>
>
> Wow, you got it working! It looks like you're using Nutch 1.8? Are you
> running on a single machine or a distributed cluster?
>
>
> On Mon, Dec 2, 2013 at 3:22 AM, S.L <[email protected]> wrote:
>
> > Markus,
> >
> > I was finally able to set up Nutch 1.7 on Hadoop 2.2. I am using the
> > following command to start it up:
> >
> > bin/hadoop jar
> > /home/general/workspace/nutch/runtime/deploy/apache-nutch-1.8-SNAPSHOT.job
> > org.apache.nutch.crawl.Crawl urls -dir crawldirectory -depth 1000 -topN
> > 30000
> >
> > And the initial log output I get from Hadoop is as follows; however, I am
> > not sure whether the crawl is happening at all or is happening at a
> > dead-slow pace.
> >
> > If I test with the same single URL that I use in my local crawl from
> > Eclipse, I get the proper crawling speed; however, the Hadoop job, as I
> > mentioned, is either not running at all or is running at a slow pace.
> >
> > Please see the log below. You will see a WARN message that the Solr URL is
> > not set; this is because I index the data into Solr internally and don't
> > have a Solr URL to pass in the arguments, so that warning can be ignored.
> >
> >
> > ----------------------------------------------------------------------------------------------------------
> >
> >
> > 13/12/01 14:15:45 WARN crawl.Crawl: solrUrl is not set, indexing will be
> > skipped...
> > 13/12/01 14:15:45 WARN util.NativeCodeLoader: Unable to load native-hadoop
> > library for your platform... using builtin-java classes where applicable
> > 13/12/01 14:15:46 INFO crawl.Crawl: crawl started in: crawldirectory
> > 13/12/01 14:15:46 INFO crawl.Crawl: rootUrlDir = urls
> > 13/12/01 14:15:46 INFO crawl.Crawl: threads = 30
> > 13/12/01 14:15:46 INFO crawl.Crawl: depth = 1000
> > 13/12/01 14:15:46 INFO crawl.Crawl: solrUrl=null
> > 13/12/01 14:15:46 INFO crawl.Crawl: topN = 30000
> > 13/12/01 14:15:46 INFO crawl.Injector: Injector: starting at 2013-12-01
> > 14:15:46
> > 13/12/01 14:15:46 INFO crawl.Injector: Injector: crawlDb:
> > crawldirectory/crawldb
> > 13/12/01 14:15:46 INFO crawl.Injector: Injector: urlDir: urls
> > 13/12/01 14:15:46 INFO Configuration.deprecation: mapred.temp.dir is
> > deprecated. Instead, use mapreduce.cluster.temp.dir
> > 13/12/01 14:15:46 INFO crawl.Injector: Injector: Converting injected urls
> > to crawl db entries.
> > 13/12/01 14:15:46 INFO client.RMProxy: Connecting to ResourceManager at /
> > 0.0.0.0:8032
> > 13/12/01 14:15:46 INFO client.RMProxy: Connecting to ResourceManager at /
> > 0.0.0.0:8032
> > 13/12/01 14:15:47 INFO mapred.FileInputFormat: Total input paths to process
> > : 1
> > 13/12/01 14:15:47 INFO mapreduce.JobSubmitter: number of splits:2
> > 13/12/01 14:15:47 INFO Configuration.deprecation: user.name is deprecated.
> > Instead, use mapreduce.job.user.name
> > 13/12/01 14:15:47 INFO Configuration.deprecation: mapred.jar is deprecated.
> > Instead, use mapreduce.job.jar
> > 13/12/01 14:15:47 INFO Configuration.deprecation: mapred.output.value.class
> > is deprecated. Instead, use mapreduce.job.output.value.class
> > 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.job.name is
> > deprecated. Instead, use mapreduce.job.name
> > 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.input.dir is
> > deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
> > 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.output.dir is
> > deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
> > 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.map.tasks is
> > deprecated. Instead, use mapreduce.job.maps
> > 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.output.key.class
> > is deprecated. Instead, use mapreduce.job.output.key.class
> > 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.working.dir is
> > deprecated. Instead, use mapreduce.job.working.dir
> > 13/12/01 14:15:48 INFO mapreduce.JobSubmitter: Submitting tokens for job:
> > job_1385868843066_1747
> > 13/12/01 14:15:48 INFO impl.YarnClientImpl: Submitted application
> > application_1385868843066_1747 to ResourceManager at /0.0.0.0:8032
> > 13/12/01 14:15:48 INFO mapreduce.Job: The url to track the job:
> > http://localhost.localdomain:8088/proxy/application_1385868843066_1747/
> > 13/12/01 14:15:48 INFO mapreduce.Job: Running job: job_1385868843066_1747
> >
> >
> >
> > On Mon, Nov 25, 2013 at 6:51 AM, Markus Jelsma
> > <[email protected]> wrote:
> >
> > > I'm not sure it works; I think I've seen some issues with it. You can
> > > try, though.
> > >
> > >
> > > -----Original message-----
> > > > From: S.L <[email protected]>
> > > > Sent: Monday 25th November 2013 2:38
> > > > To: [email protected]
> > > > Subject: Nutch 1.7 and Hadoop Release 2.2.0
> > > >
> > > > Hi All,
> > > >
> > > > I am trying to set up a single-node Hadoop cluster and noticed that the
> > > > Hadoop 2.2 release has been made available as a GA release candidate.
> > > >
> > > > As I am using Nutch 1.7, I was wondering whether it is compatible with
> > > Hadoop
> > > > 2.2. Please let me know.
> > > >
> > > > Thanks in advance!
> > >
> >
>
>
>
> --
> Paul Inventado
> Waagle
> [email protected]
> www.waagle.com
>