Wow, you got it working! From the job file name it looks like you're using
Nutch 1.8-SNAPSHOT rather than 1.7? Are you running on a single machine or
a distributed cluster?
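
To tell whether the job is actually progressing, one option is to pull the
job ID and tracking URL out of the client log and look the job up in the
ResourceManager web UI. Below is a minimal sketch, assuming the two log
lines from your quoted output with the mail client's line wrapping undone;
the helper name find_job_info is made up for illustration and is not part
of Nutch or Hadoop.

```python
import re

# Assumption: these two lines are copied from the client log quoted below,
# with the email line wrapping removed.
LOG = """\
13/12/01 14:15:48 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1385868843066_1747/
13/12/01 14:15:48 INFO mapreduce.Job: Running job: job_1385868843066_1747
"""


def find_job_info(log_text):
    """Return (job_id, tracking_url) found in the log; either may be None."""
    job = re.search(r"Running job: (job_\d+_\d+)", log_text)
    url = re.search(r"The url to track the job: (\S+)", log_text)
    return (job.group(1) if job else None, url.group(1) if url else None)


job_id, tracking_url = find_job_info(LOG)
print("job id:", job_id)
print("tracking url:", tracking_url)
```

Opening the tracking URL in a browser should show the map and reduce
progress for the job. If the Hadoop command-line tools are on your path,
running mapred job -status with that job ID should report the same
completion percentages.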


On Mon, Dec 2, 2013 at 3:22 AM, S.L <[email protected]> wrote:

> Markus,
>
> I was finally able to set up Nutch 1.7 on Hadoop 2.2. I am using the
> following command to start it:
>
> bin/hadoop jar
> /home/general/workspace/nutch/runtime/deploy/apache-nutch-1.8-SNAPSHOT.job
> org.apache.nutch.crawl.Crawl urls -dir crawldirectory -depth 1000 -topN
> 30000
>
> And the initial log output I get from Hadoop is as follows. However, I am
> not sure whether the crawl is happening at all or is just running at a
> very slow pace.
>
> If I crawl the same single URL that I use in my local crawl from Eclipse,
> I get the proper crawling speed, but the Hadoop job, as I mentioned, is
> either not running at all or running very slowly.
>
> Please see the log below. You will see a WARN message that the Solr URL is
> not set. This is because I log the data into Solr internally and don't
> have a Solr URL to pass in the arguments, so that warning can be ignored.
>
>
> ----------------------------------------------------------------------------------------------------------
>
>
> 13/12/01 14:15:45 WARN crawl.Crawl: solrUrl is not set, indexing will be
> skipped...
> 13/12/01 14:15:45 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 13/12/01 14:15:46 INFO crawl.Crawl: crawl started in: crawldirectory
> 13/12/01 14:15:46 INFO crawl.Crawl: rootUrlDir = urls
> 13/12/01 14:15:46 INFO crawl.Crawl: threads = 30
> 13/12/01 14:15:46 INFO crawl.Crawl: depth = 1000
> 13/12/01 14:15:46 INFO crawl.Crawl: solrUrl=null
> 13/12/01 14:15:46 INFO crawl.Crawl: topN = 30000
> 13/12/01 14:15:46 INFO crawl.Injector: Injector: starting at 2013-12-01
> 14:15:46
> 13/12/01 14:15:46 INFO crawl.Injector: Injector: crawlDb:
> crawldirectory/crawldb
> 13/12/01 14:15:46 INFO crawl.Injector: Injector: urlDir: urls
> 13/12/01 14:15:46 INFO Configuration.deprecation: mapred.temp.dir is
> deprecated. Instead, use mapreduce.cluster.temp.dir
> 13/12/01 14:15:46 INFO crawl.Injector: Injector: Converting injected urls
> to crawl db entries.
> 13/12/01 14:15:46 INFO client.RMProxy: Connecting to ResourceManager at /
> 0.0.0.0:8032
> 13/12/01 14:15:46 INFO client.RMProxy: Connecting to ResourceManager at /
> 0.0.0.0:8032
> 13/12/01 14:15:47 INFO mapred.FileInputFormat: Total input paths to process
> : 1
> 13/12/01 14:15:47 INFO mapreduce.JobSubmitter: number of splits:2
> 13/12/01 14:15:47 INFO Configuration.deprecation: user.name is deprecated.
> Instead, use mapreduce.job.user.name
> 13/12/01 14:15:47 INFO Configuration.deprecation: mapred.jar is deprecated.
> Instead, use mapreduce.job.jar
> 13/12/01 14:15:47 INFO Configuration.deprecation: mapred.output.value.class
> is deprecated. Instead, use mapreduce.job.output.value.class
> 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.job.name is
> deprecated. Instead, use mapreduce.job.name
> 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.input.dir is
> deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
> 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.output.dir is
> deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
> 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.map.tasks is
> deprecated. Instead, use mapreduce.job.maps
> 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.output.key.class
> is deprecated. Instead, use mapreduce.job.output.key.class
> 13/12/01 14:15:48 INFO Configuration.deprecation: mapred.working.dir is
> deprecated. Instead, use mapreduce.job.working.dir
> 13/12/01 14:15:48 INFO mapreduce.JobSubmitter: Submitting tokens for job:
> job_1385868843066_1747
> 13/12/01 14:15:48 INFO impl.YarnClientImpl: Submitted application
> application_1385868843066_1747 to ResourceManager at /0.0.0.0:8032
> 13/12/01 14:15:48 INFO mapreduce.Job: The url to track the job:
> http://localhost.localdomain:8088/proxy/application_1385868843066_1747/
> 13/12/01 14:15:48 INFO mapreduce.Job: Running job: job_1385868843066_1747
>
>
>
> On Mon, Nov 25, 2013 at 6:51 AM, Markus Jelsma
> <[email protected]> wrote:
>
> > I'm not sure it works; I think I've seen some issues with it. You can
> > try it, though.
> >
> >
> > -----Original message-----
> > > From: S.L <[email protected]>
> > > Sent: Monday 25th November 2013 2:38
> > > To: [email protected]
> > > Subject: Nutch 1.7 and Hadoop Release 2.2.0
> > >
> > > Hi All,
> > >
> > > I am trying to set up a single-node Hadoop cluster and noticed that
> > > Hadoop 2.2 has been made available as a GA release.
> > >
> > > As I am using Nutch 1.7, I was wondering whether it is compatible with
> > > Hadoop 2.2. Please let me know.
> > >
> > > Thanks in advance!
> >
>



-- 
Paul Inventado
Waagle
[email protected]
www.waagle.com

