Hi Divjot,
Thanks for the reply! I checked the HBase tutorial but am still a bit
confused. When I set up the standalone build, hbase-site.xml resided in
HBase's conf/ directory. But it seems that with the fully distributed
HBase + Nutch deployment, I need to specify the configuration in Nutch's
hbase-site.xml, which gets packaged into the job JAR.
My question is: what should I configure in Nutch's hbase-site.xml? Do I
also need to include the HDFS URI? Does the Cloudera HBase build
override any default settings (as it should...)?
Thank you!
Michael
On 08/16/2017 09:14 PM, Divjot Singh wrote:
Hi Michael
You can use the following tutorial:
https://wiki.apache.org/nutch/Nutch2Tutorial
Also update hbase-site.xml in the conf folder to add the ZooKeeper
quorum if your HBase is on another cluster.
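As an illustration only, a minimal hbase-site.xml fragment that sets the
ZooKeeper quorum might look like the sketch below. The host names are
placeholders (assumptions, not taken from this thread), and 2181 is
ZooKeeper's default client port; use whatever your cluster actually
exposes.

```xml
<configuration>
  <!-- Assumption: placeholder host names; replace with your ZooKeeper nodes -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <!-- ZooKeeper client port; 2181 is the default -->
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```

The file has to be in Nutch's conf folder before the build, since that
copy is the one packaged into the job JAR.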
Thanks
Divjot
On 17-Aug-2017 5:23 AM, "Michael Chen" <[email protected]> wrote:
Hi Divjot,
I have a cluster running with Cloudera Manager (Hadoop, HBase,
Solr, ZooKeeper). Do you know if I need to modify hbase-site.xml
before running "ant runtime"? What did you have to configure
manually for Nutch (and the other components)?
Thanks in advance!
Michael
On 08/14/2017 07:29 PM, Divjot Singh wrote:
Hi Michael
I am using the latest Cloudera release and it's working fine. You can
use any Linux distro you are comfortable with. CentOS is mostly used
for server deployments and it's quite stable.
Thanks
Divjot
On 15-Aug-2017 2:09 AM, "Michael Chen" <[email protected]> wrote:
Hi Divjot,
Thanks for the information! I was wondering if there is a specific
version of Cloudera Manager and CDH that works best with Nutch 2.x
(HBase 1.2.3, Hadoop 2.5.2)?
Also, is there a specific reason to use CentOS 7 instead of Amazon
Linux or Red Hat?
I’ll try to get started with the setup. Thanks!
Michael
From: Divjot Singh
Sent: Tuesday, August 8, 2017 04:06
To: [email protected]
Subject: Re: Best practice for Nutch 2.x on AWS?
Hi
We have a setup of HBase on an AWS cluster with CentOS 7. The setup was
done using Cloudera Manager. Nutch can then be run in standalone mode or
over YARN by running the deployment JAR in the deploy folder.
I have not tested with S3 directly, but you can always back up the
HBase data daily to S3.
Hope this helps. Let me know if you have further queries.
Divjot
On Sun, Aug 6, 2017 at 5:59 AM, Michael Chen <[email protected]> wrote:
Hi,
I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering
if anyone knows of a "best setup" for it. The Hadoop and HBase versions
in current EMR releases don't seem to work with Nutch 2.x. Does it sound
like a good idea to manually set up Hadoop clusters and then run Nutch
on them? Will I be able to use S3 as data storage so that I can keep the
data when the EC2 instances stop?
Any suggestions would be very helpful!
Thanks in advance,
Michael