Hi Divjot,

Thanks for the reply! I checked the HBase tutorial but am still a bit confused. When I set up the standalone build, hbase-site.xml resides in HBase's conf/ directory. But it seems that with the fully distributed + Nutch deployment, I need to specify the configuration in Nutch's hbase-site.xml, which gets packaged into the job JAR.

My question is: what should I configure in Nutch's hbase-site.xml? Do I also need to include the HDFS URI? Does the Cloudera HBase build override any default settings (as it should...)?

Thank you!
Michael


On 08/16/2017 09:14 PM, Divjot Singh wrote:
Hi Michael

You can use the following tutorial:
https://wiki.apache.org/nutch/Nutch2Tutorial

Also update hbase-site.xml in the conf folder to add the ZooKeeper quorum if your HBase is on another cluster.
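
As a rough sketch (not taken from a working deployment), the quorum entry in Nutch's conf/hbase-site.xml might look like the following. The host names are placeholders, and the client port shown is ZooKeeper's usual default, so both are assumptions to replace with your cluster's values:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder host names: replace with your own ZooKeeper ensemble -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <!-- Assumed default ZooKeeper client port; adjust if your cluster differs -->
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```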

Thanks
Divjot


On 17-Aug-2017 5:23 AM, "Michael Chen" <[email protected]> wrote:

    Hi Divjot,

    I have a cluster running with Cloudera Manager (Hadoop, HBase,
    Solr, ZooKeeper). Do you know if I need to modify
    hbase-site.xml before "ant runtime"? What configuration did you
    have to do manually for Nutch (and the others)?

    Thanks in advance!


    Michael


    On 08/14/2017 07:29 PM, Divjot Singh wrote:

        Hi Michael

        I am using the latest Cloudera release and it's working fine. You can
        use any Linux distro you are comfortable with. CentOS is mostly used
        for server deployments and it's quite stable.

        Thanks
        Divjot


        On 15-Aug-2017 2:09 AM, "Michael Chen" <[email protected]> wrote:

        Hi Divjot,

        Thanks for the information! I was wondering if there is a specific
        version of Cloudera Manager and CDH that works best with Nutch 2.x
        (HBase 1.2.3, Hadoop 2.5.2)?

        Also, is there a specific reason to use CentOS 7 instead of Amazon
        Linux or Red Hat?

        I’ll try to get started with the setup. Thanks!

        Michael

        From: Divjot Singh
        Sent: Tuesday, August 8, 2017 04:06
        To: [email protected]
        Subject: Re: Best practice for Nutch 2.x on AWS?

        Hi

        We have a setup of HBase on an AWS cluster with CentOS 7. The setup
        was done using Cloudera Manager. Nutch can then be run in standalone
        mode or on YARN by running the deployment JAR in the deploy folder.

        I have not tested with S3 directly, but you can always back up the
        HBase data daily to S3.

        Hope this helps. Let me know if you have further queries.

        Divjot


        On Sun, Aug 6, 2017 at 5:59 AM, Michael Chen <[email protected]> wrote:

            Hi,

            I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was
            wondering if anyone knows of a "best setup" for it. The Hadoop
            and HBase versions in current EMR releases don't seem to work
            with Nutch 2.x. Does it sound like a good idea to manually set
            up Hadoop clusters and then run Nutch on them? Will I be able to
            use S3 as data storage so that I can keep the data when the EC2
            instances stop?

            Any suggestions would be very helpful!

            Thanks in advance,

            Michael




