Re: Best practice for Nutch 2.x on AWS?

Michael Chen Wed, 16 Aug 2017 23:24:08 -0700

Hi Divjot,

You're right. I checked the webapp and rootdir is already defined by "hbase-site.xml" outside of Nutch, probably by CloudEra, though it is strange why CloudEra didn't take care of quorum too...

I just set up Solr 6.6.0 for lack of a good guide for the CloudEra Solr 4.10.3. It's running on HDFS standalone mode. Everything seems good but IndexJob does not index properly. HBase data is good so I assume it's only indexing that went wrong.

Solr-mapping is reflected properly in stdout. However, I noticed MR reported 0 input and output records...


Would you have an idea of what might have gone wrong?

Thanks a bunch!

Michael


On 08/16/2017 11:12 PM, Divjot Singh wrote:

Hi

You just need to add the zookeeper quorum of the hbase server you are connecting to in hbase-site.xml; no need for the hdfs uri. If your cluster is configured correctly and you are able to create tables in hbase then nutch should work fine once it gets the hbase server url from hbase-site.xml.
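
For illustration, a minimal sketch of the kind of hbase-site.xml entry described above. The hostnames (zk1.example.com and so on) are placeholders, not values from this thread, and the client port is shown with the usual ZooKeeper default of 2181, which may differ on your cluster.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Comma-separated list of ZooKeeper hosts used by the HBase cluster.
       The hostnames below are hypothetical placeholders. -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <!-- ZooKeeper client port; 2181 is assumed here as the common default. -->
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```

As noted later in this thread, Nutch's conf/hbase-site.xml gets deployed into the job JAR, so a change like this would presumably need to be made before running "ant runtime".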


Thanks
Divjot

On 17-Aug-2017 10:25 AM, "Michael Chen" <[email protected] <mailto:[email protected]>> wrote:


    Hi Divjot,

    Thanks for the reply! I checked the HBase tutorial but still am a
    bit confused. When I set up the standalone build, hbase-site.xml
    resides in the hbase conf/. But it seems that with the fully
    distributed + nutch deployment, I need to specify configurations
    in Nutch's hbase-site.xml, which gets deployed into the job JAR.

    My question is: what should I configure in Nutch's hbase-site.xml?
    Do I need to also include HDFS URI? Does the CloudEra HBase build
    override any default settings (as it should...)?

    Thank you!
    Michael



    On 08/16/2017 09:14 PM, Divjot Singh wrote:

    Hi Michael

    You can use the following tutorial:
    https://wiki.apache.org/nutch/Nutch2Tutorial

    Also update hbase-site.xml in the conf folder to add the
    zookeeper quorum if your hbase is on another cluster.

    Thanks
    Divjot


    On 17-Aug-2017 5:23 AM, "Michael Chen"
    <[email protected]
    <mailto:[email protected]>> wrote:

        Hi Divjot,

        I have a cluster running with CloudEra Manager (Hadoop,
        HBase, Solr, ZooKeeper). Do you know if I need to modify the
        hbase-site.xml before "ant runtime"? What configurations did
        you have to do manually for Nutch (and others)?

        Thanks in advance!


        Michael


        On 08/14/2017 07:29 PM, Divjot Singh wrote:

            Hi Michael

            I am using the latest Cloudera release and it's working
            fine. You can use
            any Linux distro you are comfortable with. Centos is
            mostly used for server
            deployments and it's quite stable.

            Thanks
            Divjot


            On 15-Aug-2017 2:09 AM, "Michael Chen"
            <[email protected]
            <mailto:[email protected]>>
            wrote:

            Hi Divjot,

            Thanks for the information! I was wondering if there is a
            specific version
            of cloudera manager and CDH that works best with Nutch
            2.x (HBase 1.2.3,
            Hadoop 2.5.2)?

            Also, is there a specific reason to use Centos 7 instead
            of Amazon Linux or
            Red Hat?

            I’ll try to get started with the setup. Thanks!

            Michael

            From: Divjot Singh
            Sent: Tuesday, August 8, 2017 04:06
            To: [email protected] <mailto:[email protected]>
            Subject: Re: Best practice for Nutch 2.x on AWS?

            Hi

            We have a setup of Hbase on an AWS cluster with centos 7.
            The setup was
            done using cloudera-manager. Nutch can be then run in
            standalone mode or
            over yarn by running the deployment jar in deploy folder.

            I have not tested with S3 directly but you can always
            backup the hbase
            data daily to S3.

            Hope this helps. Let me know if you have further queries.

            Divjot


            On Sun, Aug 6, 2017 at 5:59 AM, Michael Chen <
            [email protected]
            <mailto:[email protected]>> wrote:

                Hi,

                I'm trying to set up Nutch 2.x on AWS EC2 clusters,
                and I was wondering if
                anyone knows of a "best set up" for it. The hadoop and
                hbase versions in
                current EMR releases don't seem to work with Nutch
                2.x. Does it sound
                like a good idea to manually set up Hadoop clusters
                and then run Nutch on
                it? Will I be able to use S3 as data storage so that
                I can keep the data
                when EC2 instance stops?

                Any suggestions would be very much helpful!

                Thanks in advance,

                Michael
