Re: Best practice for Nutch 2.x on AWS?

Michael Chen Thu, 17 Aug 2017 00:17:46 -0700

Fixed the problem... It was most likely a table match problem: it is necessary to specify -crawlId during indexing. Also the "Total 0 document is added" is probably a bug... The MR input/output record count is more reliable. :)
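For anyone hitting the same symptom, here is a minimal sketch of an indexing call that passes the crawl ID explicitly. It assumes the Nutch 2.x `index` command accepts `-all` and `-crawlId` and reads the Solr URL from the `solr.server.url` property; the crawl ID `mycrawl` and the Solr core URL are hypothetical placeholders, so check them against your own setup. With the HBase backend the crawl ID is normally used as a prefix of the table name, so leaving it out can point the job at a different (empty) table, which would be consistent with the 0 input records reported by MR.

```shell
# Hypothetical values: substitute the crawl ID used for inject/generate/fetch
# and the URL of your own Solr core.
CRAWL_ID="mycrawl"
SOLR_URL="http://localhost:8983/solr/nutch"

# Assemble the arguments once so they can be inspected before running.
set -- index -D "solr.server.url=${SOLR_URL}" -all -crawlId "${CRAWL_ID}"
echo "bin/nutch $*"

# Only invoke Nutch when run from a directory that contains bin/nutch
# (for example runtime/local or runtime/deploy).
if [ -x bin/nutch ]; then
  bin/nutch "$@"
fi
```

After the job finishes, compare the MR input/output record counters with the number of rows in the HBase table rather than relying on the "Total ... document is added" message.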


On 08/16/2017 11:30 PM, Divjot Singh wrote:

Hi Michael

I haven't used Solr for indexing. So I won't be able to help you on that one.


Divjot

On 17-Aug-2017 11:53 AM, "Michael Chen"<yiningchen2...@u.northwestern.edu<mailto:yiningchen2...@u.northwestern.edu>> wrote:


    Hi Divjot,

    You're right. I checked the webapp and rootdir is already defined
    by "hbase-site.xml" outside of Nutch, probably by Cloudera, though
    it is strange that Cloudera didn't take care of the quorum too...

    I just set up Solr 6.6.0 for lack of a good guide for the Cloudera
    Solr 4.10.3. It's running on HDFS in standalone mode. Everything
    seems good but IndexJob does not index properly. HBase data is
    good so I assume it's only indexing that went wrong.

    Solr-mapping is reflected properly in stdout. However, I noticed
    MR reported 0 input and output records...

    Would you have an idea of what might have gone wrong?

    Thanks a bunch!

    Michael


    On 08/16/2017 11:12 PM, Divjot Singh wrote:

    Hi

    You just need to add the ZooKeeper quorum of the HBase server you
    are connecting to in hbase-site.xml; there is no need for the HDFS
    URI. If your cluster is configured correctly and you are able to
    create tables in HBase, then Nutch should work fine once it gets
    the HBase server URL from hbase-site.xml.

    Thanks
    Divjot
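As an illustration of the advice above, the following sketch writes a minimal hbase-site.xml containing only the ZooKeeper quorum. The host names are hypothetical, and the client port 2181 is the usual ZooKeeper default rather than something stated in this thread; a Cloudera-managed cluster may use different values, so treat both as assumptions to verify.

```shell
# Hypothetical ZooKeeper hosts: replace with the quorum of your HBase cluster.
ZK_QUORUM="zk1.example.com,zk2.example.com,zk3.example.com"

# Write to a scratch file by default; copy the result into Nutch's conf/
# directory before running "ant runtime" so it ends up in the job JAR.
OUT_FILE="${OUT_FILE:-./hbase-site.xml}"

cat > "$OUT_FILE" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>${ZK_QUORUM}</value>
  </property>
  <property>
    <!-- Assumed default ZooKeeper client port -->
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
EOF
```

No HDFS URI is set here, in line with the reply above: once the client can reach ZooKeeper it should discover the rest of the HBase cluster from there.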

    On 17-Aug-2017 10:25 AM, "Michael Chen"
    <yiningchen2...@u.northwestern.edu
    <mailto:yiningchen2...@u.northwestern.edu>> wrote:

        Hi Divjot,

        Thanks for the reply! I checked the HBase tutorial but still
        am a bit confused. When I set up the standalone build,
        hbase-site.xml resides in the hbase conf/. But it seems that
        with the fully distributed + nutch deployment, I need to
        specify configurations in Nutch's hbase-site.xml, which gets
        deployed into the job JAR.

        My question is: what should I configure in Nutch's
        hbase-site.xml? Do I need to also include HDFS URI? Does the
        Cloudera HBase build override any default settings (as it
        should...)?

        Thank you!
        Michael



        On 08/16/2017 09:14 PM, Divjot Singh wrote:

        Hi Michael

        You can use the following tutorial:
        https://wiki.apache.org/nutch/Nutch2Tutorial

        Also update hbase-site.xml in the conf folder to add the
        ZooKeeper quorum if your HBase is on another cluster.

        Thanks
        Divjot


        On 17-Aug-2017 5:23 AM, "Michael Chen"
        <yiningchen2...@u.northwestern.edu
        <mailto:yiningchen2...@u.northwestern.edu>> wrote:

            Hi Divjot,

            I have a cluster running with Cloudera Manager (Hadoop,
            HBase, Solr, ZooKeeper). Do you know if I need to modify
            the hbase-site.xml before "ant runtime"? What
            configurations did you have to do manually for Nutch
            (and others)?

            Thanks in advance!


            Michael


            On 08/14/2017 07:29 PM, Divjot Singh wrote:

                Hi Michael

                I am using the latest Cloudera release and it's
                working fine. You can use
                any Linux distro you are comfortable with. Centos is
                mostly used for server
                deployments and it's quite stable.

                Thanks
                Divjot


                On 15-Aug-2017 2:09 AM, "Michael Chen"
                <yiningchen2...@u.northwestern.edu
                <mailto:yiningchen2...@u.northwestern.edu>>
                wrote:

                Hi Divjot,

                Thanks for the information! I was wondering if there
                is a specific version
                of cloudera manager and CDH that works best with
                Nutch 2.x (HBase 1.2.3,
                Hadoop 2.5.2)?

                Also, is there a specific reason to use Centos 7
                instead of Amazon Linux or
                Red Hat?

                I’ll try to get started with the setup. Thanks!

                Michael

                From: Divjot Singh
                Sent: Tuesday, August 8, 2017 04:06
                To: user@nutch.apache.org <mailto:user@nutch.apache.org>
                Subject: Re: Best practice for Nutch 2.x on AWS?

                Hi

                We have a setup of HBase on an AWS cluster with
                CentOS 7. The setup was
                done using cloudera-manager. Nutch can then be run
                in standalone mode or
                over YARN by running the deployment jar in the deploy
                folder.

                I have not tested with S3 directly but you can
                always back up the HBase
                data daily to S3.

                Hope this helps. Let me know if you have further queries.

                Divjot


                On Sun, Aug 6, 2017 at 5:59 AM, Michael Chen <
                yiningchen2...@u.northwestern.edu
                <mailto:yiningchen2...@u.northwestern.edu>> wrote:

                    Hi,

                    I'm trying to set up Nutch 2.x on AWS EC2
                    clusters, and I was wondering if
                    anyone knows of a "best setup" for it. The
                    Hadoop and HBase versions in
                    current EMR releases don't seem to work with
                    Nutch 2.x. Does it sound
                    like a good idea to manually set up Hadoop
                    clusters and then run Nutch on
                    them? Will I be able to use S3 as data storage so
                    that I can keep the data
                    when an EC2 instance stops?

                    Any suggestions would be very helpful!

                    Thanks in advance,

                    Michael
