Re: Best practice for Nutch 2.x on AWS?

Divjot Singh Thu, 17 Aug 2017 00:34:02 -0700

Glad you found the solution.

On Thu, Aug 17, 2017 at 12:46 PM, Michael Chen <
[email protected]> wrote:


> Fixed the problem... It was most likely a table mismatch problem: it is
> necessary to specify -crawlId during indexing. Also the "Total 0 document
> is added" message is probably a bug... The MR input/output record counts
> are more reliable.
> :)
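
A rough sketch of why the missing flag can produce 0 input records, assuming the HBase backend stores pages in a table named `<crawlId>_webpage` (plain `webpage` when no crawl id is given). The crawl id `mycrawl` and both helper functions are hypothetical, and the `index` subcommand name may differ between 2.x releases, so treat this as an illustration rather than the exact invocation:

```python
def webpage_table(crawl_id=None):
    """Table name Nutch 2.x is assumed to use for a given crawl id."""
    # Assumption: the crawl id is prepended as "<crawlId>_" to "webpage".
    return f"{crawl_id}_webpage" if crawl_id else "webpage"


def index_command(crawl_id, nutch_bin="bin/nutch"):
    """Argument list for an indexing run that targets one crawl id."""
    # "-all" selects every batch; "-crawlId" must match the id used
    # for the earlier inject/generate/fetch/parse/updatedb steps.
    return [nutch_bin, "index", "-all", "-crawlId", crawl_id]


crawl_id = "mycrawl"  # hypothetical crawl id
print("crawl data lives in:", webpage_table(crawl_id))
print("without -crawlId the job reads:", webpage_table())
print("index with:", " ".join(index_command(crawl_id)))
```

If the two table names differ, the indexing job scans a table with no crawled rows, which would be consistent with MR reporting 0 input and output records.
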
>
>
> On 08/16/2017 11:30 PM, Divjot Singh wrote:
>
>> Hi Michael
>>
>> I haven't used Solr for indexing. So I won't be able to help you on that
>> one.
>>
>> Divjot
>>
>>
>> On 17-Aug-2017 11:53 AM, "Michael Chen" <[email protected]> wrote:
>>
>>     Hi Divjot,
>>
>>     You're right. I checked the webapp and rootdir is already defined
>>     by "hbase-site.xml" outside of Nutch, probably by Cloudera, though
>>     it is strange that Cloudera didn't take care of the quorum too...
>>
>>     I just set up Solr 6.6.0 for lack of a good guide for the Cloudera
>>     Solr 4.10.3. It's running on HDFS standalone mode. Everything
>>     seems good but IndexJob does not index properly. HBase data is
>>     good so I assume it's only indexing that went wrong.
>>
>>     Solr-mapping is reflected properly in stdout. However, I noticed
>>     MR reported 0 input and output records...
>>
>>     Would you have an idea of what might have gone wrong?
>>
>>     Thanks a bunch!
>>
>>     Michael
>>
>>
>>     On 08/16/2017 11:12 PM, Divjot Singh wrote:
>>
>>>     Hi
>>>
>>>     You just need to add the zookeeper quorum of the hbase server you
>>>     are connecting to in hbase-site.xml; there is no need for the
>>>     hdfs uri. If your cluster is configured correctly and you are
>>>     able to create tables in hbase, then nutch should work fine once
>>>     it gets the hbase server url from hbase-site.xml.
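
As a minimal sketch of the configuration described above, the snippet below generates an hbase-site.xml body containing only the `hbase.zookeeper.quorum` property. The property name is the standard HBase one; the host names and the `hbase_site_xml` helper are hypothetical placeholders, and a real cluster may need further properties:

```python
import xml.etree.ElementTree as ET


def hbase_site_xml(quorum_hosts):
    """Render a minimal hbase-site.xml that sets only the ZooKeeper quorum."""
    conf = ET.Element("configuration")
    prop = ET.SubElement(conf, "property")
    ET.SubElement(prop, "name").text = "hbase.zookeeper.quorum"
    # HBase expects a comma-separated list of ZooKeeper hosts.
    ET.SubElement(prop, "value").text = ",".join(quorum_hosts)
    return ET.tostring(conf, encoding="unicode")


# Hypothetical host names; replace with the cluster's ZooKeeper nodes.
print(hbase_site_xml(["zk1.example.internal", "zk2.example.internal"]))
```

The output would go into the hbase-site.xml in Nutch's conf folder before building, so that it ends up in the job JAR.
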
>>>
>>>     Thanks
>>>     Divjot
>>>
>>>     On 17-Aug-2017 10:25 AM, "Michael Chen"
>>>     <[email protected]
>>>     <mailto:[email protected]>> wrote:
>>>
>>>         Hi Divjot,
>>>
>>>         Thanks for the reply! I checked the HBase tutorial but still
>>>         am a bit confused. When I set up the standalone build,
>>>         hbase-site.xml resides in the hbase conf/. But it seems that
>>>         with the fully distributed + nutch deployment, I need to
>>>         specify configurations in Nutch's hbase-site.xml, which gets
>>>         deployed into the job JAR.
>>>
>>>         My question is: what should I configure in Nutch's
>>>         hbase-site.xml? Do I need to also include HDFS URI? Does the
>>>         Cloudera HBase build override any default settings (as it
>>>         should...)?
>>>
>>>         Thank you!
>>>         Michael
>>>
>>>
>>>
>>>         On 08/16/2017 09:14 PM, Divjot Singh wrote:
>>>
>>>>         Hi Michael
>>>>
>>>>         You can use the following tutorial:
>>>>         https://wiki.apache.org/nutch/Nutch2Tutorial
>>>>
>>>>         Also update hbase-site.xml in the conf folder to add the
>>>>         zookeeper quorum if your hbase is on another cluster.
>>>>
>>>>         Thanks
>>>>         Divjot
>>>>
>>>>
>>>>         On 17-Aug-2017 5:23 AM, "Michael Chen"
>>>>         <[email protected]
>>>>         <mailto:[email protected]>> wrote:
>>>>
>>>>             Hi Divjot,
>>>>
>>>>             I have a cluster running with Cloudera Manager (Hadoop,
>>>>             HBase, Solr, ZooKeeper). Do you know if I need to modify
>>>>             the hbase-site.xml before "ant runtime"? What
>>>>             configurations did you have to do manually for Nutch
>>>>             (and others)?
>>>>
>>>>             Thanks in advance!
>>>>
>>>>
>>>>             Michael
>>>>
>>>>
>>>>             On 08/14/2017 07:29 PM, Divjot Singh wrote:
>>>>
>>>>                 Hi Michael
>>>>
>>>>                 I am using the latest Cloudera release and it's
>>>>                 working fine. You can use
>>>>                 any Linux distro you are comfortable with. Centos is
>>>>                 mostly used for server
>>>>                 deployments and it's quite stable.
>>>>
>>>>                 Thanks
>>>>                 Divjot
>>>>
>>>>
>>>>                 On 15-Aug-2017 2:09 AM, "Michael Chen"
>>>>                 <[email protected]
>>>>                 <mailto:[email protected]>>
>>>>                 wrote:
>>>>
>>>>                 Hi Divjot,
>>>>
>>>>                 Thanks for the information! I was wondering if there
>>>>                 is a specific version
>>>>                 of cloudera manager and CDH that works best with
>>>>                 Nutch 2.x (HBase 1.2.3,
>>>>                 Hadoop 2.5.2)?
>>>>
>>>>                 Also, is there a specific reason to use Centos 7
>>>>                 instead of Amazon Linux or
>>>>                 Red Hat?
>>>>
>>>>                 I’ll try to get started with the setup. Thanks!
>>>>
>>>>                 Michael
>>>>
>>>>                 From: Divjot Singh
>>>>                 Sent: Tuesday, August 8, 2017 04:06
>>>>                 To: [email protected] <mailto:[email protected]
>>>> >
>>>>                 Subject: Re: Best practice for Nutch 2.x on AWS?
>>>>
>>>>                 Hi
>>>>
>>>>                 We have a setup of Hbase on an AWS cluster with
>>>>                 centos 7. The setup was
>>>>                 done using cloudera-manager. Nutch can be then run
>>>>                 in standalone mode or
>>>>                 over yarn by running the deployment jar in deploy
>>>>                 folder.
>>>>
>>>>                 I have not tested with S3 directly but you can
>>>>                 always back up the hbase
>>>>                 data daily to S3.
>>>>
>>>>                 Hope this helps. Let me know if you have further queries.
>>>>
>>>>                 Divjot
>>>>
>>>>
>>>>                 On Sun, Aug 6, 2017 at 5:59 AM, Michael Chen <
>>>>                 [email protected]
>>>>                 <mailto:[email protected]>> wrote:
>>>>
>>>>                     Hi,
>>>>
>>>>                     I'm trying to set up Nutch 2.x on AWS EC2
>>>>                     clusters, and I was wondering if
>>>>                     anyone knows of a "best set up" for it. The
>>>>                     hadoop and hbase versions in
>>>>                     current EMR releases don't seem to work with
>>>>                     Nutch 2.x. Does it sound
>>>>                     like a good idea to manually set up Hadoop
>>>>                     clusters and then run Nutch on
>>>>                     it? Will I be able to use S3 as data storage so
>>>>                     that I can keep the data
>>>>                     when the EC2 instance stops?
>>>>
>>>>                     Any suggestions would be very helpful!
>>>>
>>>>                     Thanks in advance,
>>>>
>>>>                     Michael
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
