Glad you found the solution.

On Thu, Aug 17, 2017 at 12:46 PM, Michael Chen <[email protected]> wrote:
> Fixed the problem... It was most likely a table match problem: it is
> necessary to specify -crawlId during indexing. Also, the "Total 0 document
> is added" message is probably a bug... The MR input/output record count is
> more reliable. :)
>
> On 08/16/2017 11:30 PM, Divjot Singh wrote:
>
>> Hi Michael
>>
>> I haven't used Solr for indexing, so I won't be able to help you on that
>> one.
>>
>> Divjot
>>
>> On 17-Aug-2017 11:53 AM, "Michael Chen" <[email protected]> wrote:
>>
>>     Hi Divjot,
>>
>>     You're right. I checked the webapp and rootdir is already defined
>>     by "hbase-site.xml" outside of Nutch, probably by Cloudera, though
>>     it is strange that Cloudera didn't take care of the quorum too...
>>
>>     I just set up Solr 6.6.0 for lack of a good guide for the Cloudera
>>     Solr 4.10.3. It's running on HDFS standalone mode. Everything
>>     seems good but IndexJob does not index properly. HBase data is
>>     good so I assume it's only indexing that went wrong.
>>
>>     Solr-mapping is reflected properly in stdout. However, I noticed
>>     MR reported 0 input and output records...
>>
>>     Would you have an idea of what might have gone wrong?
>>
>>     Thanks a bunch!
>>
>>     Michael
>>
>>     On 08/16/2017 11:12 PM, Divjot Singh wrote:
>>
>>> Hi
>>>
>>> You just need to add the ZooKeeper quorum of the HBase server you
>>> are connecting to in hbase-site.xml; there is no need for the HDFS
>>> URI. If your cluster is configured correctly and you are able to
>>> create tables in HBase, then Nutch should work fine once it gets the
>>> HBase server URL from hbase-site.xml.
>>>
>>> Thanks
>>> Divjot
>>>
>>> On 17-Aug-2017 10:25 AM, "Michael Chen" <[email protected]> wrote:
>>>
>>>     Hi Divjot,
>>>
>>>     Thanks for the reply! I checked the HBase tutorial but am still
>>>     a bit confused. When I set up the standalone build,
>>>     hbase-site.xml resides in the HBase conf/. But it seems that
>>>     with the fully distributed + Nutch deployment, I need to
>>>     specify configurations in Nutch's hbase-site.xml, which gets
>>>     deployed into the job JAR.
>>>
>>>     My question is: what should I configure in Nutch's
>>>     hbase-site.xml? Do I also need to include the HDFS URI? Does the
>>>     Cloudera HBase build override any default settings (as it
>>>     should...)?
>>>
>>>     Thank you!
>>>     Michael
>>>
>>>     On 08/16/2017 09:14 PM, Divjot Singh wrote:
>>>
>>>> Hi Michael
>>>>
>>>> You can use the following tutorial:
>>>> https://wiki.apache.org/nutch/Nutch2Tutorial
>>>>
>>>> Also update hbase-site.xml in the conf folder to add the
>>>> ZooKeeper quorum if your HBase is on another cluster.
>>>>
>>>> Thanks
>>>> Divjot
>>>>
>>>> On 17-Aug-2017 5:23 AM, "Michael Chen" <[email protected]> wrote:
>>>>
>>>>     Hi Divjot,
>>>>
>>>>     I have a cluster running with Cloudera Manager (Hadoop,
>>>>     HBase, Solr, ZooKeeper). Do you know if I need to modify
>>>>     the hbase-site.xml before "ant runtime"? What
>>>>     configurations did you have to do manually for Nutch
>>>>     (and others)?
>>>>
>>>>     Thanks in advance!
>>>>
>>>>     Michael
>>>>
>>>>     On 08/14/2017 07:29 PM, Divjot Singh wrote:
>>>>
>>>>         Hi Michael
>>>>
>>>>         I am using the latest Cloudera release and it's
>>>>         working fine. You can use any Linux distro you are
>>>>         comfortable with. CentOS is mostly used for server
>>>>         deployments and it's quite stable.
>>>>
>>>>         Thanks
>>>>         Divjot
>>>>
>>>>         On 15-Aug-2017 2:09 AM, "Michael Chen" <[email protected]> wrote:
>>>>
>>>>         Hi Divjot,
>>>>
>>>>         Thanks for the information! I was wondering if there
>>>>         is a specific version of Cloudera Manager and CDH that
>>>>         works best with Nutch 2.x (HBase 1.2.3, Hadoop 2.5.2)?
>>>>
>>>>         Also, is there a specific reason to use CentOS 7
>>>>         instead of Amazon Linux or Red Hat?
>>>>
>>>>         I'll try to get started with the setup. Thanks!
>>>>
>>>>         Michael
>>>>
>>>>         From: Divjot Singh
>>>>         Sent: Tuesday, August 8, 2017 04:06
>>>>         To: [email protected]
>>>>         Subject: Re: Best practice for Nutch 2.x on AWS?
>>>>
>>>>         Hi
>>>>
>>>>         We have a setup of HBase on an AWS cluster with
>>>>         CentOS 7. The setup was done using cloudera-manager.
>>>>         Nutch can then be run in standalone mode or over YARN
>>>>         by running the deployment jar in the deploy folder.
>>>>
>>>>         I have not tested with S3 directly, but you can
>>>>         always back up the HBase data daily to S3.
>>>>
>>>>         Hope this helps. Let me know if you have further queries.
>>>>
>>>>         Divjot
>>>>
>>>>         On Sun, Aug 6, 2017 at 5:59 AM, Michael Chen <[email protected]> wrote:
>>>>
>>>>             Hi,
>>>>
>>>>             I'm trying to set up Nutch 2.x on AWS EC2
>>>>             clusters, and I was wondering if anyone knows of a
>>>>             "best setup" for it. The Hadoop and HBase versions
>>>>             in current EMR releases don't seem to work with
>>>>             Nutch 2.x. Does it sound like a good idea to
>>>>             manually set up Hadoop clusters and then run Nutch
>>>>             on them? Will I be able to use S3 as data storage
>>>>             so that I can keep the data when an EC2 instance
>>>>             stops?
>>>>
>>>>             Any suggestions would be very helpful!
>>>>
>>>>             Thanks in advance,
>>>>
>>>>             Michael
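
For anyone landing on this thread later: the ZooKeeper quorum setting discussed above would look roughly like the fragment below in the hbase-site.xml under Nutch's conf/ directory. This is only a minimal sketch. The hostnames are placeholders that do not come from the thread, and the client port shown is simply HBase's default; use the values of your own cluster.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Comma-separated ZooKeeper hosts of the HBase cluster Nutch should talk to.
       zk1/zk2/zk3.example.com are hypothetical placeholders. -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <!-- ZooKeeper client port; 2181 is the default and is assumed here. -->
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```

As noted in the thread, no HDFS URI is needed in this file, and the file has to be in place before building the job JAR ("ant runtime") when running in deploy mode.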

