Re: Best practice for Nutch 2.x on AWS?

Sebastian Nagel Tue, 15 Aug 2017 02:50:47 -0700

Hi Michael,

> Will I be able to use S3 as data storage so that I can keep the data when EC2 
> instance stops?

I don't know whether this is easily possible for 2.x and HBase. But
Nutch 1.x can read and write data directly from and to S3 (via the S3A
file system [1]). Only operations on the CrawlDb need a little
modification: an update moves the "current" folder to "old" and then
the temp folder to "current", and S3 does not support moves. But this
is easily worked around by copying between S3 and HDFS.
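
A rough sketch of that copy step, in case it helps. It only builds the
two "hadoop distcp" command lines (S3 to HDFS before the CrawlDb
update, HDFS back to S3 afterwards) and prints them; it does not run
anything. The bucket name, the S3 prefix, the HDFS path and the helper
name crawldb_sync_commands are all made up for illustration, and using
"-update -delete" to keep the target an exact mirror is my assumption,
not something Nutch requires.

```python
import shlex


def crawldb_sync_commands(bucket, s3_prefix="crawl/crawldb",
                          hdfs_path="/nutch/crawldb"):
    """Build (but do not run) the copy commands around a CrawlDb update.

    All names and paths here are hypothetical placeholders.
    """
    s3_uri = f"s3a://{bucket}/{s3_prefix}"
    hdfs_uri = f"hdfs://{hdfs_path}"
    # -update copies only files that differ; -delete removes files in the
    # target that no longer exist in the source (assumed to be desired here).
    distcp = ["hadoop", "distcp", "-update", "-delete"]
    return {
        "pull": distcp + [s3_uri, hdfs_uri],  # S3 -> HDFS before the update
        "push": distcp + [hdfs_uri, s3_uri],  # HDFS -> S3 after the update
    }


if __name__ == "__main__":
    cmds = crawldb_sync_commands("my-crawl-bucket")
    print(shlex.join(cmds["pull"]))
    print("# ... run the CrawlDb update against the HDFS copy here ...")
    print(shlex.join(cmds["push"]))
```

The folder moves then happen on HDFS, where renames are supported, and
S3 only ever sees plain copies.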

Best,
Sebastian

[1] https://wiki.apache.org/hadoop/AmazonS3

On 08/06/2017 02:29 AM, Michael Chen wrote:
> Hi,
> 
> I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if
> anyone knows of a "best set up" for it. The Hadoop and HBase versions in
> current EMR releases don't seem to work with Nutch 2.x. Does it sound like
> a good idea to manually set up Hadoop clusters and then run Nutch on them?
> Will I be able to use S3 as data storage so that I can keep the data when EC2
> instance stops?
> 
> Any suggestions would be very much helpful!
> 
> Thanks in advance,
> 
> Michael
>
