Hi Michael,

> Will I be able to use S3 as data storage so that I can keep the data when EC2
> instance stops?

I don't know whether this is easily possible with Nutch 2.x and HBase. But Nutch 1.x can read and write data directly from S3 (via the S3A file system [1]). Only the operations on the CrawlDb need a little modification: an update moves folders (current to old, and the temp folder to current), and S3 does not support moves. But this is easily worked around by copying the CrawlDb between S3 and HDFS.

Best,
Sebastian

[1] https://wiki.apache.org/hadoop/AmazonS3

On 08/06/2017 02:29 AM, Michael Chen wrote:
> Hi,
>
> I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if
> anyone know of a "best set up" for it. The hadoop and hbase version in
> current EMR releases doesn't seem to work with Nutch 2.x. Does it sound like
> a good idea to manually set up Hadoop clusters and then run Nutch on it?
> Will I be able to use S3 as data storage so that I can keep the data when EC2
> instance stops?
>
> Any suggestions would be very much helpful!
>
> Thanks in advance,
>
> Michael
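
P.S. A rough sketch of the copy workaround for Nutch 1.x, in case it helps. It only builds the command lines and does not run anything. The bucket name, the paths, and the helper crawldb_update_commands are made up for illustration, and using distcp with -update -delete is my assumption about how to do the copy, not something prescribed by Nutch. Adjust to your setup.

```python
import shlex


def crawldb_update_commands(s3_crawldb, hdfs_crawldb, segment):
    """Build (but do not run) the commands for one CrawlDb update.

    The CrawlDb is copied from S3 to HDFS, updated there (HDFS supports
    the folder moves the update needs), and then copied back to S3.
    All paths are hypothetical placeholders supplied by the caller.
    """
    return [
        # 1. Bring the CrawlDb from S3 onto HDFS.
        ["hadoop", "distcp", "-update", "-delete", s3_crawldb, hdfs_crawldb],
        # 2. Update the CrawlDb on HDFS from a fetched and parsed segment.
        ["bin/nutch", "updatedb", hdfs_crawldb, segment],
        # 3. Copy the updated CrawlDb back to S3 so it survives the cluster.
        ["hadoop", "distcp", "-update", "-delete", hdfs_crawldb, s3_crawldb],
    ]


if __name__ == "__main__":
    commands = crawldb_update_commands(
        "s3a://example-bucket/crawl/crawldb",      # hypothetical bucket
        "hdfs:///tmp/crawl/crawldb",               # hypothetical HDFS path
        "s3a://example-bucket/crawl/segments/20170806000000",
    )
    for cmd in commands:
        print(" ".join(shlex.quote(part) for part in cmd))
```

Segments can stay on S3 the whole time, since only the CrawlDb update relies on moves.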

