Hi Tien,

Please see answers inline.

On Sat, Dec 5, 2015 at 9:34 AM, <[email protected]> wrote:

> I'm setting up the AWS cluster for Nutch 1.10 to crawl about 100M+ pages from the web.

OK, if I were you I would upgrade Nutch to 1.11, which has literally just been released.

> Could someone please advise about choosing AWS instance, storage:
> - We don't use EMR

Why not? Did you see Julien's screencast on this? https://t.co/c9BsaXhN80

> - Which AWS instance type is best for us?

If you use EMR you can make sure that you have a persistent head node, with the remainder being spot instances. This will reduce your cost overhead significantly; however, you do run the constant risk of losing your cluster at any given time. I would therefore recommend backing up the data in HDFS (possibly to S3) literally every hour or so.

> - Should we use EBS for storage?

Yes you can; however, the EMR and S3 approach has worked well for us for a good while now. You can use the s3cmd tool, which makes shifting data to and from S3 a piece of cake.

> - Which is best for Nutch 1.10: Hadoop 1.x or Hadoop 2.x?

Nutch 1.11 runs off of Hadoop 2.4.0, which lines up perfectly with the most recent AMI:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html

hth
Lewis
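P.S. For anyone wiring up the hourly HDFS-to-S3 backup suggested above, a minimal sketch follows. The bucket name and paths are hypothetical, and it assumes your S3 credentials are already configured for Hadoop (e.g. in core-site.xml) and for s3cmd (via `s3cmd --configure`):

```shell
# Hypothetical crontab entry: copy the crawl data out of HDFS to S3 every hour.
# `hadoop distcp` does a distributed copy; `s3n://` was the common S3 scheme
# on the Hadoop 2.4.x / EMR AMIs of this era.
0 * * * * hadoop distcp -update hdfs:///user/nutch/crawl s3n://my-backup-bucket/crawl-backup

# Alternatively, shift a local dump to and from S3 with s3cmd:
s3cmd sync /data/crawl-dump/ s3://my-backup-bucket/crawl-dump/
s3cmd get --recursive s3://my-backup-bucket/crawl-dump/ /data/restore/
```

With spot instances, the point is that the S3 copy survives even if the whole cluster is reclaimed, so you can spin up a fresh cluster and restore from the last hourly snapshot.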

