Thanks, Lewis, for the info.

Tien

On Tue, Dec 8, 2015 at 11:40 AM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Tien,
> Please see answers inline.
>
> On Sat, Dec 5, 2015 at 9:34 AM, <[email protected]> wrote:
>
> > I'm setting up the AWS cluster for Nutch 1.10 to crawl about 100M+ pages
> > from www.
>
> OK, if I were you I would upgrade Nutch to 1.11, which has literally just
> been released.

Sure, we will use Nutch 1.11.

> > Could someone please advise about choosing an AWS instance and storage:
> > - We don't use EMR
>
> Why not? Did you see Julien's screencast on this?
> https://t.co/c9BsaXhN80

Yes, I watched it.

> > - Which AWS instance type is best for us?
>
> If you use EMR you can make sure that you have a persistent headnode with
> the remainder being spot instances. This will reduce your cost overhead
> significantly; however, you do run the constant risk of losing your cluster
> at any given time. I would therefore recommend backing up the data in HDFS
> (possibly to S3) literally every hour or so.

In the long term we will not use AWS; that's why we don't use EMR and S3.

> > - Should we use EBS for storage?
>
> Yes you can. However, the EMR and S3 approach has worked well for us for a
> good while now. You can use the s3cmd tool, which makes shifting data to and
> from S3 a piece of cake.
>
> > - Which is best for Nutch 1.10: Hadoop 1.x or Hadoop 2.x?
>
> Nutch 1.11 runs off of Hadoop 2.4.0. This lines up perfectly with the most
> recent AMI:
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html
>
> hth
> Lewis

