Hi Tien,
Please see my answers inline.

On Sat, Dec 5, 2015 at 9:34 AM, <[email protected]> wrote:

>
> I'm setting the AWS cluster for Nutch 1.10 to crawl about 100M+ pages from
> www.
>

OK, if I were you I would upgrade to Nutch 1.11, which has literally just
been released.


>
> Could some one please advice about choosing aws instance, storage:
> - We don't use EMR
>

Why not? Did you see Julien's screencast on this?
https://t.co/c9BsaXhN80


> - Which aws instance type is best for us?
>

If you use EMR you can keep a persistent head node and run the remainder
as spot instances. This will reduce your cost overhead significantly;
however, you do run the constant risk of losing your cluster at any given
time. I would therefore recommend backing up the data in HDFS (possibly to
S3) literally every hour or so.
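
To make the hourly backup idea concrete, here is a rough sketch of how it
could be scripted and then run from cron. It only wraps `hadoop distcp`;
the HDFS path, the bucket name, the `nutch-backup` prefix and the `s3n://`
scheme are all assumptions on my part, so adjust them to whatever your
cluster and Hadoop build actually use.

```python
import subprocess
from datetime import datetime, timezone


def build_backup_command(hdfs_dir, bucket, when):
    """Build a `hadoop distcp` command that copies hdfs_dir to an
    hour-stamped prefix in S3 (prefix layout is a made-up convention)."""
    stamp = when.strftime("%Y%m%d-%H00")
    dest = f"s3n://{bucket}/nutch-backup/{stamp}/"  # scheme is an assumption
    return ["hadoop", "distcp", hdfs_dir, dest]


def run_backup(hdfs_dir, bucket):
    """Run the copy now; raises CalledProcessError if distcp fails."""
    cmd = build_backup_command(hdfs_dir, bucket, datetime.now(timezone.utc))
    subprocess.run(cmd, check=True)
```

Calling `run_backup("/user/nutch/crawl", "my-crawl-bucket")` once an hour
from the head node would give you one snapshot per hour; both arguments
are placeholders.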


> - Should we use EBS for storage?
>

Yes, you can. However, the EMR and S3 approach has worked well for us for a
good while now. You can use the s3cmd tool, which makes shifting data to and
from S3 a piece of cake.
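
As a rough illustration of the s3cmd side, the sketch below builds an
`s3cmd sync` command for either direction (local to S3, or S3 to local).
The directory and bucket names are hypothetical, and the helper itself is
just something I made up for this mail; only `s3cmd sync` and its
`--dry-run` option come from the tool.

```python
import subprocess


def s3cmd_sync(src, dst, dry_run=False):
    """Return the argument list for `s3cmd sync src dst`.
    With dry_run=True, s3cmd only lists what it would transfer."""
    cmd = ["s3cmd", "sync"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [src, dst]
    return cmd


def push_segments(local_dir, s3_url):
    """Upload a local directory tree to S3; raises if s3cmd fails."""
    subprocess.run(s3cmd_sync(local_dir, s3_url), check=True)
```

Swapping the two arguments, e.g. `s3cmd_sync("s3://my-crawl-bucket/segments/", "segments/")`,
pulls the data back down. It is worth doing a dry run first on anything
large.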


>
> - Which is best for Nutch 1.10: hadoop 1.x or hadoop 2.x?
>

Nutch 1.11 runs off of Hadoop 2.4.0, so go with Hadoop 2.x. This lines up
perfectly with the most recent AMI:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html
hth
Lewis
