Thanks Lewis for the info.

Tien

On Tue, Dec 8, 2015 at 11:40 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Tien,
> Please see answers inline
>
> On Sat, Dec 5, 2015 at 9:34 AM, <[email protected]> wrote:
>
> >
> > I'm setting up an AWS cluster for Nutch 1.10 to crawl about 100M+ pages
> > from www.
> >
>
> OK, if I were you I would upgrade to Nutch 1.11, which has literally just
> been released.
>
>
Sure, we will use Nutch 1.11.

> >
> > Could someone please advise on choosing an AWS instance type and storage:
> > - We don't use EMR
> >
>
> Why not? Did you see Julien's screencast on this?
> https://t.co/c9BsaXhN80
>
Yes, I watched it.

>
> > - Which AWS instance type is best for us?
> >
>
> If you use EMR you can make sure that you have a persistent head node with
> the remainder being spot instances. This will reduce your cost overhead
> significantly; however, you do run the constant risk of losing your cluster
> at any given time. I would therefore recommend backing up the data in HDFS
> (possibly to S3) literally every hour or so.
>

In the long term we will not use AWS; that's why we don't use EMR and S3.


> > - Should we use EBS for storage?
> >
>
> Yes, you can. However, the EMR and S3 approach has worked well for us for a
> good while now. You can use the s3cmd tool, which makes shifting data to and
> from S3 a piece of cake.
>
>
> >
> > - Which is best for Nutch 1.10: Hadoop 1.x or Hadoop 2.x?
> >
>
> Nutch 1.11 runs off of Hadoop 2.4.0, which lines up perfectly with the most
> recent AMI
>
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html
> hth
> Lewis
>
