Hi Andrei, thanks for your prompt replies and your help, I appreciate it. I share my comments and opinions inline.
On 31 October 2011 14:32, Andrei Savu <[email protected]> wrote:
> Answers inline.
>>
>> I was trying to start a Hadoop cluster of 20 datanodes|tasktrackers.
>>
>> What is the current upper bound?
>
> We haven't done any testing to find out, but it seems like when starting a
> cluster with ~20 nodes jclouds makes too many requests to AWS. We should be
> able to overcome this limitation by changing settings.

That would be wonderful. I am not aiming at launching clusters of hundreds of nodes using Whirr, but clusters in the range of tens of nodes, if possible at all, seem very reasonable to me. Too low a limit on the number of nodes Whirr can provision would, in my humble opinion, significantly harm the utility and potential of the project. It is a great and useful project, very easy to use, and when it works, it works beautifully. Kudos to all of you, thanks. But the cluster size issue can be a big problem in terms of adoption, and in my opinion it should be addressed (if at all possible).

>>> I have created a new JIRA issue so that we can add this automatically
>>> when the image-id is known:
>>> https://issues.apache.org/jira/browse/WHIRR-416
>>
>> I am looking forward to seeing if this will fix my problem and increase the
>> number of nodes of Hadoop clusters one can use via Whirr.
>
> I hope we are going to be able to get this in for 0.8.0.

Ack.

>>> What if you start a smaller size cluster but with more powerful machines?
>>
>> An option, but not a good one in the context of MapReduce, is it? :-)
>> m1.large instances are powerful (and expensive) enough for what I want to do.
>
> How about m1.xlarge? (twice as powerful - and *only* twice as expensive).

I might try that as well. Indeed, I was thinking of doing the opposite: using twice as many (or more) m1.small instances. My MapReduce jobs are typically very simple and do not require a lot of RAM. I am aware that this might not be the right thing to do... but I am curious and I want to experiment with it myself. I/O might be poor... I know.
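To make my setup concrete, the recipe I have in mind is roughly the following; a minimal sketch using the standard Whirr property names, where the cluster name and the AMI id are placeholders rather than real values:

  # Sketch of a Whirr recipe; cluster name and AMI id are placeholders.
  whirr.cluster-name=myhadoopcluster
  whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,20 hadoop-datanode+hadoop-tasktracker
  whirr.provider=aws-ec2
  whirr.identity=${env:AWS_ACCESS_KEY_ID}
  whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
  whirr.hardware-id=m1.small
  # Placeholder image-id; WHIRR-416 should make choosing this automatic.
  whirr.image-id=us-east-1/ami-xxxxxxxx
  whirr.location-id=us-east-1

With m1.small I would simply raise the datanode|tasktracker count in whirr.instance-templates rather than the instance size.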
> How are you using Apache Whirr? What's the end result?

We use MapReduce mainly in our ingestion pipeline; currently we use Amazon EMR jobs. We get some data in RDF format, we build Lucene indexes, TDB indexes [1], etc. Others use MapReduce jobs to gather stats or analytics over their datasets (again, mainly via EMR jobs currently). We deal mainly with RDF data, and often it is "human curated" data. Dataset sizes follow a sort of power-law distribution: there are many datasets of small-to-medium size and just a few large or huge ones. In the RDF world, large or huge means in the order of billions of triples/quads (i.e. "records"|lines).

I find Apache Whirr very good for testing, and I prefer to have full control over my software stack, which I achieve by using open source software. This way I can choose to be on the bleeding edge, upgrade when I need to, etc. When I have a problem, I can go as deep as I need to find out what went wrong. Support, in my experience, is faster, of better quality and more transparent for open source (and Apache) projects such as Hadoop and Whirr.

I am not sure whether with Amazon EMR it is possible to launch a job on an already running cluster; probably it is. It is certainly possible to do so using Whirr, and this is very useful when testing/developing, considering you pay for machines on EC2 by the hour. It is also useful in production when you have many short-running jobs.

I did not find a way with Amazon EMR jobs to actually browse HDFS via the Namenode UI, as you can do with Hadoop provisioned by Whirr. With Amazon EMR you can connect to the Namenode UI, but browsing does not work out-of-the-box for me.

For testing while I develop, I use MiniDFSCluster and MiniMRCluster; they help, even if they can be slow. However, you are never 100% sure until you test with a real cluster. Once I am almost there, I find it very useful to have a small cluster running and to iterate quickly to fix small issues. At that point, Whirr is what I use.
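In case it helps others, the pattern I use for that is roughly the following; a sketch against the old mapred API of Hadoop 0.20.x, where the class name is mine and the actual job under test is elided:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.hdfs.MiniDFSCluster;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MiniMRCluster;

  public class MiniClusterExample {
    public static void main(String[] args) throws Exception {
      // In-process HDFS with two datanodes, formatted on startup.
      Configuration conf = new Configuration();
      MiniDFSCluster dfs = new MiniDFSCluster(conf, 2, true, null);
      FileSystem fs = dfs.getFileSystem();

      // In-process MapReduce cluster with two tasktrackers on that HDFS.
      MiniMRCluster mr = new MiniMRCluster(2, fs.getUri().toString(), 1);
      try {
        // JobConf already wired to the mini cluster.
        JobConf jobConf = mr.createJobConf();
        // ... configure and run the job under test against jobConf ...
      } finally {
        mr.shutdown();
        dfs.shutdown();
      }
    }
  }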
Personally, I prefer to invest my time in stuff which does not lock me into a particular cloud provider (Whirr has this property).

> Your feedback is extremely important for our future roadmap.

So far I've used Whirr for Hadoop clusters only, but I am really happy to see that there is support for Cassandra, HBase, ElasticSearch and ZooKeeper. I might use these as well in the not too distant future.

Paolo

[1] https://github.com/castagna/tdbloader3