Christian,

Is there anything preventing you from using EMR, which will manage your
cluster for you? Creating large clusters would take minutes on EMR instead of
hours. Also, EMR supports growing your cluster easily and recently added
support for shrinking your cluster gracefully (even while jobs are running).

~ Jonathan

On Thu, Nov 5, 2015 at 9:48 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Yeah, as Shivaram mentioned, this issue is well-known. It's documented in
> SPARK-5189 <https://issues.apache.org/jira/browse/SPARK-5189> and a bunch
> of related issues. Unfortunately, it's hard to resolve this issue in
> spark-ec2 without rewriting large parts of the project. But if you take a
> crack at it and succeed I'm sure a lot of people will be happy.
>
> I've started a separate project <https://github.com/nchammas/flintrock> --
> which Shivaram also mentioned -- which aims to solve the problem of long
> launch times and other issues
> <https://github.com/nchammas/flintrock#motivation> with spark-ec2. It's
> still very young and lacks several critical features, but we are making
> steady progress.
>
> Nick
>
> On Thu, Nov 5, 2015 at 12:30 PM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> It is a known limitation that spark-ec2 is very slow for large
>> clusters and, as you mention, most of this is due to the use of rsync to
>> transfer things from the master to all the slaves.
>>
>> Nick (cc'd) has been working on an alternative approach at
>> https://github.com/nchammas/flintrock that is more scalable.
>>
>> Thanks
>> Shivaram
>>
>> On Thu, Nov 5, 2015 at 8:12 AM, Christian <engr...@gmail.com> wrote:
>> > For starters, thanks for the awesome product!
>> >
>> > When creating EC2 clusters of 20-40 nodes, things work great. With larger
>> > clusters, the provided spark-ec2 script takes hours: a 200-node cluster
>> > takes 2 1/2 hours, and a 500-node cluster takes over 5 hours. One other
>> > problem we are having is that some nodes don't come up when the other
>> > ones do, and the process seems to just move on, skipping the rsync and
>> > any installs on those nodes.
>> >
>> > My guess as to why it takes so long to set up a large cluster is the use
>> > of rsync. What if, instead of using rsync, you synced to S3 and then
>> > used pdsh to pull it down on all of the machines? This is a big deal for
>> > us, and if we can come up with a good plan, we might be able to help out
>> > with the required changes.
>> >
>> > Are there any suggestions on how to deal with some of the nodes not being
>> > ready when the process starts?
>> >
>> > Thanks for your time,
>> > Christian
>> >
>>
>
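
On the question of nodes not being ready when setup starts: one possible approach, sketched below, is to poll each node's SSH port until it accepts connections (or a deadline passes) and then report the stragglers explicitly so they can be retried rather than silently skipped. This is only an illustration, not how spark-ec2 or flintrock handle it. The function names, the timeout values, and the hostnames in the usage note are all hypothetical.

```python
import socket
import time
from typing import Callable, Iterable, List, Tuple


def ssh_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_nodes(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = ssh_port_open,
    max_wait: float = 600.0,   # assumed overall deadline, in seconds
    interval: float = 5.0,     # assumed pause between polling rounds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], List[str]]:
    """Poll hosts until each one passes the probe or the deadline passes.

    Returns (ready, not_ready) so the caller can decide what to do with
    nodes that never came up, instead of moving on without them.
    """
    pending = list(hosts)
    ready: List[str] = []
    deadline = clock() + max_wait
    while pending:
        still_pending = []
        for host in pending:
            if probe(host):
                ready.append(host)
            else:
                still_pending.append(host)
        pending = still_pending
        if not pending or clock() >= deadline:
            break
        sleep(interval)
    return ready, pending
```

A caller could run something like `ready, not_ready = wait_for_nodes(["node-1", "node-2"])`, run the sync step only against `ready`, and retry or replace whatever is left in `not_ready`.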
