Qubole is one option that lets you use spot instances and gives you a couple
of other benefits. We use Qubole at Manthan for our Spark workloads.

To ensure all the nodes are ready, you could use the
spark.scheduler.minRegisteredResourcesRatio config property so that
execution doesn't start until the requisite executors/containers have
registered.
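
For example, a rough sketch with spark-submit (the 0.9 ratio, the 3-minute
wait, and the app/jar names below are only illustrative, and the time-string
syntax assumes Spark 1.4 or later):

    # Don't schedule tasks until 90% of the requested executors have
    # registered, or until 3 minutes have passed, whichever comes first.
    spark-submit \
      --conf spark.scheduler.minRegisteredResourcesRatio=0.9 \
      --conf spark.scheduler.maxRegisteredResourcesWaitingTime=180s \
      --class com.example.YourApp \
      your-app.jar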

Regards
Sab
On 06-Nov-2015 12:22 am, "Christian" <engr...@gmail.com> wrote:

> Let me rephrase: the EMR cost is about twice as much as the spot price,
> making it almost 2/3 of the overall cost.
> On Thu, Nov 5, 2015 at 11:50 AM Christian <engr...@gmail.com> wrote:
>
>> Hi Johnathan,
>>
>> We are using EMR now and it's costing way too much. We do spot pricing,
>> and the EMR add-on cost is about 2/3 the price of the actual spot instance.
>> On Thu, Nov 5, 2015 at 11:31 AM Jonathan Kelly <jonathaka...@gmail.com>
>> wrote:
>>
>>> Christian,
>>>
>>> Is there anything preventing you from using EMR, which will manage your
>>> cluster for you? Creating large clusters would take minutes on EMR instead
>>> of hours. Also, EMR makes it easy to grow your cluster, and it recently
>>> added support for shrinking your cluster gracefully (even while jobs are
>>> running).
>>>
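>>> For example, resizing a running cluster can be done with a single CLI
>>> call, something like this (the instance group ID and target count below
>>> are placeholders):
>>>
>>>     # Grow (or shrink) a core/task instance group of a running cluster.
>>>     aws emr modify-instance-groups \
>>>         --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=200
>>>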
>>> ~ Jonathan
>>>
>>> On Thu, Nov 5, 2015 at 9:48 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Yeah, as Shivaram mentioned, this issue is well-known. It's documented
>>>> in SPARK-5189 <https://issues.apache.org/jira/browse/SPARK-5189> and a
>>>> bunch of related issues. Unfortunately, it's hard to resolve this issue in
>>>> spark-ec2 without rewriting large parts of the project. But if you take a
>>>> crack at it and succeed, I'm sure a lot of people will be happy.
>>>>
>>>> I've started a separate project <https://github.com/nchammas/flintrock>
>>>> -- which Shivaram also mentioned -- that aims to solve the problem of
>>>> long launch times and other issues
>>>> <https://github.com/nchammas/flintrock#motivation> with spark-ec2.
>>>> It's still very young and lacks several critical features, but we are
>>>> making steady progress.
>>>>
>>>> Nick
>>>>
>>>> On Thu, Nov 5, 2015 at 12:30 PM Shivaram Venkataraman <
>>>> shiva...@eecs.berkeley.edu> wrote:
>>>>
>>>>> It is a known limitation that spark-ec2 is very slow for large
>>>>> clusters, and, as you mention, most of this is due to the use of rsync
>>>>> to transfer things from the master to all the slaves.
>>>>>
>>>>> Nick (cc'd) has been working on an alternative approach at
>>>>> https://github.com/nchammas/flintrock that is more scalable.
>>>>>
>>>>> Thanks
>>>>> Shivaram
>>>>>
>>>>> On Thu, Nov 5, 2015 at 8:12 AM, Christian <engr...@gmail.com> wrote:
>>>>> > For starters, thanks for the awesome product!
>>>>> >
>>>>> > When creating EC2 clusters of 20-40 nodes, things work great. But when
>>>>> > we create a larger cluster with the provided spark-ec2 script, it takes
>>>>> > hours: a 200-node cluster takes 2 1/2 hours, and a 500-node cluster
>>>>> > takes over 5 hours. One other problem we are having is that some nodes
>>>>> > don't come up when the others do; the process seems to just move on,
>>>>> > skipping the rsync and any installs on those nodes.
>>>>> >
>>>>> > My guess as to why it takes so long to set up a large cluster is the
>>>>> > use of rsync. What if, instead of rsync, you synced to S3 and then used
>>>>> > pdsh to pull it down on all of the machines? This is a big deal for us,
>>>>> > and if we can come up with a good plan, we might be able to help out
>>>>> > with the required changes.
>>>>> >
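>>>>> > Very roughly, and just as a sketch of the idea (hypothetical bucket and
>>>>> > host-file names, and assuming the slaves have credentials or an instance
>>>>> > role that can read the bucket), it could look something like:
>>>>> >
>>>>> >     # On the master: push the Spark distribution and configs to S3 once.
>>>>> >     aws s3 sync /root/spark s3://my-cluster-bucket/spark
>>>>> >
>>>>> >     # Then pull it down on all the slaves in parallel with pdsh,
>>>>> >     # reading the slave hostnames from a file.
>>>>> >     pdsh -R ssh -w ^slave_hosts.txt \
>>>>> >         'aws s3 sync s3://my-cluster-bucket/spark /root/spark'
>>>>> >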
>>>>> > Are there any suggestions on how to deal with some of the nodes not
>>>>> > being ready when the process starts?
>>>>> >
>>>>> > Thanks for your time,
>>>>> > Christian
>>>>> >
>>>>>
>>>>
>>>
