Re: Mesos fine-grained multi-user mode failed to allocate tasks

Rahul Palamuttam Thu, 14 Jul 2016 20:51:07 -0700

Hallelujah!

We'll definitely take a look at Cook. 
Right now we're observing that both fine-grained and coarse-grained jobs take 
quite a bit of time to even be staged by Mesos.


We're sitting there waiting on the interpreter/shell for quite a few minutes.

> On Jul 14, 2016, at 7:49 PM, David Greenberg <[email protected]> wrote:
> 
> By true multitenancy, I mean preemption, so that if a new user connects to 
> the cluster, their capacity is actually reclaimed and reallocated in minutes 
> or seconds instead of hours. 
>> On Wed, Jul 13, 2016 at 7:11 PM Rahul Palamuttam <[email protected]> 
>> wrote:
>> Thanks David.
>> We will definitely take a look at Cook.
>> 
>> I am curious what you mean by true multi-tenancy.
>> 
>> Under coarse-grained mode with dynamic allocation enabled - what I see in 
>> the Mesos UI is that there are 3 tasks running by default (one on each of 
>> the nodes we have).
>> I also see the coarsegrainedexecutors being brought up.
>> 
>> Another point is that I always see a spark-submit command being launched; 
>> even if I kill that command, it comes back up and the executors get 
>> reallocated on the worker nodes.
>> However, I am able to launch multiple spark shells and have jobs run 
>> concurrently - which we were very happy with.
>> Unfortunately, I don't understand why mesos only shows 3 tasks running. I 
>> even see the spike in thread count when launching my jobs, but the task 
>> count remains unchanged.
>> The Mesos logs do show jobs coming in.
>> The three tasks just sit there in the webui - running.
>> 
>> Is this what is expected?
>> Does running coarsegrained with dynamic allocation make mesos look at each 
>> running executor as a different task?
>> 
>> 
>> 
>> 
>>> On Wed, Jul 13, 2016 at 4:34 PM, David Greenberg <[email protected]> 
>>> wrote:
>>> You could also check out Cook from Two Sigma. It's open source on GitHub, 
>>> and offers true preemptive multitenancy with spark on Mesos, by 
>>> intermediating the spark drivers to optimize the cluster overall. 
>>>> On Wed, Jul 13, 2016 at 3:41 PM Rahul Palamuttam <[email protected]> 
>>>> wrote:
>>>> Thank you Joseph.
>>>> 
>>>> We'll try to explore coarse grained mode with dynamic allocation. 
>>>> 
>>>>> On Wed, Jul 13, 2016 at 12:28 PM, Joseph Wu <[email protected]> wrote:
>>>>> Looks like you're running Spark in "fine-grained" mode (deprecated).
>>>>> 
>>>>> (The Spark website appears to be down right now, so here's the doc on 
>>>>> Github:)
>>>>> https://github.com/apache/spark/blob/master/docs/running-on-mesos.md#fine-grained-deprecated
>>>>> 
>>>>>> Note that while Spark tasks in fine-grained will relinquish cores as 
>>>>>> they terminate, they will not relinquish memory, as the JVM does not 
>>>>>> give memory back to the Operating System. Neither will executors 
>>>>>> terminate when they're idle.
>>>>> 
>>>>> You can follow some of the recommendations Spark has in that document for 
>>>>> sharing resources, when using Mesos. 
>>>>> 
>>>>>> On Wed, Jul 13, 2016 at 12:12 PM, Rahul Palamuttam 
>>>>>> <[email protected]> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> Our team has been tackling multi-tenancy related issues with Mesos for 
>>>>>> quite some time.
>>>>>> 
>>>>>> The problem is that tasks aren't being allocated properly when multiple 
>>>>>> applications are trying to launch a job. If we launch application A, and 
>>>>>> soon after application B, application B waits pretty much till the 
>>>>>> completion of application A for tasks to even be staged in Mesos. Right 
>>>>>> now these applications are the spark-shell or the zeppelin interpreter. 
>>>>>> 
>>>>>> Even a simple sc.parallelize(1 to 10000000).reduce(_ + _) launched in two 
>>>>>> different spark-shells results in the issue we're observing. One of the 
>>>>>> counts waits (in fact we don't even see the tasks being staged in mesos) 
>>>>>> until the current one finishes. This is the biggest issue we have been 
>>>>>> experiencing, and any help or advice would be greatly appreciated. We want 
>>>>>> to be able to launch multiple jobs concurrently on our cluster and share 
>>>>>> resources appropriately. 
>>>>>> 
>>>>>> Another issue we see is that the java heap-space on the mesos executor 
>>>>>> backend process is not being cleaned up once a job has finished in the 
>>>>>> spark shell. 
>>>>>> I've attached a png file of the jvisualvm output showing that the 
>>>>>> heapspace is still allocated on a worker node. If I force the GC from 
>>>>>> jvisualvm then nearly all of that memory gets cleaned up. This may be 
>>>>>> because the spark-shell is still active - but if we've waited long 
>>>>>> enough why doesn't GC just clean up the space? However, even after 
>>>>>> forcing GC the mesos UI shows us that these resources are still being 
>>>>>> used.
>>>>>> There should be a way to bring down the memory utilization of the 
>>>>>> executors once a task is finished. It shouldn't continue to have that 
>>>>>> memory allocated, even if a spark-shell is active on the driver.
>>>>>> 
>>>>>> We have mesos configured to use fine-grained mode. 
>>>>>> The following are parameters we have set in our spark-defaults.conf file.
>>>>>> 
>>>>>> 
>>>>>> spark.eventLog.enabled           true
>>>>>> spark.eventLog.dir               hdfs://frontend-system:8090/directory
>>>>>> spark.local.dir                    /data/cluster-local/SPARK_TMP
>>>>>> 
>>>>>> spark.executor.memory            50g
>>>>>> 
>>>>>> spark.externalBlockStore.baseDir /data/cluster-local/SPARK_TMP
>>>>>> spark.executor.extraJavaOptions  -XX:MaxTenuringThreshold=0 
>>>>>> spark.executor.uri               hdfs://frontend-system:8090/spark/spark-1.6.0-bin-hadoop2.4.tgz
>>>>>> spark.mesos.coarse      false
>>>>>> 
>>>>>> Please let me know if there are any questions about our configuration.
>>>>>> Any advice or experience the mesos community can share pertaining to 
>>>>>> issues with fine-grained mode would be greatly appreciated!
>>>>>> 
>>>>>> I would also like to sincerely apologize for my previous test message on 
>>>>>> the mailing list.
>>>>>> It was an ill-conceived idea since we are in a bit of a time crunch and 
>>>>>> I needed to get this message posted. I forgot that I needed to reply 
>>>>>> to the user-subscribers email in order to be listed, which resulted in 
>>>>>> "message not sent" emails. I will not do that again. 
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Rahul Palamuttam
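
For reference, the move to coarse-grained mode with dynamic allocation 
discussed in this thread could look roughly like the following additions to 
spark-defaults.conf. This is only a sketch, assuming Spark 1.6 on Mesos as in 
the original configuration. The spark.cores.max and idle-timeout values are 
illustrative assumptions, not settings taken from this thread, and dynamic 
allocation also needs Spark's external shuffle service running on every agent 
node.

```
# Coarse-grained mode: long-running executors instead of one Mesos task
# per Spark task (replaces "spark.mesos.coarse false" above)
spark.mesos.coarse                          true

# Let Spark request and release executors based on pending work
spark.dynamicAllocation.enabled             true
# Required by dynamic allocation so shuffle files outlive removed executors
spark.shuffle.service.enabled               true
# Release an executor after it has been idle this long (assumed value)
spark.dynamicAllocation.executorIdleTimeout 60s

# Cap the cores one application may hold so that a second spark-shell can
# still receive offers (assumed value; size it to the cluster)
spark.cores.max                             24
```

In coarse-grained mode each executor runs as a single long-lived Mesos task, 
which would be consistent with the three tasks (one per node) that stay in the 
"running" state in the Mesos UI while jobs come and go.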
