Re: Spark (Streaming?) holding on to Mesos resources

2015-01-29 Thread Gerard Maas
Thanks a lot.

After reading MESOS-1688, I still don't understand how or why a job would
hoard and hold on to so many resources even in the presence of that bug.
Looking at the release notes, I think this ticket could be relevant to
preventing the behavior we're seeing:
[MESOS-186] - Resource offers should be rescinded after some configurable
timeout

Bottom line, we're following your advice and we're testing Mesos 0.21 on
dev to roll out to our prod platforms later on.

Thanks!!

-kr, Gerard.


On Tue, Jan 27, 2015 at 9:15 PM, Tim Chen  wrote:

> Hi Gerard,
>
> As others have mentioned, I believe you're hitting MESOS-1688. Can you
> upgrade to the latest Mesos release (0.21.1) and let us know if it resolves
> your problem?
>
> Thanks,
>
> Tim
>
> On Tue, Jan 27, 2015 at 10:39 AM, Sam Bessalah 
> wrote:
>
>> Hi Gerard,
>> isn't this the same issue as this one?
>> https://issues.apache.org/jira/browse/MESOS-1688
>>
>> On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas 
>> wrote:
>>
>>> Hi,
>>>
>>> We are observing with some regularity that our Spark jobs, running as
>>> Mesos frameworks, are hoarding resources and not releasing them, resulting
>>> in resource starvation for all jobs running on the Mesos cluster.
>>>
>>> For example:
>>> This is a job that has spark.cores.max = 4 and spark.executor.memory="3g"
>>>
>>> | ID | Framework | Host | CPUs | Mem
>>> …5050-16506-1146497 FooStreaming dnode-4.hdfs.private 7 13.4 GB
>>> …5050-16506-1146495 FooStreaming dnode-0.hdfs.private 1 6.4 GB
>>> …5050-16506-1146491 FooStreaming dnode-5.hdfs.private 7 11.9 GB
>>> …5050-16506-1146449 FooStreaming dnode-3.hdfs.private 7 4.9 GB
>>> …5050-16506-1146247 FooStreaming dnode-1.hdfs.private 0.5 5.9 GB
>>> …5050-16506-1146226 FooStreaming dnode-2.hdfs.private 3 7.9 GB
>>> …5050-16506-1144069 FooStreaming dnode-3.hdfs.private 1 8.7 GB
>>> …5050-16506-1133091 FooStreaming dnode-5.hdfs.private 1 1.7 GB
>>> …5050-16506-1133090 FooStreaming dnode-2.hdfs.private 5 5.2 GB
>>> …5050-16506-1133089 FooStreaming dnode-1.hdfs.private 6.5 6.3 GB
>>> …5050-16506-1133088 FooStreaming dnode-4.hdfs.private 1 251 MB
>>> …5050-16506-1133087 FooStreaming dnode-0.hdfs.private 6.4 6.8 GB
>>>
>>> The only way to release the resources is to manually find the process in
>>> the cluster and kill it. The jobs are often streaming jobs, but batch jobs
>>> show this behavior as well. We have more streaming jobs than batch jobs,
>>> so the stats are biased.
>>> Any ideas of what's going on here? Hopefully it's some bad, ugly bug that
>>> has already been fixed and will urge us to upgrade our infra?
>>>
>>> Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0
>>>
>>> -kr, Gerard.
>>>
>>
>>
>


Re: Spark (Streaming?) holding on to Mesos resources

2015-01-27 Thread Tim Chen
Hi Gerard,

As others have mentioned, I believe you're hitting MESOS-1688. Can you
upgrade to the latest Mesos release (0.21.1) and let us know if it resolves
your problem?

Thanks,

Tim

On Tue, Jan 27, 2015 at 10:39 AM, Sam Bessalah 
wrote:

> Hi Gerard,
> isn't this the same issue as this one?
> https://issues.apache.org/jira/browse/MESOS-1688
>
> On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas 
> wrote:
>
>> Hi,
>>
>> We are observing with some regularity that our Spark jobs, running as
>> Mesos frameworks, are hoarding resources and not releasing them, resulting
>> in resource starvation for all jobs running on the Mesos cluster.
>>
>> For example:
>> This is a job that has spark.cores.max = 4 and spark.executor.memory="3g"
>>
>> | ID | Framework | Host | CPUs | Mem
>> …5050-16506-1146497 FooStreaming dnode-4.hdfs.private 7 13.4 GB
>> …5050-16506-1146495 FooStreaming dnode-0.hdfs.private 1 6.4 GB
>> …5050-16506-1146491 FooStreaming dnode-5.hdfs.private 7 11.9 GB
>> …5050-16506-1146449 FooStreaming dnode-3.hdfs.private 7 4.9 GB
>> …5050-16506-1146247 FooStreaming dnode-1.hdfs.private 0.5 5.9 GB
>> …5050-16506-1146226 FooStreaming dnode-2.hdfs.private 3 7.9 GB
>> …5050-16506-1144069 FooStreaming dnode-3.hdfs.private 1 8.7 GB
>> …5050-16506-1133091 FooStreaming dnode-5.hdfs.private 1 1.7 GB
>> …5050-16506-1133090 FooStreaming dnode-2.hdfs.private 5 5.2 GB
>> …5050-16506-1133089 FooStreaming dnode-1.hdfs.private 6.5 6.3 GB
>> …5050-16506-1133088 FooStreaming dnode-4.hdfs.private 1 251 MB
>> …5050-16506-1133087 FooStreaming dnode-0.hdfs.private 6.4 6.8 GB
>>
>> The only way to release the resources is to manually find the process in
>> the cluster and kill it. The jobs are often streaming jobs, but batch jobs
>> show this behavior as well. We have more streaming jobs than batch jobs,
>> so the stats are biased.
>> Any ideas of what's going on here? Hopefully it's some bad, ugly bug that
>> has already been fixed and will urge us to upgrade our infra?
>>
>> Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0
>>
>> -kr, Gerard.
>>
>
>


Re: Spark (Streaming?) holding on to Mesos resources

2015-01-27 Thread Sam Bessalah
Hi Gerard,
isn't this the same issue as this one?
https://issues.apache.org/jira/browse/MESOS-1688

On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas  wrote:

> Hi,
>
> We are observing with some regularity that our Spark jobs, running as
> Mesos frameworks, are hoarding resources and not releasing them, resulting
> in resource starvation for all jobs running on the Mesos cluster.
>
> For example:
> This is a job that has spark.cores.max = 4 and spark.executor.memory="3g"
>
> | ID | Framework | Host | CPUs | Mem
> …5050-16506-1146497 FooStreaming dnode-4.hdfs.private 7 13.4 GB
> …5050-16506-1146495 FooStreaming dnode-0.hdfs.private 1 6.4 GB
> …5050-16506-1146491 FooStreaming dnode-5.hdfs.private 7 11.9 GB
> …5050-16506-1146449 FooStreaming dnode-3.hdfs.private 7 4.9 GB
> …5050-16506-1146247 FooStreaming dnode-1.hdfs.private 0.5 5.9 GB
> …5050-16506-1146226 FooStreaming dnode-2.hdfs.private 3 7.9 GB
> …5050-16506-1144069 FooStreaming dnode-3.hdfs.private 1 8.7 GB
> …5050-16506-1133091 FooStreaming dnode-5.hdfs.private 1 1.7 GB
> …5050-16506-1133090 FooStreaming dnode-2.hdfs.private 5 5.2 GB
> …5050-16506-1133089 FooStreaming dnode-1.hdfs.private 6.5 6.3 GB
> …5050-16506-1133088 FooStreaming dnode-4.hdfs.private 1 251 MB
> …5050-16506-1133087 FooStreaming dnode-0.hdfs.private 6.4 6.8 GB
>
> The only way to release the resources is to manually find the process in
> the cluster and kill it. The jobs are often streaming jobs, but batch jobs
> show this behavior as well. We have more streaming jobs than batch jobs,
> so the stats are biased.
> Any ideas of what's going on here? Hopefully it's some bad, ugly bug that
> has already been fixed and will urge us to upgrade our infra?
>
> Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0
>
> -kr, Gerard.
>


Re: Spark (Streaming?) holding on to Mesos Resources

2015-01-27 Thread Adam Bordelon
> Hopefully it's some bad, ugly bug that has already been fixed and will
> urge us to upgrade our infra?
> Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0
Could be https://issues.apache.org/jira/browse/MESOS-1688 (fixed in Mesos
0.21)

On Mon, Jan 26, 2015 at 2:45 PM, Gerard Maas  wrote:

> Hi Jörn,
>
> A memory leak on the job would be contained within the resources reserved
> for it, wouldn't it?
> And the job holding resources is not always the same. Sometimes it's one
> of the Streaming jobs, sometimes it's a heavy batch job that runs every
> hour.
> It looks to me like whatever is causing the issue is participating in the
> resource offer protocol of Mesos, and my first suspect would be the Mesos
> scheduler in Spark. (The table above is the "Offers" tab from the Mesos UI.)
>
> Are there any other factors involved in the offer acceptance/rejection
> between Mesos and a scheduler?
>
> What do you think?
>
> -kr, Gerard.
>
> On Mon, Jan 26, 2015 at 11:23 PM, Jörn Franke 
> wrote:
>
>> Hi,
>>
>> What do your jobs do? Ideally post the source code, but some description
>> would already be helpful so we can support you.
>>
>> Memory leaks can have several causes - it may not be Spark at all.
>>
>> Thank you.
>>
>> On 26 Jan 2015 22:28, "Gerard Maas"  wrote:
>>
>> >
>> > (looks like the list didn't like an HTML table in the previous email;
>> > apologies for any duplicates)
>> >
>> > Hi,
>> >
>> > We are observing with some regularity that our Spark jobs, running as
>> > Mesos frameworks, are hoarding resources and not releasing them, resulting
>> > in resource starvation for all jobs running on the Mesos cluster.
>> >
>> > For example:
>> > This is a job that has spark.cores.max = 4 and spark.executor.memory="3g"
>> >
>> > | ID | Framework | Host | CPUs | Mem
>> > …5050-16506-1146497 FooStreaming dnode-4.hdfs.private 7 13.4 GB
>> > …5050-16506-1146495 FooStreaming dnode-0.hdfs.private 1 6.4 GB
>> > …5050-16506-1146491 FooStreaming dnode-5.hdfs.private 7 11.9 GB
>> > …5050-16506-1146449 FooStreaming dnode-3.hdfs.private 7 4.9 GB
>> > …5050-16506-1146247 FooStreaming dnode-1.hdfs.private 0.5 5.9 GB
>> > …5050-16506-1146226 FooStreaming dnode-2.hdfs.private 3 7.9 GB
>> > …5050-16506-1144069 FooStreaming dnode-3.hdfs.private 1 8.7 GB
>> > …5050-16506-1133091 FooStreaming dnode-5.hdfs.private 1 1.7 GB
>> > …5050-16506-1133090 FooStreaming dnode-2.hdfs.private 5 5.2 GB
>> > …5050-16506-1133089 FooStreaming dnode-1.hdfs.private 6.5 6.3 GB
>> > …5050-16506-1133088 FooStreaming dnode-4.hdfs.private 1 251 MB
>> > …5050-16506-1133087 FooStreaming dnode-0.hdfs.private 6.4 6.8 GB
>> >
>> > The only way to release the resources is to manually find the process in
>> > the cluster and kill it. The jobs are often streaming jobs, but batch jobs
>> > show this behavior as well. We have more streaming jobs than batch jobs,
>> > so the stats are biased.
>> > Any ideas of what's going on here? Hopefully it's some bad, ugly bug that
>> > has already been fixed and will urge us to upgrade our infra?
>> >
>> > Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0
>> >
>> > -kr, Gerard.
>>
>>
>


Re: Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Gerard Maas
Hi Jörn,

A memory leak on the job would be contained within the resources reserved
for it, wouldn't it?
And the job holding resources is not always the same. Sometimes it's one of
the Streaming jobs, sometimes it's a heavy batch job that runs every hour.
It looks to me like whatever is causing the issue is participating in the
resource offer protocol of Mesos, and my first suspect would be the Mesos
scheduler in Spark. (The table above is the "Offers" tab from the Mesos UI.)

Are there any other factors involved in the offer acceptance/rejection
between Mesos and a scheduler?

What do you think?
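
(One factor, as an illustration: an offer stays allocated to a framework until
the scheduler accepts or declines it, or the master rescinds it, so a scheduler
that silently sits on offers will appear to hold those resources - which is
what the configurable offer timeout of MESOS-186, mentioned elsewhere in this
thread, is meant to address. Declining promptly, optionally with a
refuse_seconds filter, hands the resources back. A minimal sketch against the
Mesos Java scheduler bindings - the helper below is illustrative only, not
Spark's actual scheduler code:)

import scala.collection.JavaConverters._

import org.apache.mesos.SchedulerDriver
import org.apache.mesos.Protos.{Filters, Offer}

// Illustrative helper (hypothetical name): decline offers the framework does
// not need, attaching a refuse_seconds filter so the master withholds those
// resources for a while instead of re-offering them immediately.
object OfferHandling {
  def declineUnneeded(driver: SchedulerDriver,
                      offers: java.util.List[Offer],
                      refuseSeconds: Double): Unit = {
    val filters = Filters.newBuilder().setRefuseSeconds(refuseSeconds).build()
    offers.asScala.foreach { offer =>
      // Without an explicit Filters argument, the default refuse_seconds (5s) applies.
      driver.declineOffer(offer.getId, filters)
    }
  }
}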

-kr, Gerard.

On Mon, Jan 26, 2015 at 11:23 PM, Jörn Franke  wrote:

> Hi,
>
> What do your jobs do? Ideally post the source code, but some description
> would already be helpful so we can support you.
>
> Memory leaks can have several causes - it may not be Spark at all.
>
> Thank you.
>
> On 26 Jan 2015 22:28, "Gerard Maas"  wrote:
>
> >
> > (looks like the list didn't like an HTML table in the previous email;
> > apologies for any duplicates)
> >
> > Hi,
> >
> > We are observing with some regularity that our Spark jobs, running as
> > Mesos frameworks, are hoarding resources and not releasing them, resulting
> > in resource starvation for all jobs running on the Mesos cluster.
> >
> > For example:
> > This is a job that has spark.cores.max = 4 and spark.executor.memory="3g"
> >
> > | ID | Framework | Host | CPUs | Mem
> > …5050-16506-1146497 FooStreaming dnode-4.hdfs.private 7 13.4 GB
> > …5050-16506-1146495 FooStreaming dnode-0.hdfs.private 1 6.4 GB
> > …5050-16506-1146491 FooStreaming dnode-5.hdfs.private 7 11.9 GB
> > …5050-16506-1146449 FooStreaming dnode-3.hdfs.private 7 4.9 GB
> > …5050-16506-1146247 FooStreaming dnode-1.hdfs.private 0.5 5.9 GB
> > …5050-16506-1146226 FooStreaming dnode-2.hdfs.private 3 7.9 GB
> > …5050-16506-1144069 FooStreaming dnode-3.hdfs.private 1 8.7 GB
> > …5050-16506-1133091 FooStreaming dnode-5.hdfs.private 1 1.7 GB
> > …5050-16506-1133090 FooStreaming dnode-2.hdfs.private 5 5.2 GB
> > …5050-16506-1133089 FooStreaming dnode-1.hdfs.private 6.5 6.3 GB
> > …5050-16506-1133088 FooStreaming dnode-4.hdfs.private 1 251 MB
> > …5050-16506-1133087 FooStreaming dnode-0.hdfs.private 6.4 6.8 GB
> >
> > The only way to release the resources is to manually find the process in
> > the cluster and kill it. The jobs are often streaming jobs, but batch jobs
> > show this behavior as well. We have more streaming jobs than batch jobs,
> > so the stats are biased.
> > Any ideas of what's going on here? Hopefully it's some bad, ugly bug that
> > has already been fixed and will urge us to upgrade our infra?
> >
> > Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0
> >
> > -kr, Gerard.
>
>


Re: Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Jörn Franke
Hi,

What do your jobs do? Ideally post the source code, but some description
would already be helpful so we can support you.

Memory leaks can have several causes - it may not be Spark at all.

Thank you.

On 26 Jan 2015 22:28, "Gerard Maas"  wrote:
>
> (looks like the list didn't like an HTML table in the previous email;
> apologies for any duplicates)
>
> Hi,
>
> We are observing with some regularity that our Spark jobs, running as
> Mesos frameworks, are hoarding resources and not releasing them, resulting
> in resource starvation for all jobs running on the Mesos cluster.
>
> For example:
> This is a job that has spark.cores.max = 4 and spark.executor.memory="3g"
>
> | ID | Framework | Host | CPUs | Mem
> …5050-16506-1146497 FooStreaming dnode-4.hdfs.private 7 13.4 GB
> …5050-16506-1146495 FooStreaming dnode-0.hdfs.private 1 6.4 GB
> …5050-16506-1146491 FooStreaming dnode-5.hdfs.private 7 11.9 GB
> …5050-16506-1146449 FooStreaming dnode-3.hdfs.private 7 4.9 GB
> …5050-16506-1146247 FooStreaming dnode-1.hdfs.private 0.5 5.9 GB
> …5050-16506-1146226 FooStreaming dnode-2.hdfs.private 3 7.9 GB
> …5050-16506-1144069 FooStreaming dnode-3.hdfs.private 1 8.7 GB
> …5050-16506-1133091 FooStreaming dnode-5.hdfs.private 1 1.7 GB
> …5050-16506-1133090 FooStreaming dnode-2.hdfs.private 5 5.2 GB
> …5050-16506-1133089 FooStreaming dnode-1.hdfs.private 6.5 6.3 GB
> …5050-16506-1133088 FooStreaming dnode-4.hdfs.private 1 251 MB
> …5050-16506-1133087 FooStreaming dnode-0.hdfs.private 6.4 6.8 GB
>
> The only way to release the resources is to manually find the process in
> the cluster and kill it. The jobs are often streaming jobs, but batch jobs
> show this behavior as well. We have more streaming jobs than batch jobs,
> so the stats are biased.
> Any ideas of what's going on here? Hopefully it's some bad, ugly bug that
> has already been fixed and will urge us to upgrade our infra?
>
> Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0
>
> -kr, Gerard.


Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Gerard Maas
(looks like the list didn't like an HTML table in the previous email;
apologies for any duplicates)

Hi,

We are observing with some regularity that our Spark jobs, running as Mesos
frameworks, are hoarding resources and not releasing them, resulting in
resource starvation for all jobs running on the Mesos cluster.

For example:
This is a job that has spark.cores.max = 4 and spark.executor.memory="3g"

| ID | Framework | Host | CPUs | Mem
…5050-16506-1146497 FooStreaming dnode-4.hdfs.private 7 13.4 GB
…5050-16506-1146495 FooStreaming dnode-0.hdfs.private 1 6.4 GB
…5050-16506-1146491 FooStreaming dnode-5.hdfs.private 7 11.9 GB
…5050-16506-1146449 FooStreaming dnode-3.hdfs.private 7 4.9 GB
…5050-16506-1146247 FooStreaming dnode-1.hdfs.private 0.5 5.9 GB
…5050-16506-1146226 FooStreaming dnode-2.hdfs.private 3 7.9 GB
…5050-16506-1144069 FooStreaming dnode-3.hdfs.private 1 8.7 GB
…5050-16506-1133091 FooStreaming dnode-5.hdfs.private 1 1.7 GB
…5050-16506-1133090 FooStreaming dnode-2.hdfs.private 5 5.2 GB
…5050-16506-1133089 FooStreaming dnode-1.hdfs.private 6.5 6.3 GB
…5050-16506-1133088 FooStreaming dnode-4.hdfs.private 1 251 MB
…5050-16506-1133087 FooStreaming dnode-0.hdfs.private 6.4 6.8 GB
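
(For context, a minimal sketch of how the two limits quoted above are
typically set through SparkConf; the application object and master URL below
are illustrative placeholders, not the actual job:)

import org.apache.spark.{SparkConf, SparkContext}

object FooStreamingApp {  // hypothetical app object for illustration
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("FooStreaming")              // framework name as listed in the Mesos UI
      .setMaster("mesos://dnode-master:5050")  // illustrative Mesos master URL
      .set("spark.cores.max", "4")             // total cores the job may claim cluster-wide
      .set("spark.executor.memory", "3g")      // memory requested per executor
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}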

The only way to release the resources is to manually find the process in the
cluster and kill it. The jobs are often streaming jobs, but batch jobs show
this behavior as well. We have more streaming jobs than batch jobs, so the
stats are biased.
Any ideas of what's going on here? Hopefully it's some bad, ugly bug that has
already been fixed and will urge us to upgrade our infra?

Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0

-kr, Gerard.


Spark (Streaming?) holding on to Mesos resources

2015-01-26 Thread Gerard Maas
Hi,

We are observing with some regularity that our Spark jobs, running as Mesos
frameworks, are hoarding resources and not releasing them, resulting in
resource starvation for all jobs running on the Mesos cluster.

For example:
This is a job that has spark.cores.max = 4 and spark.executor.memory="3g"

| ID | Framework | Host | CPUs | Mem
…5050-16506-1146497 FooStreaming dnode-4.hdfs.private 7 13.4 GB
…5050-16506-1146495 FooStreaming dnode-0.hdfs.private 1 6.4 GB
…5050-16506-1146491 FooStreaming dnode-5.hdfs.private 7 11.9 GB
…5050-16506-1146449 FooStreaming dnode-3.hdfs.private 7 4.9 GB
…5050-16506-1146247 FooStreaming dnode-1.hdfs.private 0.5 5.9 GB
…5050-16506-1146226 FooStreaming dnode-2.hdfs.private 3 7.9 GB
…5050-16506-1144069 FooStreaming dnode-3.hdfs.private 1 8.7 GB
…5050-16506-1133091 FooStreaming dnode-5.hdfs.private 1 1.7 GB
…5050-16506-1133090 FooStreaming dnode-2.hdfs.private 5 5.2 GB
…5050-16506-1133089 FooStreaming dnode-1.hdfs.private 6.5 6.3 GB
…5050-16506-1133088 FooStreaming dnode-4.hdfs.private 1 251 MB
…5050-16506-1133087 FooStreaming dnode-0.hdfs.private 6.4 6.8 GB
The only way to release the resources is to manually find the process in the
cluster and kill it. The jobs are often streaming jobs, but batch jobs show
this behavior as well. We have more streaming jobs than batch jobs, so the
stats are biased.
Any ideas of what's going on here? Hopefully it's some bad, ugly bug that has
already been fixed and will urge us to upgrade our infra?

Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0

-kr, Gerard.