Hi Jörn,

A memory leak on the job would be contained within the resources reserved
for it, wouldn't it?
And the job holding resources is not always the same. Sometimes it's one of
the Streaming jobs, sometimes it's a heavy batch job that runs every hour.
It looks to me like whatever is causing the issue is participating in the
resource offer protocol of Mesos, and my first suspect would be the Mesos
scheduler in Spark. (The table in my quoted message below is the "Offers" tab
from the Mesos UI.)
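
To illustrate what I mean by "participating in the offer protocol": every
framework scheduler gets resourceOffers() callbacks from the Mesos master and
has to either launch tasks on an offer or decline it. A rough sketch of that
callback (illustrative only, not the actual Spark code; wantResources and
buildTasks are placeholders for whatever policy the framework applies):

import scala.collection.JavaConverters._
import org.apache.mesos.SchedulerDriver
import org.apache.mesos.Protos.{Offer, TaskInfo}

// Illustrative only -- not the actual Spark scheduler code.
// wantResources and buildTasks stand in for the framework's real policy.
object OfferHandlingSketch {

  def wantResources(offer: Offer): Boolean = false        // placeholder policy
  def buildTasks(offer: Offer): java.util.List[TaskInfo] =
    java.util.Collections.emptyList[TaskInfo]()            // placeholder tasks

  // Mirrors the Scheduler.resourceOffers callback that every Mesos
  // framework scheduler implements.
  def resourceOffers(driver: SchedulerDriver, offers: java.util.List[Offer]): Unit = {
    for (offer <- offers.asScala) {
      if (wantResources(offer)) {
        // Accepting: the resources packed into the launched tasks stay
        // allocated to this framework until those tasks terminate.
        driver.launchTasks(offer.getId, buildTasks(offer))
      } else {
        // Declining hands the resources back to the Mesos allocator so
        // they can be offered to other frameworks.
        driver.declineOffer(offer.getId)
      }
    }
  }
}

As far as I understand, resources tied up in an outstanding offer are not
re-offered to other frameworks until the offer is declined or rescinded, so a
scheduler that accepts offers, or simply sits on them, without ever releasing
the resources would starve everything else on the cluster.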

Are there any other factors involved in the offer acceptance/rejection
between Mesos and a scheduler?

What do you think?

-kr, Gerard.

On Mon, Jan 26, 2015 at 11:23 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Hi,
>
> What do your jobs do? Ideally post the source code, but even a short
> description would already be helpful so we can support you.
>
> Memory leaks can have several causes - it may not be Spark at all.
>
> Thank you.
>
> On 26 Jan 2015 at 22:28, "Gerard Maas" <gerard.m...@gmail.com> wrote:
>
> >
> > (It looks like the list didn't like the HTML table in the previous email.
> > My apologies for any duplicates.)
> >
> > Hi,
> >
> > We are observing with some regularity that our Spark jobs, running as
> > Mesos frameworks, are hoarding resources and not releasing them, resulting
> > in resource starvation for all jobs running on the Mesos cluster.
> >
> > For example:
> > This is a job that has spark.cores.max=4 and spark.executor.memory=3g:
> >
> > | ID                  | Framework    | Host                 | CPUs | Mem     |
> > | …5050-16506-1146497 | FooStreaming | dnode-4.hdfs.private | 7    | 13.4 GB |
> > | …5050-16506-1146495 | FooStreaming | dnode-0.hdfs.private | 1    | 6.4 GB  |
> > | …5050-16506-1146491 | FooStreaming | dnode-5.hdfs.private | 7    | 11.9 GB |
> > | …5050-16506-1146449 | FooStreaming | dnode-3.hdfs.private | 7    | 4.9 GB  |
> > | …5050-16506-1146247 | FooStreaming | dnode-1.hdfs.private | 0.5  | 5.9 GB  |
> > | …5050-16506-1146226 | FooStreaming | dnode-2.hdfs.private | 3    | 7.9 GB  |
> > | …5050-16506-1144069 | FooStreaming | dnode-3.hdfs.private | 1    | 8.7 GB  |
> > | …5050-16506-1133091 | FooStreaming | dnode-5.hdfs.private | 1    | 1.7 GB  |
> > | …5050-16506-1133090 | FooStreaming | dnode-2.hdfs.private | 5    | 5.2 GB  |
> > | …5050-16506-1133089 | FooStreaming | dnode-1.hdfs.private | 6.5  | 6.3 GB  |
> > | …5050-16506-1133088 | FooStreaming | dnode-4.hdfs.private | 1    | 251 MB  |
> > | …5050-16506-1133087 | FooStreaming | dnode-0.hdfs.private | 6.4  | 6.8 GB  |
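> >
> > For completeness, those settings are applied on the driver roughly like
> > this (illustrative sketch; the app name and batch interval are made up):
> >
> > import org.apache.spark.SparkConf
> > import org.apache.spark.streaming.{Seconds, StreamingContext}
> >
> > // Driver-side configuration matching the job above.
> > val conf = new SparkConf()
> >   .setAppName("FooStreaming")           // placeholder app name
> >   .set("spark.cores.max", "4")          // total cores the app should take
> >   .set("spark.executor.memory", "3g")   // memory per executor
> > val ssc = new StreamingContext(conf, Seconds(10))  // batch interval is a placeholder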
> >
> > The only way to release the resources is to manually find the process in
> > the cluster and kill it. The affected jobs are usually streaming jobs, but
> > batch jobs show this behavior as well. We have more streaming jobs than
> > batch jobs, so the stats are biased.
> > Any ideas of what's going on here? Hopefully it's some very bad, ugly bug
> > that has already been fixed and that will push us to upgrade our infra.
> >
> > Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0
> >
> > -kr, Gerard.
>
>
