Spark job killed

2015-09-01 Thread Silvio Bernardinello
Hi,

We are running Spark 1.4.0 on a Mesosphere cluster (~250 GB of memory across
16 active hosts).
Spark jobs are submitted in coarse-grained mode.
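
For context, "coarse-grained" means we set spark.mesos.coarse on the driver,
roughly like the minimal sketch below (not our exact job; the master URL and
core cap are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch of a coarse-grained Mesos setup (Spark 1.4).
    // Master URL and core cap are placeholders, not our real values.
    val conf = new SparkConf()
      .setAppName("our-job")
      .setMaster("mesos://mesos-master:5050") // placeholder Mesos master
      .set("spark.mesos.coarse", "true")      // coarse-grained mode
      .set("spark.cores.max", "16")           // illustrative cap on total cores
    val sc = new SparkContext(conf)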

Suddenly, our jobs get killed without any error message:

ip-10-0-2-193.us-west-2.compute.internal, PROCESS_LOCAL, 1514 bytes)
15/09/01 10:48:24 INFO TaskSetManager: Finished task 38047.0 in stage 0.0 (TID 38160) in 2856 ms on ip-10-0-0-203.us-west-2.compute.internal (38048/44617)
15/09/01 10:48:24 INFO TaskSetManager: Starting task 38056.0 in stage 0.0 (TID 38169, ip-10-0-0-204.us-west-2.compute.internal, PROCESS_LOCAL, 1514 bytes)
15/09/01 10:48:24 INFO TaskSetManager: Starting task 38057.0 in stage 0.0 (TID 38170, ip-10-0-0-204.us-west-2.compute.internal, PROCESS_LOCAL, 1514 bytes)
15/09/01 10:48:25 INFO TaskSetManager: Finished task 38048.0 in stage 0.0 (TID 38161) in 2290 ms on ip-10-0-2-194.us-west-2.compute.internal (38049/44617)
Killed

Where can we find additional information about this issue?

Thanks in advance

Silvio





Re: Spark Mesos task rescheduling

2015-07-09 Thread Silvio Bernardinello
Hi,

Thank you for confirming my doubts and for the link.
Yes, we actually run in fine-grained mode because we would like to
dynamically resize our cluster as needed (thank you for the dynamic
allocation link).

However, we tried coarse-grained mode and Mesos does not seem to relaunch
the failed task. Maybe there is a timeout before it is relaunched, but I'm
not aware of one.
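
For completeness, the only retry knob we know of on the Spark side is the
per-task failure budget, sketched below with an illustrative value; it is a
task-level retry count, not a slave-level relaunch timeout:

    import org.apache.spark.SparkConf

    // Sketch: task-level retry budget (Spark 1.4); the value is illustrative.
    val conf = new SparkConf()
      .set("spark.mesos.coarse", "true")   // coarse-grained mode, as above
      .set("spark.task.maxFailures", "8")  // task retries before the job is failed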



On Thu, Jul 9, 2015 at 5:13 PM, Iulian Dragoș wrote:

>
>
> On Thu, Jul 9, 2015 at 12:32 PM, besil wrote:
>
>> Hi,
>>
>> We are experiencing scheduling errors due to Mesos slaves failing.
>> It seems to be an open bug; more information can be found here:
>>
>> https://issues.apache.org/jira/browse/SPARK-3289
>>
>> According to this link
>> <
>> https://mail-archives.apache.org/mod_mbox/mesos-user/201310.mbox/%3ccaakwvaxprrnrcdlazcybnmk1_9elyheodaf8urf8ssrlbac...@mail.gmail.com%3E
>> >
>> from the mail archive, it seems that Spark doesn't reschedule LOST tasks
>> to active executors, but keeps trying to reschedule them on the failed host.
>>
>
> Are you running in fine-grained mode? In coarse-grained mode it seems that
> Spark will notice a slave that fails repeatedly and will stop accepting
> offers on that slave:
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala#L188
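
For anyone reading this in the archive, the gist of that check, loosely
paraphrased from the linked Scala (names and threshold approximate, not the
exact code):

    import scala.collection.mutable

    // Loose paraphrase of the coarse-grained backend's per-slave failure
    // tracking; see the linked CoarseMesosSchedulerBackend.scala for the
    // real implementation.
    object SlaveFailurePolicy {
      val MaxSlaveFailures = 2
      private val failuresBySlaveId = mutable.HashMap.empty[String, Int]

      // Called when an executor on a slave fails or is lost.
      def noteFailure(slaveId: String): Unit =
        failuresBySlaveId(slaveId) = failuresBySlaveId.getOrElse(slaveId, 0) + 1

      // Offers from slaves that have failed too often are declined.
      def acceptOffer(slaveId: String): Boolean =
        failuresBySlaveId.getOrElse(slaveId, 0) < MaxSlaveFailures
    }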
>
>
>>
>> We would like to dynamically resize our Mesos cluster (adding or removing
>> machines - using an AWS autoscaling group), but this bug kills our running
>> applications if a Mesos slave running a Spark executor is shut down.
>>
>
> I think what you need is dynamic allocation, which should be available
> soon (PR: 4984).
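
(For reference, once that lands, enabling dynamic allocation should look
roughly like the sketch below; it assumes the external shuffle service is
running on every slave, and the executor bounds are illustrative:)

    import org.apache.spark.SparkConf

    // Sketch: dynamic allocation settings; requires the external shuffle
    // service on each slave. Executor bounds are illustrative.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "16")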
>
>
>> Is there any known workaround?
>>
>> Thank you
>>
>>
>>
>
>
> --
> Iulian Dragos
> Reactive Apps on the JVM
> www.typesafe.com