Re: Cluster execution - Jobmanager unreachable

2015-02-11 Thread Till Rohrmann
Great to hear.

On Wed, Feb 11, 2015 at 12:29 PM, Chesnay Schepler wrote:
> Works now :) Thank you for your help.
>
> On 11.02.2015 11:39, Till Rohrmann wrote:
>> I found the error. Due to some refactoring, a wrong message was sent to the
>> JobManager in the JobManagerInfoServlet.java. I

Re: Cluster execution - Jobmanager unreachable

2015-02-11 Thread Chesnay Schepler
Works now :) Thank you for your help.

On 11.02.2015 11:39, Till Rohrmann wrote:
> I found the error. Due to some refactoring, a wrong message was sent to the
> JobManager in the JobManagerInfoServlet.java. I pushed a fix. Could you try it out again?
>
> On Wed, Feb 11, 2015 at 11:34 AM, Till Rohrmann

Re: Cluster execution - Jobmanager unreachable

2015-02-11 Thread Till Rohrmann
I found the error. Due to some refactoring, a wrong message was sent to the JobManager in the JobManagerInfoServlet.java. I pushed a fix. Could you try it out again?

On Wed, Feb 11, 2015 at 11:34 AM, Till Rohrmann wrote:
> Could you check the rebasing because it seems as if the web server is now

Re: Cluster execution - Jobmanager unreachable

2015-02-11 Thread Till Rohrmann
Could you check the rebasing? It seems as if the web server is now sending RequestArchivedJobs messages to the JobManager, which should not happen. These messages should go directly to the MemoryArchivist. The corresponding file is JobManagerInfoServlet.java, I think.

On Wed, Feb 11, 2015 at

Re: Cluster execution - Jobmanager unreachable

2015-02-11 Thread Chesnay Schepler
I just tried Till's fix, rebased to the latest master and got a whole lot of these exceptions right away:

java.lang.Exception: The slot in which the task was scheduled has been killed (probably loss of TaskManager).
    at org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98)

Re: Cluster execution - Jobmanager unreachable

2015-02-11 Thread Ufuk Celebi
Chesnay, could you try this again with Till's fix: https://github.com/apache/flink/pull/378

The changes look good and I would like to merge it asap, but it would be nice to double check with your problem. I will also run some tests.

– Ufuk

On 05 Feb 2015, at 10:42, Stephan Ewen wrote:
> I su

Re: Cluster execution - Jobmanager unreachable

2015-02-05 Thread Till Rohrmann
I checked and indeed the scheduleOrUpdateConsumers method can throw an IllegalStateException without properly handling such an exception on the JobManager level. It is a design decision of Scala not to complain about unhandled exceptions which are otherwise properly annotated in Java code. We shou

Re: Cluster execution - Jobmanager unreachable

2015-02-05 Thread Stephan Ewen
I suspect that this is one of the cases where an exception in an actor causes the actor to die (here the job manager).

On Thu, Feb 5, 2015 at 10:40 AM, Till Rohrmann wrote:
> It looks to me that the TaskManager does not receive a
> ConsumerNotificationResult after having sent the ScheduleOrUpdate

Re: Cluster execution - Jobmanager unreachable

2015-02-05 Thread Till Rohrmann
It looks to me that the TaskManager does not receive a ConsumerNotificationResult after having sent the ScheduleOrUpdateConsumers message. This can either mean that something went wrong in the ExecutionGraph.scheduleOrUpdateConsumers method or that the connection was disassociated for some reason. The logs

Re: Cluster execution - Jobmanager unreachable

2015-02-05 Thread Ufuk Celebi
Hey Chesnay,

I will look into it. Can you share the complete logs?

– Ufuk

On 04 Feb 2015, at 14:49, Chesnay Schepler wrote:
> Hello,
>
> I'm trying to run python jobs with the latest master on a cluster and get the
> following exception:
>
> Error: The program execution failed: JobManager

Re: Cluster execution - Jobmanager unreachable

2015-02-05 Thread Stephan Ewen
Hey!

The Akka communication is not fully stable in the current snapshot master. We are working on this. The Buffer recycled exception is probably an artifact of the cancelling.

Stephan

On Wed, Feb 4, 2015 at 2:49 PM, Chesnay Schepler <chesnay.schep...@fu-berlin.de> wrote:
> Hello,
>
> I'm try

Cluster execution - Jobmanager unreachable

2015-02-04 Thread Chesnay Schepler
Hello,

I'm trying to run python jobs with the latest master on a cluster and get the following exception:

Error: The program execution failed: JobManager not reachable anymore. Terminate waiting for job answer.
org.apache.flink.client.program.ProgramInvocationException: The program execution