Hi James,

Did you configure checkpointing [1] and a restart strategy [2] for your
job?
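
If not, here is a minimal sketch of how both can be set on the
StreamExecutionEnvironment (the checkpoint interval, attempt count, and
delay below are illustrative placeholders, not tuned recommendations):

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

    // Draw a checkpoint of the job state every 10 seconds (example value)
    // so a failed job can be restored instead of starting from scratch.
    env.enableCheckpointing(10000);

    // On failure, restart the job up to 3 times, waiting 10 seconds
    // between attempts (example values).
    env.setRestartStrategy(
            RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

The restart strategy can also be set cluster-wide in flink-conf.yaml with
"restart-strategy: fixed-delay" and the corresponding
"restart-strategy.fixed-delay.attempts" and
"restart-strategy.fixed-delay.delay" options.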

Best, Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/checkpointing.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/restart_strategies.html

2018-01-17 8:10 GMT+01:00 Data Engineer <dataenginee...@gmail.com>:

> Hi Nico,
>
> Thank you for your reply.
>
> I have configured each TaskManager with 24 available task slots. When both
> TaskManagers were running, I could see that a total of 8 task slots were
> being used.
> I also see that 24 task slots are available after one TaskManager goes
> down. I don't see any exception regarding available task slots.
>
> However, I get a java.net.ConnectException each time the JobManager tries
> to connect to the TaskManager that I have killed. It retries 3 times (the
> number I have set) and then the job fails.
> I expect the JobManager to move the workload to the remaining machine on
> which the TaskManager is still running. Or does it expect both TaskManagers
> to be up by the time it restarts?
>
> Regards,
> James
>
> On Tue, Jan 16, 2018 at 3:02 PM, Nico Kruber <n...@data-artisans.com>
> wrote:
>
>> Hi James,
>> In this scenario, with the restart strategy set, the job should restart
>> (without YARN/Mesos) as long as you have enough slots available.
>>
>> Can you check in the web interface at http://<jobmanager>:8081/ whether
>> enough slots are available after killing one TaskManager?
>>
>> Can you provide JobManager and TaskManager logs and some more details on
>> the job you are running?
>>
>>
>> Nico
>>
>> On 16/01/18 07:04, Data Engineer wrote:
>> > This question has been asked on StackOverflow:
>> > https://stackoverflow.com/questions/48262080/how-to-get-automatic-fail-over-working-in-flink
>> >
>> > I am using Apache Flink 1.4 on a cluster of 3 machines, of which one
>> > is the JobManager and the other 2 host TaskManagers.
>> >
>> > I start Flink in cluster mode and submit a Flink job. I have configured
>> > 24 task slots in the Flink config, and the job uses 6 task slots.
>> >
>> > When I submit the job, I see 3 tasks assigned to worker machine 1
>> > and 3 assigned to worker machine 2. Now, when I kill the TaskManager
>> > on worker machine 2, I see that the entire job fails.
>> >
>> > Is this the expected behaviour, or does Flink have automatic failover as
>> > in Spark?
>> >
>> > Do we need to use YARN/Mesos to achieve automatic failover?
>> >
>> > We tried the restart strategy, but when the job restarts we get an
>> > exception saying that no task slots are available, and then the job
>> > fails. We think that 24 slots are enough to take over. What could we be
>> > doing wrong here?
>> >
>> > Regards,
>> > James
>>
>>
>
