Hey Morgan,

Could you share the full logs of the JobManager and the affected
TaskManager?
They might give us a hint as to why the number of task slots is zero.
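
In case it helps while collecting the logs, here is a minimal sketch for checking how many slots the JobManager currently sees, using the `/overview` endpoint of the Flink REST API. It assumes the REST port is reachable at `http://localhost:8081` (for example through a port-forward); that URL and the helper names `fetch_overview` and `slot_summary` are placeholders of mine, not part of your setup.

```python
import json
from urllib.request import urlopen

# Assumption: the JobManager REST endpoint is reachable at this address,
# e.g. through a port-forward. Adjust host and port to your cluster.
REST_URL = "http://localhost:8081"


def fetch_overview(base_url: str = REST_URL) -> dict:
    """Fetch the cluster overview from the Flink REST API."""
    with urlopen(f"{base_url}/overview", timeout=10) as resp:
        return json.load(resp)


def slot_summary(overview: dict, parallelism: int) -> dict:
    """Summarize slot counts from an /overview response.

    `parallelism` is the job parallelism (24 in the setup described below).
    The check assumes default slot sharing, so the job needs at least
    `parallelism` registered slots to be rescheduled.
    """
    total = overview.get("slots-total", 0)
    return {
        "taskmanagers": overview.get("taskmanagers", 0),
        "slots_total": total,
        "slots_available": overview.get("slots-available", 0),
        "enough_slots_registered": total >= parallelism,
    }
```

Calling `slot_summary(fetch_overview(), 24)` right after the heartbeat timeout and again after the second failure should show whether the TaskManagers deregister (the `taskmanagers` count drops) or stay registered but report no slots.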

Best,
Robert


On Tue, May 5, 2020 at 11:41 AM Morgan Geldenhuys <
[email protected]> wrote:

>
> Community,
>
> I am currently doing some fault tolerance testing for Flink (1.10) running
> on Kubernetes (1.18) and am encountering a problem where, after a running
> job experiences a failure, the job fails completely.
>
> A Flink session cluster has been created according to the documentation
> contained here:
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html.
> The job is then uploaded and deployed via the web interface and everything
> runs smoothly. The job has a parallelism of 24, with 3 additional worker
> nodes held in reserve for failover. Each worker is assigned 1 task slot
> (27 in total).
>
> The next step is to inject a fault, for which I use the Pumba chaos
> testing tool (https://github.com/alexei-led/pumba) to pause a random
> worker process. The selection and pausing are done manually for the moment.
>
> Looking at the error logs, Flink does detect the failure after the timeout
> (the heartbeat timeout has been set to 20 seconds):
>
> java.util.concurrent.TimeoutException: The heartbeat of TaskManager with
> id 768848f91ebdbccc8d518e910160414d  timed out.
>
> After the failure has been detected, the system resets to the latest saved
> checkpoint and restarts. The system catches up nicely and resumes normal
> processing... however, after about 3 minutes, the following error occurs:
>
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Connection unexpectedly closed by remote task manager '/10.45.128.1:6121'.
> This might indicate that the remote task manager was lost.
>
> The job fails and is unable to restart because the number of task slots
> has been reduced to zero. Looking at the Kubernetes cluster, all containers
> are running...
>
> Has anyone else run into this error? What am I missing? The same thing
> happens when the containers are deleted.
>
> Regards,
> M.
