Hey Morgan,

Is it possible for you to provide us with the full logs of the JobManager and the affected TaskManager? This might give us a hint as to why the number of task slots is zero.
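In case it helps, here is a minimal sketch for collecting those logs with kubectl. It assumes the pods carry the label app=flink (as in the example manifests from the Kubernetes setup docs) and that Flink logs to the container's stdout; the function name, default namespace, and output directory are made up for illustration, so adjust them to your setup.

```shell
#!/bin/sh
# Sketch: dump the logs of every Flink pod in a namespace, one file per pod.
# Assumption: pods are labelled app=flink; change the selector if yours differ.
collect_flink_logs() {
  ns="${1:-default}"        # Kubernetes namespace (hypothetical default)
  out="${2:-flink-logs}"    # output directory (hypothetical name)
  mkdir -p "$out" || return 1
  # "-o name" prints entries such as "pod/flink-taskmanager-abc12"
  for pod in $(kubectl get pods -n "$ns" -l app=flink -o name); do
    # ${pod##*/} strips the "pod/" prefix so it can be used as a file name
    kubectl logs -n "$ns" "$pod" > "$out/${pod##*/}.log"
  done
}
```

Calling `collect_flink_logs default ./flink-logs` would then leave one .log file per pod in ./flink-logs. If a container has been restarted in the meantime, `kubectl logs --previous` on that pod shows the output of the earlier instance.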
Best,
Robert

On Tue, May 5, 2020 at 11:41 AM Morgan Geldenhuys <[email protected]> wrote:

> Community,
>
> I am currently doing some fault tolerance testing for Flink (1.10) running
> on Kubernetes (1.18) and am encountering an error where, after a running job
> experiences a failure, the job fails completely.
>
> A Flink session cluster has been created according to the documentation
> contained here:
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html.
> The job is then uploaded and deployed via the web interface and everything
> runs smoothly. The job has a parallelism of 24, with 3 worker nodes held in
> reserve as failovers. Each worker is assigned 1 task slot (27 in total).
>
> The next step is to inject an error, for which I use the Pumba Chaos
> Testing tool (https://github.com/alexei-led/pumba) to pause a random
> worker process. This selection and pausing is done manually for the moment.
>
> Looking at the error logs, Flink does detect the error after the timeout
> (the heartbeat timeout has been set to 20 seconds):
>
> java.util.concurrent.TimeoutException: The heartbeat of TaskManager with
> id 768848f91ebdbccc8d518e910160414d timed out.
>
> After the failure has been detected, the system resets to the latest saved
> checkpoint and restarts. The system catches up nicely and resumes normal
> processing... however, after about 3 minutes, the following error occurs:
>
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Connection unexpectedly closed by remote task manager '/10.45.128.1:6121'.
> This might indicate that the remote task manager was lost.
>
> The job fails and is unable to restart because the number of task slots
> has been reduced to zero. Looking at the Kubernetes cluster, all containers
> are running...
>
> Has anyone else run into this error? What am I missing? The same thing
> happens when the containers are deleted.
>
> Regards,
> M.
