Stephan Ewen created FLINK-16225:
------------------------------------

             Summary: Metaspace Out Of Memory should be handled as Fatal Error 
in TaskManager
                 Key: FLINK-16225
                 URL: https://issues.apache.org/jira/browse/FLINK-16225
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Task
            Reporter: Stephan Ewen
             Fix For: 1.11.0


When an {{OutOfMemory (Metaspace)}} exception happens, there is usually no way 
to recover. This is often the result of user code or libraries that have subtle 
class loading leaks.

The one way to recover is to kill the TaskManagers and to let the resource 
orchestrators (K8s, Yarn, Mesos) restart them. Flink's fault tolerance should 
then be able to recover the job.

I would suggest to implement this the following way:
* The user code ClassLoader takes an "OOM Handler", which is called when class 
loading causes an OOM exception.
* The handler wraps this into an Exception with a good error message (see 
below) and invokes the TaskManager's {{FatalErrorHandler}}.
* The {{FatalErrorHandler}} in turn should attempt to cancel everything and 
notify the JM before shutting down. That way, we get decent error reporting and 
users can see what is going on.


The error message should describe the following:
* If user sees the error consistently on the first deploy, then the metaspace 
is simply too small for their application, and they need to explicitly increase 
it in the configuration
* If the user sees occasionally TaskManagers in a session cluster failing with 
that exception when deploying new jobs, then some user code or library probably 
has a class leak. The TM failure / restart is done in order to forcefully clean 
up.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to