David Moravek created FLINK-13958:
-------------------------------------

             Summary: Job class loader may not be reused after batch job recovery
                 Key: FLINK-13958
                 URL: https://issues.apache.org/jira/browse/FLINK-13958
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Task
    Affects Versions: 1.9.0
            Reporter: David Moravek


[https://lists.apache.org/thread.html/e241be9a1a10810a1203786dff3b7386d265fbe8702815a77bad42eb@%3Cdev.flink.apache.org%3E|http://example.com]

1) We have a per-job Flink cluster
2) We use BATCH execution mode + region failover strategy

Point 1) should imply a single user-code class loader per task manager (because 
there is only a single pipeline, which reuses the class loader cached in 
BlobLibraryCacheManager). We need this property because we have UDFs that 
access C libraries using JNI (I think this may be a fairly common use case when 
dealing with legacy code). [JDK 
internals|https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466]
 ensure that a given native library can be loaded by only one class loader per 
JVM.

When region recovery is triggered, vertices that need to recover are first reset 
back to the CREATED state and then rescheduled. If all tasks in a task 
manager are reset, this results in the [cached class loader being 
released|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338].
 Unfortunately, this causes a job failure, because we then try to reload the 
native library in a newly created class loader.

I believe the correct approach would be not to release the cached class loader 
while the job is recovering, even if no tasks are currently registered with the 
TM.
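
To illustrate the idea, here is a minimal sketch of a reference-counted class loader cache that keeps its entry while a job is marked as recovering. It is a simplified, hypothetical model and not the actual BlobLibraryCacheManager code: the class name RecoveryAwareClassLoaderCache, the methods markRecovering/markRecovered, and the use of plain strings as job and task IDs are all assumptions made for this example.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of a per-job class loader cache (not Flink code). */
final class RecoveryAwareClassLoaderCache {

    /** One cached class loader plus the IDs of the tasks currently using it. */
    private static final class Entry {
        final ClassLoader classLoader;
        final Set<String> taskIds = new HashSet<>();

        Entry(ClassLoader classLoader) {
            this.classLoader = classLoader;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final Set<String> recoveringJobs = new HashSet<>();

    /** Returns the cached class loader for the job, creating it on first use. */
    synchronized ClassLoader registerTask(String jobId, String taskId) {
        Entry entry = entries.computeIfAbsent(
                jobId,
                id -> new Entry(new URLClassLoader(new URL[0], getClass().getClassLoader())));
        entry.taskIds.add(taskId);
        return entry.classLoader;
    }

    /** Drops the task; releases the class loader only if the job is not recovering. */
    synchronized void unregisterTask(String jobId, String taskId) {
        Entry entry = entries.get(jobId);
        if (entry == null) {
            return;
        }
        entry.taskIds.remove(taskId);
        if (entry.taskIds.isEmpty() && !recoveringJobs.contains(jobId)) {
            entries.remove(jobId);
        }
    }

    /** Assumed hook: called when recovery starts, before tasks are reset. */
    synchronized void markRecovering(String jobId) {
        recoveringJobs.add(jobId);
    }

    /** Assumed hook: called when recovery ends; releases an entry no task uses. */
    synchronized void markRecovered(String jobId) {
        recoveringJobs.remove(jobId);
        Entry entry = entries.get(jobId);
        if (entry != null && entry.taskIds.isEmpty()) {
            entries.remove(jobId);
        }
    }
}
```

In this model, a task that re-registers during recovery gets back the same class loader instance, so a native library already loaded through it would not need to be loaded again. How the TM would learn that a job is recovering is left open here.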



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
