Zhilong Hong created FLINK-21201:
------------------------------------

             Summary: Creating BoundedBlockingSubpartition blocks TaskManager’s main thread
                 Key: FLINK-21201
                 URL: https://issues.apache.org/jira/browse/FLINK-21201
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.12.1
            Reporter: Zhilong Hong
         Attachments: jobmanager.log.tar.gz, taskmanager.log.tar.gz

When we run batch jobs with a parallelism of 8k, deploying the vertices takes 
a long time. After investigating, we found that creating 
BoundedBlockingSubpartitions blocks the TaskManager’s main thread during 
{{submitTask}}. 

When the JobMaster invokes {{submitTask}}, it sends an RPC call to the 
TaskManager, which executes the {{submitTask}} method in its main thread. In 
{{submitTask}}, the TaskExecutor creates a Task instance and tries to start 
it. While creating the Task, it also creates the ResultPartition and its 
ResultSubpartitions. 

For batch jobs, the ResultSubpartitions are BoundedBlockingSubpartitions 
backed by FileChannelBoundedData. Each BoundedBlockingSubpartition creates a 
file on the local disk, which is an IO operation and can take a long time. 

In our test, it took up to 28 seconds to create 8k 
BoundedBlockingSubpartitions. This blocks the TaskManager’s main thread and 
leads to heartbeat timeouts and slow task deployment. In my opinion, the IO 
operation should be executed by the IOExecutor rather than in the main thread. 
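
To illustrate the idea, here is a minimal, self-contained sketch of moving 
file creation off a single-threaded "main thread" onto a separate IO pool. It 
is not Flink code: the class name {{OffloadedFileCreation}}, the method 
{{createSubpartitionFiles}}, the file naming scheme and the pool sizes are all 
hypothetical, and plain {{java.util.concurrent}} executors stand in for the 
TaskManager's main thread and IOExecutor. 

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch; none of these names are Flink APIs. */
public class OffloadedFileCreation {

    /** Creates one backing file per subpartition on the given IO executor. */
    static CompletableFuture<List<Path>> createSubpartitionFiles(
            Path dir, int numSubpartitions, ExecutorService ioExecutor) {
        return CompletableFuture.supplyAsync(() -> {
            List<Path> files = new ArrayList<>(numSubpartitions);
            try {
                for (int i = 0; i < numSubpartitions; i++) {
                    // The blocking disk operation runs on an IO thread.
                    files.add(Files.createFile(dir.resolve("subpartition-" + i + ".data")));
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return files;
        }, ioExecutor);
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the single-threaded main thread and the IO pool.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        Path dir = Files.createTempDirectory("partitions");
        try {
            // File creation happens on the IO pool; only the cheap
            // follow-up step is scheduled back onto the main thread.
            CompletableFuture<Integer> created =
                    createSubpartitionFiles(dir, 100, ioExecutor)
                            .thenApplyAsync(List::size, mainThread);

            // The main thread stays free for other work, such as heartbeats.
            CompletableFuture<String> heartbeat =
                    CompletableFuture.supplyAsync(() -> "heartbeat handled", mainThread);

            System.out.println(heartbeat.get());
            System.out.println("files created: " + created.get());
        } finally {
            mainThread.shutdown();
            ioExecutor.shutdown();
        }
    }
}
```

In this sketch the main thread only schedules the work and handles the 
result, so a slow disk delays task setup but not heartbeat handling. Whether 
the same split fits the actual {{submitTask}} code path would need to be 
checked against the Flink sources. 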

The JobManager and TaskManager logs are attached. A typical task is 
Source 0: #898.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
