GlenGeng opened a new pull request #1373: URL: https://github.com/apache/hadoop-ozone/pull/1373
## What changes were proposed in this pull request?

The current RetryPolicy of the Datanode for SCM is retryForeverWithFixedSleep:

```
RetryPolicy retryPolicy =
    RetryPolicies.retryForeverWithFixedSleep(
        1000, TimeUnit.MILLISECONDS);
StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
    StorageContainerDatanodeProtocolPB.class, version,
    address, UserGroupInformation.getCurrentUser(), hadoopConfig,
    NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
    retryPolicy).getProxy();
```

The one for Recon is retryUpToMaximumCountWithFixedSleep:

```
RetryPolicy retryPolicy =
    RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
        60000, TimeUnit.MILLISECONDS);
ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
    ReconDatanodeProtocolPB.class, version,
    address, UserGroupInformation.getCurrentUser(), hadoopConfig,
    NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
    retryPolicy).getProxy();
```

The executorService in DatanodeStateMachine is Executors.newFixedThreadPool(...), whose default pool size is 2: one thread for Recon, the other for SCM.

When an RPC failure occurs, call() of RegisterEndpointTask, VersionEndpointTask and HeartbeatEndpointTask retries while holding rpcEndpoint.lock(). For example:

```
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();

  try {
    ....
    SCMHeartbeatResponseProto reponse = rpcEndpoint.getEndPoint()
        .sendHeartbeat(request);
    ....
  } finally {
    rpcEndpoint.unlock();
  }

  return rpcEndpoint.getState();
}
```

**The problem is:**
With one Recon and one SCM set up, shutting down the Recon server makes all Datanodes become stale/dead on the SCM side very soon.

**The root cause is:**
The thread running the Recon task keeps retrying because of the RPC failure, and meanwhile holds the lock of the EndpointStateMachine for Recon. When DatanodeStateMachine schedules the next round of SCM/Recon tasks, the only remaining thread is assigned to the Recon task and blocks waiting for the lock of the EndpointStateMachine for Recon.
```
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();
  ...
```

With both pool threads occupied by Recon tasks, no thread is left to run the SCM task.

**The solution is:**
Since DatanodeStateMachine periodically schedules SCM/Recon tasks, we can adjust the RetryPolicy so that it won't retry for longer than 1 min.

The change has no side effect:
1) VersionEndpointTask.call() is fine.
2) RegisterEndpointTask.call() will query containerReport, nodeReport and pipelineReports from OzoneContainer, which is fine.
3) HeartbeatEndpointTask.call() will putBackReports(), which is fine.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4186

## How was this patch tested?

CI
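
For reviewers who want to see the root cause in isolation, here is a minimal standalone sketch of the starvation described above. It uses only java.util.concurrent and is not the Ozone code: the class name StarvationSketch, the method scmHeartbeatStarves, and the latches that stand in for a long RPC retry are all hypothetical names chosen for illustration. A two-thread fixed pool runs one "Recon" task that holds its endpoint lock while "retrying"; the next round's Recon task then blocks on the same lock, and the SCM task stays queued.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names exist in Ozone.
final class StarvationSketch {

  /**
   * Returns true if the second-round "SCM" task could not finish within
   * waitMillis while the first-round "Recon" task was still retrying.
   */
  static boolean scmHeartbeatStarves(long waitMillis) throws Exception {
    // Same shape as the pool described above: two threads in total.
    ExecutorService pool = Executors.newFixedThreadPool(2);
    ReentrantLock reconLock = new ReentrantLock();
    CountDownLatch reconLockHeld = new CountDownLatch(1);
    // Stands in for a retry loop that does not return for a long time.
    CountDownLatch reconRetryDone = new CountDownLatch(1);
    try {
      // Round 1: the Recon task takes its lock and then "retries".
      Future<?> recon1 = pool.submit(() -> {
        reconLock.lock();
        try {
          reconLockHeld.countDown();
          reconRetryDone.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          reconLock.unlock();
        }
      });
      // Round 1: the SCM task succeeds immediately.
      Future<?> scm1 = pool.submit(() -> { });
      reconLockHeld.await();
      scm1.get();

      // Round 2: the free thread picks up the Recon task and blocks on the lock.
      Future<?> recon2 = pool.submit(() -> {
        reconLock.lock();
        reconLock.unlock();
      });
      // Round 2: the SCM task is queued, but no thread is left to run it.
      Future<?> scm2 = pool.submit(() -> { });

      boolean starved;
      try {
        scm2.get(waitMillis, TimeUnit.MILLISECONDS);
        starved = false;
      } catch (TimeoutException e) {
        starved = true;
      }

      // End the simulated retry so every task can finish.
      reconRetryDone.countDown();
      recon1.get();
      recon2.get();
      scm2.get();
      return starved;
    } finally {
      reconRetryDone.countDown();
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("SCM task starved: " + scmHeartbeatStarves(300));
  }
}
```

In this sketch the SCM task only runs once the simulated retry ends, which is why bounding the retry duration lets the periodically scheduled SCM task get a thread again.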