[ 
https://issues.apache.org/jira/browse/HBASE-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264952#comment-17264952
 ] 

Duo Zhang edited comment on HBASE-25505 at 1/14/21, 3:18 PM:
-------------------------------------------------------------

I think I found a possible cause of this situation, especially now that
[~larsh] has confirmed that the unclosed ZKWatcher is in ReplicationLogCleaner.

Before HBASE-23340, ReplicationLogCleaner creates its own ZKWatcher, and we
close it in the stop method, which is called from the cleanup method of
CleanerChore.

Here is the problem. The cleanup method of a ScheduledChore is only called
from the run method. So if you just call stop on the stopper instance that was
passed to the ScheduledChore when it was created, everything will be fine,
since the next run notices the stop and calls cleanup. But in
HMaster.stopChore, we use ChoreService.cancelChore to stop the ScheduledChore.
So if the ScheduledChore is not scheduled again after we set stopped to true
for HMaster and before we call cancelChore (I am not even sure that setting
stopped to true happens before we call cancelChore...), the cleanup method
will never be executed. And this is likely the case, as the default schedule
interval is 10 minutes...

I added a UT in the uploaded patch to show that calling cancelChore does not
result in a call to cleanup.
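
For readers without the patch at hand, the effect can be modeled without HBase
at all. The sketch below is not the UT from the patch and does not use the
HBase classes: ModelChore, cleanupCallsAfterCancel and the AtomicBoolean
stopper are hypothetical stand-ins, and a plain ScheduledExecutorService plays
the role of ChoreService. It only assumes what is described above, namely that
cleanup is reachable only from run.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class CancelSkipsCleanupSketch {

  /** Hypothetical stand-in for a chore: cleanup is only reachable from run(). */
  static class ModelChore implements Runnable {
    final AtomicBoolean stopped;  // plays the role of the stopper
    final AtomicInteger cleanupCalls = new AtomicInteger();

    ModelChore(AtomicBoolean stopped) {
      this.stopped = stopped;
    }

    @Override
    public void run() {
      if (stopped.get()) {
        cleanup();
      }
      // otherwise the normal periodic work would happen here
    }

    void cleanup() {
      cleanupCalls.incrementAndGet();
    }
  }

  /** Sets the stop flag, cancels the chore, and reports how often cleanup ran. */
  static int cleanupCallsAfterCancel() {
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
    AtomicBoolean stopped = new AtomicBoolean(false);
    ModelChore chore = new ModelChore(stopped);
    // Long delay and period, mimicking the 10 minute schedule interval.
    ScheduledFuture<?> future =
        pool.scheduleAtFixedRate(chore, 10, 10, TimeUnit.MINUTES);

    stopped.set(true);     // "set stopped to true"
    future.cancel(false);  // "cancelChore": run() never gets another turn

    pool.shutdown();
    try {
      pool.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return chore.cleanupCalls.get();
  }

  public static void main(String[] args) {
    System.out.println("cleanup calls: " + cleanupCallsAfterCancel());
  }
}
```

In this model the count stays at zero, because the only path to cleanup goes
through a run that is never scheduled again.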

I think a possible fix is to also call cleanup in the cancelChore method of
ChoreService. We just need to add a comment saying that implementations should
make sure the method can be called multiple times without side effects.
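
As a rough illustration of what "can be called multiple times" could look like
in an implementation, here is a minimal sketch. It is not the proposed patch:
ModelChore, cancelChore and resourceCloses are hypothetical names, and the
counter merely stands in for closing a resource such as a ZKWatcher. A
compare-and-set guard is one common way to get this property; the actual fix
may do it differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class IdempotentCleanupSketch {

  /** Hypothetical chore whose cleanup releases its resource at most once. */
  static class ModelChore {
    private final AtomicBoolean cleanedUp = new AtomicBoolean(false);
    final AtomicInteger resourceCloses = new AtomicInteger();

    /** Safe to call repeatedly: only the first call closes the resource. */
    void cleanup() {
      if (cleanedUp.compareAndSet(false, true)) {
        resourceCloses.incrementAndGet();  // stands in for closing a ZKWatcher
      }
    }
  }

  /** Hypothetical cancel path that also calls cleanup. */
  static void cancelChore(ModelChore chore) {
    // Cancelling the scheduled future is omitted in this sketch.
    chore.cleanup();
  }

  /** Calls cleanup from both the cancel path and a late run path. */
  static int closesAfterCancelAndRun() {
    ModelChore chore = new ModelChore();
    cancelChore(chore);
    chore.cleanup();  // e.g. a late run() that also notices the stop
    return chore.resourceCloses.get();
  }

  public static void main(String[] args) {
    System.out.println("resource closes: " + closesAfterCancelAndRun());
  }
}
```

With the guard in place, the resource is closed once no matter whether the
cancel path, the run path, or both end up calling cleanup.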

Thanks.


> ZK watcher threads are daemonized; reconsider
> ---------------------------------------------
>
>                 Key: HBASE-25505
>                 URL: https://issues.apache.org/jira/browse/HBASE-25505
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>         Attachments: ScheduledChore.cleanup-not-called.diff
>
>
> On HBASE-25279 there was some discussion and difference of opinion about 
> having ZK watcher pool threads be daemonized. This is not necessarily a 
> problem but should be reconsidered. 
> Daemon threads are subject to abrupt termination during JVM shutdown and 
> therefore may be interrupted before state changes are complete or resources 
> are released. 
> As long as ZK watchers are properly closed by shutdown logic the pool threads 
> will be terminated in a controlled manner and the JVM will exit. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
