[ https://issues.apache.org/jira/browse/HBASE-23694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang resolved HBASE-23694. ------------------------------- Hadoop Flags: Reviewed Resolution: Fixed Pushed to master and branch-2. Thanks [~wchevreuil] for fixing this. > After RegionProcedureStore completes migration of WALProcedureStore, still > running WALProcedureStore.syncThread keeps trying to delete now inexistent > log files. > ------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > Key: HBASE-23694 > URL: https://issues.apache.org/jira/browse/HBASE-23694 > Project: HBase > Issue Type: Bug > Components: proc-v2, RegionProcedureStore > Affects Versions: 3.0.0 > Reporter: Wellington Chevreuil > Assignee: Wellington Chevreuil > Priority: Major > Fix For: 3.0.0, 2.3.0 > > > With the introduction of the new procedure store to save procs on a table, > instead of proc WALs, we added some migration logic to enable a smooth > upgrade of clusters under previous versions that used to have proc wal files > as the underlying procedure store. > While running a fresh install with latest master branch version, I had > noticed that once the old _WALProcedureStore_ directory is detected, it > creates and start a WPS instance > [here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/procedure2/store/region/RegionProcedureStore.java#L311], > passing a built-in _ProcedureLoader_ > [here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/procedure2/store/region/RegionProcedureStore.java#L315] > to perform the conversion from proc wal files to table based. > At the end of this conversion, the builtin loader deletes the proc wal dir > [here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/procedure2/store/region/RegionProcedureStore.java#L363]. > This may cause a race condition where a running _WALProcedureStore_ internal > thread may be trying to delete the first proc wal file that may had already > rolled, then failing with FNFE. > Even worse is that _WALProcedureStore.syncLoop_ keeps trying indefinitely, > flooding Master logs with messages such as below: > {noformat} > 2020-01-14 06:13:23,331 INFO [master/hbase01:16000:becomeActiveMaster] > region.RegionProcedureStore: The old WALProcedureStore wal directory > file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs exists, migrating... > 2020-01-14 06:13:23,361 DEBUG [master/hbase01:16000:becomeActiveMaster] > wal.WALProcedureStore: Starting WAL Procedure Store lease recovery > 2020-01-14 06:13:23,453 INFO [master/hbase01:16000:becomeActiveMaster] > wal.WALProcedureStore: Rolled new Procedure Store WAL, id=1 > 2020-01-14 06:13:23,454 DEBUG [master/hbase01:16000:becomeActiveMaster] > wal.WALProcedureStore: Lease acquired for flushLogId=1 > 2020-01-14 06:13:23,454 DEBUG [master/hbase01:16000:becomeActiveMaster] > wal.WALProcedureStore: No state logs to replay. > 2020-01-14 06:13:23,454 INFO [master/hbase01:16000:becomeActiveMaster] > region.RegionProcedureStore: The WALProcedureStore max pid is 0, and the max > pid of all loaded procedures is -1 > 2020-01-14 06:13:23,455 INFO [master/hbase01:16000:becomeActiveMaster] > region.RegionProcedureStore: Migration of WALProcedureStore finished > 2020-01-14 06:13:23,455 INFO [master/hbase01:16000:becomeActiveMaster] > procedure2.ProcedureExecutor: Recovered RegionProcedureStore lease in 784 msec > 2020-01-14 06:13:23,471 TRACE [WALProcedureStoreSyncThread] > wal.WALProcedureStore: no active procedures > 2020-01-14 06:13:23,516 INFO [master/hbase01:16000:becomeActiveMaster] > procedure2.ProcedureExecutor: Loaded RegionProcedureStore in 61 msec > 2020-01-14 06:13:23,516 INFO [master/hbase01:16000:becomeActiveMaster] > procedure2.RemoteProcedureDispatcher: Instantiated, coreThreads=128 > (allowCoreThreadTimeOut=true), queueMaxSize=32, operationDelay=150 > 2020-01-14 06:13:23,628 INFO [WALProcedureStoreSyncThread] > wal.WALProcedureStore: Rolled new Procedure Store WAL, id=2 > 2020-01-14 06:13:23,634 INFO [WALProcedureStoreSyncThread] > wal.WALProcedureStore: Remove all state logs with ID less than 1, since no > active procedures > 2020-01-14 06:13:23,634 TRACE [WALProcedureStoreSyncThread] > wal.WALProcedureStore: Removing > log=file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log > 2020-01-14 06:13:23,634 INFO [WALProcedureStoreSyncThread] > wal.ProcedureWALFile: Archiving > file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log > to > file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/oldWALs/pv2-00000000000000000001.log > 2020-01-14 06:13:23,634 ERROR [WALProcedureStoreSyncThread] > wal.WALProcedureStore: Unable to remove log: > file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log > java.io.FileNotFoundException: File > file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log > does not exist > ... > 2020-01-14 06:13:23,635 ERROR [WALProcedureStoreSyncThread] > wal.WALProcedureStore: Unable to remove log: > file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log > java.io.FileNotFoundException: File > file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635) > ... > 2020-01-14 06:13:37,352 ERROR [WALProcedureStoreSyncThread] > wal.WALProcedureStore: Unable to remove log: > file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log > java.io.FileNotFoundException: File > file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635) > {noformat} > I believe a simple solution in this case is to explicitly call > _WALProcedureStore.stop_ method at the end of the migration, which stops > _WALProcedureStore_ internal _syncLoop_ thread. > Am opening a PR soon. -- This message was sent by Atlassian Jira (v8.3.4#803005)