[jira] [Commented] (HBASE-22078) corrupted procs in proc WAL
[ https://issues.apache.org/jira/browse/HBASE-22078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804306#comment-16804306 ]

Sergey Shelukhin commented on HBASE-22078:
--

No, the logs that far back were not available..

> corrupted procs in proc WAL
> ---
>
> Key: HBASE-22078
> URL: https://issues.apache.org/jira/browse/HBASE-22078
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Major
>
> Not sure what the root cause is... there are ~500 proc wal files (I actually
> wonder if cleanup is also blocked by this, since I see these lines on master
> restart, do WALs with abandoned procedures like that get deleted?).
> {noformat}
> 2019-03-20 07:37:53,212 ERROR [master/...:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 7571, max stack id is 7754, root procedure is Procedure(pid=66829, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.DisableTableProcedure)
> 2019-03-20 07:37:53,213 ERROR [master/...:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 7600, max stack id is 7754, root procedure is Procedure(pid=66829, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.DisableTableProcedure)
> 2019-03-20 07:37:53,213 ERROR [master/...:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 7610, max stack id is 7754, root procedure is Procedure(pid=66829, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.DisableTableProcedure)
> 2019-03-20 07:37:53,213 ERROR [master/...:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 7631, max stack id is 7754, root procedure is Procedure(pid=66829, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.DisableTableProcedure)
> 2019-03-20 07:37:53,213 ERROR [master/...:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 7650, max stack id is 7754, root procedure is Procedure(pid=66829, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.DisableTableProcedure)
> 2019-03-20 07:37:53,213 ERROR [master/...:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 7651, max stack id is 7754, root procedure is Procedure(pid=66829, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.DisableTableProcedure)
> 2019-03-20 07:37:53,213 ERROR [master/...:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 7657, max stack id is 7754, root procedure is Procedure(pid=66829, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.DisableTableProcedure)
> 2019-03-20 07:37:53,213 ERROR [master/...:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 7683, max stack id is 7754, root procedure is Procedure(pid=66829, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.DisableTableProcedure)
> {noformat}
> Followed by
> {noformat}
> 2019-03-20 07:37:53,751 ERROR [master/...:17000:becomeActiveMaster] procedure2.ProcedureExecutor: Corrupt pid=66829, state=WAITING:DISABLE_TABLE_ADD_REPLICATION_BARRIER, hasLock=false; DisableTableProcedure table=...
> {noformat}
> And 1000s of child procedures and grandchild procedures of this procedure.
> I think this area needs general review... we should have a record for the
> procedure durably persisted before we create any child procedures, so I'm not
> sure how this could happen. Actually, I also wonder why we even have separate
> proc WAL when HBase already has a working WAL that's more or less time
> tested...

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
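
For readers unfamiliar with the "Missing stack id" errors quoted above: the messages read like the output of a contiguity check, where every stack id from 0 up to the maximum seen for a root procedure is expected to be present after loading, and each gap is reported. The following is a minimal sketch of such a check, not the actual WALProcedureTree code; the class name StackIdGapCheck and the method missingStackIds are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; this is not HBase code.
class StackIdGapCheck {

    /** Returns every stack id in [0, max seen] that does not appear in the input. */
    static List<Integer> missingStackIds(Iterable<Integer> seenStackIds) {
        TreeSet<Integer> seen = new TreeSet<>();
        for (int id : seenStackIds) {
            seen.add(id);
        }
        List<Integer> missing = new ArrayList<>();
        if (seen.isEmpty()) {
            return missing; // nothing loaded, so there is no range to check
        }
        int max = seen.last();
        for (int id = 0; id <= max; id++) {
            if (!seen.contains(id)) {
                missing.add(id); // a gap: the record carrying this stack id was not loaded
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Stack ids collected from a root procedure and its descendants (made-up data).
        List<Integer> ids = List.of(0, 1, 2, 4, 5, 7);
        System.out.println(missingStackIds(ids)); // prints [3, 6]
    }
}
```

Under this reading, a single gap would be enough to mark the whole tree under the root procedure as corrupt, which would be consistent with the "Corrupt pid=66829" line and the children reported after it.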
[jira] [Commented] (HBASE-22078) corrupted procs in proc WAL
[ https://issues.apache.org/jira/browse/HBASE-22078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802929#comment-16802929 ]

stack commented on HBASE-22078:
---

bq. there are ~500 proc wal files.

This is a problem. The cause has probably rolled away? Can you see where it went bad?
[jira] [Commented] (HBASE-22078) corrupted procs in proc WAL
[ https://issues.apache.org/jira/browse/HBASE-22078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798448#comment-16798448 ]

Sean Busbey commented on HBASE-22078:
-

to map to the pre-existing WAL subsystem we'd make up some Key-Value structure to represent a procedure and then treat procedure completion as "flushed and safe to discard"?
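
One way to picture the mapping floated in the comment above is sketched here. This is a toy model under stated assumptions, not a proposal for HBase's API: the class ProcKvSketch, its methods, and the notion of a numbered log segment are all hypothetical. The key is the procedure id, the value is its latest serialized state, and a segment becomes discardable once no live procedure has its latest record in it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; names and structure are assumptions.
class ProcKvSketch {
    // Latest serialized state per procedure id (the "value" for key pid).
    private final Map<Long, byte[]> latestState = new HashMap<>();
    // Log segment that holds the latest record of each live procedure.
    private final Map<Long, Integer> segmentOfLatest = new HashMap<>();

    /** Records a new state for pid, written to the given log segment. */
    void put(long pid, byte[] state, int segment) {
        latestState.put(pid, state);
        segmentOfLatest.put(pid, segment);
    }

    /** Marks pid as finished: its records are now "flushed and safe to discard". */
    void complete(long pid) {
        latestState.remove(pid);
        segmentOfLatest.remove(pid);
    }

    /** A segment can be discarded when no live procedure's latest record is in it. */
    boolean canDiscard(int segment) {
        return !segmentOfLatest.containsValue(segment);
    }
}
```

In this model a single procedure that never completes and is never rewritten keeps its segment pinned indefinitely, which is the kind of effect the question about ~500 proc wal files in the issue description is asking about.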