[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-16721:
-----------------------------------
    Fix Version/s:     (was: 1.4.0)

> Concurrency issue in WAL unflushed seqId tracking
> -------------------------------------------------
>
>                 Key: HBASE-16721
>                 URL: https://issues.apache.org/jira/browse/HBASE-16721
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0, 1.2.4, 1.1.8
>
>         Attachments: hbase-16721_addendum.patch, hbase-16721_addendum2.branch-1.1.patch, hbase-16721_addendum2.branch-1.1.patch, hbase-16721_addendum2.branch-1.1.patch, hbase-16721_v1.branch-1.patch, hbase-16721_v2.branch-1.patch, hbase-16721_v2.master.patch
>
> I'm inspecting an interesting case where, in a production cluster, some regionservers end up accumulating hundreds of WAL files, even with force flushes going on due to max logs. This happened multiple times on the cluster, but not on other clusters. The cluster has the periodic memstore flusher disabled; however, this still does not explain why the force flush of regions due to the max-logs limit is not working. I think the periodic memstore flusher just masks the underlying problem, which is why we do not see this in other clusters.
> The problem starts like this:
> {code}
> 2016-09-21 17:49:18,272 INFO [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=33, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-21 17:49:18,273 WARN [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> then it continues until the RS is restarted:
> {code}
> 2016-09-23 17:43:49,356 INFO [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=721, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-23 17:43:49,357 WARN [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> The problem is that region {{d4cf39dc40ea79f5da4d0cf66d03cb1f}} was already split some time ago, and it was able to flush its data and split without any problems. However, the FSHLog still thinks that there is some unflushed data for this region.
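For context, here is a minimal, hypothetical sketch of the per-region bookkeeping involved (simplified names, not the actual FSHLog code): the WAL tracks the oldest unflushed sequence id per region, and a stale entry for a region that has already flushed and split pins old WAL files indefinitely, which is exactly the logs=721 symptom in the description.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the WAL's unflushed-seqId accounting.
class UnflushedSeqIdSketch {
  // encoded region name -> oldest sequence id not yet flushed to an HFile
  private final Map<String, Long> oldestUnflushed = new ConcurrentHashMap<>();

  void onAppend(String region, long seqId) {
    // the first edit after a flush re-registers the region
    oldestUnflushed.putIfAbsent(region, seqId);
  }

  void onFlushCompleted(String region) {
    // if a racing append re-inserts the entry around a flush, the entry
    // survives the region's split and nothing is left to ever clear it
    oldestUnflushed.remove(region);
  }

  boolean walStillNeeded(long highestSeqIdInWal) {
    // a WAL file can only be archived once no region has unflushed edits in it
    return oldestUnflushed.values().stream().anyMatch(s -> s <= highestSeqIdInWal);
  }
}
{code}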
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Pushed addendum2 to 1.1. Other branches do not need it.
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Attachment: hbase-16721_addendum2.branch-1.1.patch
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Attachment: hbase-16721_addendum2.branch-1.1.patch
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Status: Patch Available  (was: Reopened)
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Attachment: hbase-16721_addendum2.branch-1.1.patch

This addendum is needed for 1.1, which lacks a proper fix for HBASE-16820. It basically undoes the HRegion changes, and instead adds a {{waitForPreviousTransactionsComplete()}} call at the beginning. We end up doing two mvcc transactions, but at the end of the flush we are guaranteed that the mvcc read point has advanced to the flush sequence id.
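A minimal sketch of that ordering, assuming a hypothetical Mvcc stand-in (the comment above names the real branch-1.1 call, {{MultiVersionConsistencyControl.waitForPreviousTransactionsComplete()}}; everything else here is elided):
{code}
// Hedged sketch of the addendum2 approach; Mvcc is a placeholder interface,
// not the real HBase type, and the flush body is omitted.
class Addendum2Sketch {
  interface Mvcc { void waitForPreviousTransactionsComplete(); }

  void flush(Mvcc mvcc) {
    // Drain in-flight transactions before the flush does anything else, so
    // every write that already obtained a seqId has finished registering it.
    mvcc.waitForPreviousTransactionsComplete();
    // ... the rest of the flush proceeds unchanged; this costs a second mvcc
    // transaction, but the read point is guaranteed to reach the flush seqId.
  }
}
{code}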
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Release Note: Fixed a bug in sequenceId tracking for the WALs that caused WAL files to accumulate without being deleted due to a rare race condition.
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Attachment: hbase-16721_addendum.patch

Pushed the addendum. Thanks Duo.
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s:     (was: 1.1.7)
                   1.1.8
           Status: Resolved  (was: Patch Available)

Pushed this. Thanks everyone.
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Attachment: hbase-16721_v2.master.patch

Attaching the master patch as well. On master, the test does not fail since, as explained above, we do not unlock the updatesLock until the transaction is fully complete (the WAL.append() went through). Hence the master patch contains only the test case.
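A minimal runnable sketch of why that locking discipline closes the race (the lock and the runnables are hypothetical stand-ins, not HBase code): the writer holds the read lock until its append is fully registered, so a flusher holding the write lock can never observe a half-registered append.
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

class WritePathSketch {
  private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();

  void append(Runnable walAppend) {
    updatesLock.readLock().lock();
    try {
      // append to the WAL and register the assigned sequence id; the lock is
      // only released after this has fully completed (the master behavior)
      walAppend.run();
    } finally {
      updatesLock.readLock().unlock();
    }
  }

  void flush(Runnable snapshotAndFlush) {
    updatesLock.writeLock().lock(); // excludes all in-flight appends
    try {
      snapshotAndFlush.run();
    } finally {
      updatesLock.writeLock().unlock();
    }
  }
}
{code}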
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Attachment: hbase-16721_v2.branch-1.patch

I was able to come up with a unit test for this. It is a bit involved, but it basically simulates the situation by holding the threads at the appropriate times. As any good unit test should, it fails without the patch:
{code}
java.lang.AssertionError: Found seqId for the region which is already flushed
Expected :-1
Actual   :3
{code}
Let me see whether we need the patch in master.
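For illustration, a hedged sketch of the hold-the-threads technique (a hypothetical harness, not the actual test in the patch): latches park the writer mid-append so the flush can run at exactly the racy moment, after which the test would assert that the WAL no longer tracks an unflushed seqId (-1) for the region.
{code}
import java.util.concurrent.CountDownLatch;

class RaceHarnessSketch {
  final CountDownLatch appendStarted = new CountDownLatch(1);
  final CountDownLatch resumeAppend  = new CountDownLatch(1);

  void racyAppend() throws InterruptedException {
    appendStarted.countDown(); // signal: the writer is inside the append
    resumeAppend.await();      // park here until the flush has run
    // ... finish registering the seqId with the WAL ...
  }

  void driveRace() throws InterruptedException {
    Thread writer = new Thread(() -> {
      try { racyAppend(); } catch (InterruptedException ignored) { }
    });
    writer.start();
    appendStarted.await();     // the writer is now parked mid-append
    // ... run the flush here, then let the append finish: ...
    resumeAppend.countDown();
    writer.join();
    // ... assert the WAL reports no unflushed seqId (-1) for the region ...
  }
}
{code}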
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Attachment: hbase-16721_v1.branch-1.patch

This should solve the problem for branch-1. Master may not have the same problem, since there we keep holding the region read lock until the existing transaction has been sync'ed. The patch makes it so that, after getting the region write lock, we also wait for the ongoing transactions before starting the FSHLog.startCacheFlush() call. We already wait for all transactions to finish under the region write lock, so semantically the behavior should not change (it would be good if the reviewers verified this independently).
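A minimal sketch of the ordering the patch enforces, with hypothetical Mvcc/Wal stand-ins for the HRegion internals (not the actual patch code):
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

class FlushPrepareSketch {
  interface Mvcc { void waitForPreviousTransactionsComplete(); }
  interface Wal  { void startCacheFlush(String encodedRegionName); }

  private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();

  void prepareFlush(Mvcc mvcc, Wal wal, String encodedRegionName) {
    updatesLock.writeLock().lock(); // no new appends can start
    try {
      // Drain appends that already obtained a seqId but have not finished
      // registering it with the WAL; without this, startCacheFlush() can
      // record a stale oldest-unflushed seqId that is never cleared.
      mvcc.waitForPreviousTransactionsComplete();
      wal.startCacheFlush(encodedRegionName);
      // ... snapshot the memstores, then release the lock and flush ...
    } finally {
      updatesLock.writeLock().unlock();
    }
  }
}
{code}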
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated HBASE-16721:
--------------------------------
    Affects Version/s: 1.0.0
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated HBASE-16721:
--------------------------------
    Priority: Critical  (was: Major)
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated HBASE-16721:
--------------------------------
    Affects Version/s: 1.1.0
                       1.2.0
[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated HBASE-16721:
--------------------------------
    Component/s: wal