[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2017-11-08 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-16721:
---
Fix Version/s: (was: 1.4.0)

> Concurrency issue in WAL unflushed seqId tracking
> -
>
> Key: HBASE-16721
> URL: https://issues.apache.org/jira/browse/HBASE-16721
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 1.0.0, 1.1.0, 1.2.0
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
>Priority: Critical
> Fix For: 2.0.0, 1.3.0, 1.2.4, 1.1.8
>
> Attachments: hbase-16721_addendum.patch, 
> hbase-16721_addendum2.branch-1.1.patch, 
> hbase-16721_addendum2.branch-1.1.patch, 
> hbase-16721_addendum2.branch-1.1.patch, hbase-16721_v1.branch-1.patch, 
> hbase-16721_v2.branch-1.patch, hbase-16721_v2.master.patch
>
>
> I'm inspecting an interesting case where, in a production cluster, some 
> regionservers end up accumulating hundreds of WAL files, even with forced 
> flushes being triggered because of the max-logs limit. This happened multiple 
> times on that cluster, but not on other clusters. The cluster has the periodic 
> memstore flusher disabled; however, this still does not explain why the forced 
> flush of regions due to the max-logs limit is not working. I think the periodic 
> memstore flusher just masks the underlying problem, which is why we do not see 
> this in other clusters. 
> The problem starts like this: 
> {code}
> 2016-09-21 17:49:18,272 INFO  [regionserver//10.2.0.55:16020.logRoller] 
> wal.FSHLog: Too many wals: logs=33, maxlogs=32; forcing flush of 1 
> regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-21 17:49:18,273 WARN  [regionserver//10.2.0.55:16020.logRoller] 
> regionserver.LogRoller: Failed to schedule flush of 
> d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> Then it continues until the RS is restarted: 
> {code}
> 2016-09-23 17:43:49,356 INFO  [regionserver//10.2.0.55:16020.logRoller] 
> wal.FSHLog: Too many wals: logs=721, maxlogs=32; forcing flush of 1 
> regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-23 17:43:49,357 WARN  [regionserver//10.2.0.55:16020.logRoller] 
> regionserver.LogRoller: Failed to schedule flush of 
> d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> The problem is that region {{d4cf39dc40ea79f5da4d0cf66d03cb1f}} was already 
> split some time ago and was able to flush its data and split without any 
> problems. However, the FSHLog still thinks that there is unflushed data for 
> this region. 
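
To make that bookkeeping concrete, here is a minimal, self-contained sketch of the kind of per-region unflushed-seqId tracking the WAL does; the class and method names are illustrative stand-ins, not FSHLog's real internals. A WAL file can only be archived once every region with edits in it has flushed past that file's sequence ids, so a single stale per-region entry pins every newer WAL file:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative stand-in for the WAL's per-region unflushed-seqId bookkeeping. */
public class UnflushedSeqIdTracker {
  // encoded region name -> lowest sequence id that is only in the memstore
  private final Map<String, Long> lowestUnflushed = new ConcurrentHashMap<>();

  /** Called on a WAL append for the region; only the first append after a flush registers. */
  public void onAppend(String encodedRegionName, long seqId) {
    lowestUnflushed.putIfAbsent(encodedRegionName, seqId);
  }

  /** Called when a flush (or split/close) completes; the region has no unflushed edits left. */
  public void onFlushCompleted(String encodedRegionName) {
    lowestUnflushed.remove(encodedRegionName);
  }

  /** The log roller asks: can a WAL file whose highest seqId is walMaxSeqId be archived? */
  public boolean canArchive(long walMaxSeqId) {
    return lowestUnflushed.values().stream().allMatch(seq -> seq > walMaxSeqId);
  }

  public static void main(String[] args) {
    UnflushedSeqIdTracker tracker = new UnflushedSeqIdTracker();
    String region = "d4cf39dc40ea79f5da4d0cf66d03cb1f";
    tracker.onAppend(region, 3);
    tracker.onFlushCompleted(region);   // region flushes its data and is split away...
    tracker.onAppend(region, 3);        // ...but a racing append re-registers a stale entry
    // The stale entry now pins all newer WAL files; the roller keeps requesting a
    // flush of a region that no longer has anything to flush.
    System.out.println(tracker.canArchive(100));   // false
  }
}
{code}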





[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-10-17 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Pushed addendum2 to 1.1. Other branches do not need it. 



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-10-14 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
Attachment: hbase-16721_addendum2.branch-1.1.patch



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-10-13 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
Attachment: hbase-16721_addendum2.branch-1.1.patch



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-10-12 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
Status: Patch Available  (was: Reopened)



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-10-12 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
Attachment: hbase-16721_addendum2.branch-1.1.patch

This addendum is needed for 1.1, which lacks a proper fix for HBASE-16820. It 
basically undoes the HRegion changes and instead adds a 
{{waitForPreviousTransactionsComplete()}} call at the beginning. We end up doing 
two mvcc transactions, but at the end of the flush we are guaranteed that the 
mvcc read point has advanced to the flush sequence id. 
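
A condensed sketch of that ordering, with stand-in interfaces in place of HRegion's MVCC, updates lock, and WAL (an illustration of the idea, not the actual addendum code). In contrast to the original branch-1 patch, the wait here happens at the very beginning, before the updates lock is taken:

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of the 1.1 addendum idea; all types and names are stand-ins, not HBase classes. */
public class AddendumFlushSketch {
  interface Mvcc { void waitForPreviousTransactionsComplete(); }
  interface Wal  { long startCacheFlush(String encodedRegionName); }

  private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();

  long prepareFlush(Mvcc mvcc, Wal wal, String encodedRegionName) {
    // First mvcc transaction: drain writes already in flight so the read point
    // reflects every sequence id handed out so far.
    mvcc.waitForPreviousTransactionsComplete();

    updatesLock.writeLock().lock();   // block new writes while the snapshot point is chosen
    try {
      // By the time the flush commits (the second mvcc transaction), the read
      // point is guaranteed to have advanced to this flush sequence id.
      return wal.startCacheFlush(encodedRegionName);
    } finally {
      updatesLock.writeLock().unlock();
    }
  }
}
{code}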



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-10-10 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
Release Note: Fixed a bug in WAL sequenceId tracking where a rare race 
condition caused WAL files to accumulate without being deleted. 



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-10-01 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
Attachment: hbase-16721_addendum.patch

Pushed the addendum. Thanks Duo. 



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-09-29 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: (was: 1.1.7)
   1.1.8
   Status: Resolved  (was: Patch Available)

Pushed this. Thanks everyone. 



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-09-28 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
Attachment: hbase-16721_v2.master.patch

Attaching the master patch as well. On master, the test does not fail since, as 
explained above, we do not unlock the updatesLock until the transaction is fully 
complete (the WAL.append() has gone through). Hence the master patch contains 
only the test case. 
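
Schematically, the difference between the two write paths comes down to when the lock is released relative to the mvcc/WAL completion; the runnables below are stand-ins, not the real HRegion code:

{code}
/** Schematic comparison only; not the actual HRegion write paths. */
public class WritePathOrdering {

  // branch-1 before the fix: the lock is released before the append's mvcc
  // transaction completes, so a flush can start between the two steps and
  // prematurely clear (or miss) the seqId being registered.
  static void branch1Write(Runnable appendToWal, Runnable completeMvcc, Runnable releaseLock) {
    appendToWal.run();
    releaseLock.run();    // a flush may start here
    completeMvcc.run();
  }

  // master: the lock is held until the transaction is fully complete (the
  // WAL.append() went through), so a flush never observes a half-registered edit.
  static void masterWrite(Runnable appendToWal, Runnable completeMvcc, Runnable releaseLock) {
    appendToWal.run();
    completeMvcc.run();
    releaseLock.run();
  }
}
{code}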



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-09-28 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
Attachment: hbase-16721_v2.branch-1.patch

I was able to come up with a unit test for this. It is a bit involved, but it 
basically simulates the situation by holding the threads at the appropriate 
times. As any good unit test should, it fails without the patch: 
{code}
java.lang.AssertionError: Found seqId for the region which is already flushed 
Expected :-1
Actual   :3
{code}

Let me see whether we need the patch in master. 
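
As a stand-alone illustration of the race the test simulates (a toy tracker and two threads coordinated with a latch, not the real FSHLog/HRegion machinery): the flush clears the region's entry while an append that has already passed its checks re-registers it afterwards, leaving exactly the stale seqId the assertion catches:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

/** Toy demonstration of the race the unit test exercises; not HBase code. */
public class UnflushedSeqIdRaceDemo {
  static final Map<String, Long> lowestUnflushed = new ConcurrentHashMap<>();

  public static void main(String[] args) throws InterruptedException {
    final String region = "d4cf39dc40ea79f5da4d0cf66d03cb1f";
    final CountDownLatch flushDone = new CountDownLatch(1);

    // Writer: has already obtained seqId 3 for its append, but is held just
    // before it registers that seqId with the WAL's bookkeeping.
    Thread writer = new Thread(() -> {
      try {
        flushDone.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      lowestUnflushed.putIfAbsent(region, 3L);   // registered after the flush already cleared it
    });
    writer.start();

    // Flusher: snapshots and clears the region's entry, unaware of the held writer.
    lowestUnflushed.remove(region);
    flushDone.countDown();
    writer.join();

    // Expected -1 (nothing unflushed), actual 3 -- the same mismatch the test asserts on.
    System.out.println("earliest unflushed seqId: " + lowestUnflushed.getOrDefault(region, -1L));
  }
}
{code}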



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-09-28 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
Status: Patch Available  (was: Open)



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-09-28 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
--
Attachment: hbase-16721_v1.branch-1.patch

This should solve the problem for branch-1. On master, we may not have the same 
problem, since there we keep holding the region read lock until the existing 
transaction has been sync'ed.

The patch makes it so that, after getting the region write lock, we also wait 
for the ongoing transactions before starting the FSHLog.startCacheFlush() call. 
We already wait for all transactions to finish under the region write lock, so 
semantically the behavior should not change (it would be good if the reviewers 
verified this independently). 
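
In sketch form, with stand-in types for the region lock, MVCC and WAL (this is not the actual patch), the prepare step becomes: take the write lock, drain the in-flight transactions, then start the cache flush:

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of the ordering the branch-1 patch enforces; names are stand-ins, not HBase classes. */
public class FlushPrepareOrdering {
  interface Mvcc { void waitForPreviousTransactionsComplete(); }
  interface Wal  { long startCacheFlush(String encodedRegionName); }

  private final ReentrantReadWriteLock regionLock = new ReentrantReadWriteLock();

  long prepareFlush(Mvcc mvcc, Wal wal, String encodedRegionName) {
    regionLock.writeLock().lock();   // no new writes can begin past this point
    try {
      // Added by the patch: also drain the writes that started before the lock
      // was taken, so every seqId they produced is visible to startCacheFlush().
      mvcc.waitForPreviousTransactionsComplete();
      return wal.startCacheFlush(encodedRegionName);
    } finally {
      regionLock.writeLock().unlock();
    }
  }
}
{code}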



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-09-28 Thread Sean Busbey (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated HBASE-16721:

Affects Version/s: 1.0.0



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-09-28 Thread Sean Busbey (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated HBASE-16721:

Priority: Critical  (was: Major)



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-09-28 Thread Sean Busbey (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated HBASE-16721:

Affects Version/s: 1.1.0
   1.2.0



[jira] [Updated] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-09-28 Thread Sean Busbey (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated HBASE-16721:

Component/s: wal
