[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host
[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964918#action_12964918 ]

Jonathan Gray commented on HBASE-3243:

+1 to your proposal, stack.

I've also gone through this several times and have not been able to come up with anything besides what is contained in these patches.

Todd, do your best to break the new RC :)

Disable Table closed region on wrong host

Key: HBASE-3243
URL: https://issues.apache.org/jira/browse/HBASE-3243
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
Fix For: 0.90.0
Attachments: hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch, hri.diff

I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host
[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964924#action_12964924 ]

stack commented on HBASE-3243:

Did you intend to do this in your patch, Jon?

{code}
@@ -1359,11 +1361,6 @@
     }
     synchronized (this.regions) {
       this.regions.remove(hri);
-    }
-    synchronized (this.regionPlans) {
-      this.regionPlans.remove(hri.getEncodedName());
-    }
-    synchronized (this.servers) {
       for (List<HRegionInfo> regions : this.servers.values()) {
         for (int i=0;i<regions.size();i++) {
           if (regions.get(i).equals(hri)) {
{code}
[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host
[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964926#action_12964926 ]

Jonathan Gray commented on HBASE-3243:

Yes. That was wrong: we never use this.servers as a lock; we always manipulate this.regions and this.servers under the this.regions lock.

The removal from regionPlans was just moved underneath it; I didn't take that part out. It's in the next chunk of the diff.
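The locking rule described in this comment (a single monitor, this.regions, guarding both the region map and the per-server lists) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the HBase code: the class name AssignmentBook, its methods, and the use of plain strings for regions and servers are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the master's assignment bookkeeping; not HBase code.
class AssignmentBook {
    // Region name -> hosting server. This map is also the single lock object.
    private final TreeMap<String, String> regions = new TreeMap<>();
    // Server name -> regions it hosts. Only touched while holding the regions lock.
    private final Map<String, List<String>> servers = new HashMap<>();

    void assign(String region, String server) {
        synchronized (regions) {
            String previous = regions.put(region, server);
            if (previous != null) {
                // Keep the reverse index consistent when a region moves.
                servers.get(previous).remove(region);
            }
            servers.computeIfAbsent(server, s -> new ArrayList<>()).add(region);
        }
    }

    void offline(String region) {
        // Both structures are updated inside one critical section, so no
        // reader can observe the region in one map but not the other.
        synchronized (regions) {
            String server = regions.remove(region);
            if (server != null) {
                servers.get(server).remove(region);
            }
        }
    }

    String serverOf(String region) {
        synchronized (regions) {
            return regions.get(region);
        }
    }

    int regionCountOn(String server) {
        synchronized (regions) {
            List<String> hosted = servers.get(server);
            return hosted == null ? 0 : hosted.size();
        }
    }
}
```

Using one lock for both structures trades some concurrency for the guarantee that the forward map and the reverse index never disagree, which is the property the disable-table path depends on.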
[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host
[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965020#action_12965020 ]

Todd Lipcon commented on HBASE-3243:

Sounds good to me.

Disable Table closed region on wrong host

Key: HBASE-3243
URL: https://issues.apache.org/jira/browse/HBASE-3243
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Fix For: 0.90.1
Attachments: 3243-v2.patch, hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch, hri.diff
[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host
[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934535#action_12934535 ]

Jonathan Gray commented on HBASE-3243:

Well, try running again with my patch. Or you could even run it again without the patch to see if it happens again, and we could get another set of logs.

I guess run it with the patch, and then if it doesn't ever happen again we can punt the issue or resolve it until we see it again.
[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host
[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933628#action_12933628 ]

stack commented on HBASE-3243:

This is an odd one. I don't see anything jumping out at me. Need to dig more.
[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host
[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932894#action_12932894 ]

Jonathan Gray commented on HBASE-3243:

Dug into the code around compareTo() and the values instance. Pretty sure we are not touching that while a cluster is running, and definitely not during disabling. We've actually even taken the 'offline' flag out of META as well; it's now a node in ZK that signals the state of a table (enabling/disabling/disabled).

It is interesting that the HRI comparator uses the full HTD comparator. It should probably just use the tableName itself to compare, though considering HTD is a member, it might make sense in some circumstances to do the full HTD.compareTo()?

Looking around the code, it does look like there are multiple places where we're using regions w/o a lock in the disable table path. Pretty sure this is the cause. Patch soon.
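The suggestion to compare on the table name rather than the full descriptor could look roughly like the sketch below. It is an assumption-laden illustration, not HRegionInfo: the RegionKey class, its fields (including descriptorVersion as a stand-in for mutable descriptor state), and the BY_IDENTITY comparator are hypothetical names.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical region key; field names are illustrative, not HRegionInfo's.
class RegionKey {
    final String tableName;
    final String startKey;
    final long regionId;
    int descriptorVersion; // mutable metadata, deliberately left out of the ordering

    RegionKey(String tableName, String startKey, long regionId, int descriptorVersion) {
        this.tableName = tableName;
        this.startKey = startKey;
        this.regionId = regionId;
        this.descriptorVersion = descriptorVersion;
    }
}

class RegionKeyOrdering {
    // Orders keys by immutable identity fields only.
    static final Comparator<RegionKey> BY_IDENTITY = (x, y) -> {
        int c = x.tableName.compareTo(y.tableName);
        if (c != 0) {
            return c;
        }
        c = x.startKey.compareTo(y.startKey);
        if (c != 0) {
            return c;
        }
        return Long.compare(x.regionId, y.regionId);
    };

    static String lookupAfterDescriptorChange() {
        TreeMap<RegionKey, String> regions = new TreeMap<>(BY_IDENTITY);
        RegionKey key = new RegionKey("usertable", "row0100", 1L, 1);
        regions.put(key, "server1");
        regions.put(new RegionKey("usertable", "row0200", 2L, 1), "server2");
        key.descriptorVersion = 2; // has no effect on BY_IDENTITY
        return regions.get(key);
    }
}
```

Because the ordering ignores mutable state, a later change to the descriptor cannot move a key relative to its neighbours in the map. The cost is that two keys differing only in descriptor contents compare as equal, which is the trade-off the comment raises.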
[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host
[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932947#action_12932947 ]

Jonathan Gray commented on HBASE-3243:

This is very weird. Can you put up the full logs somewhere?
[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host
[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933094#action_12933094 ]

Todd Lipcon commented on HBASE-3243:

bq. Looking at this more, I'm not sure synchronization is the issue here because TreeMap appears to only be not thread-safe when there are mutations. The two critical pieces of code where a conflict could happen are where we read the server a region is assigned to, and where we set the server a region is assigned to

What about removals? I thought I saw a couple of places with remove() that were unsynchronized. Will take a look at your patch momentarily.

bq. This is very weird. Can you put up the full logs somewhere

Yep, will upload them here; it's just fake data, nothing secret.
[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host
[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932818#action_12932818 ]

Todd Lipcon commented on HBASE-3243:

No idea if it's related, but it seems like AssignmentManager's {{regions}} member is sometimes accessed without synchronization. We could end up getting some incorrect data due to this.
[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host
[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932825#action_12932825 ]

Todd Lipcon commented on HBASE-3243:

Just grasping at straws here, but another thought is this:

The keys of the {{regions}} map are HRegionInfo, where compareTo() includes calling down to compareTo() on HTableDescriptor. HTableDescriptor's compareTo() eventually delegates to hashCode() on the {{values}} instance. Do we end up changing the {{values}} instance when a table gets disabled? Perhaps this then changes the sort order in the TreeMap and confuses things?
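The failure mode hypothesized in this comment, a key whose ordering changes while it sits in a TreeMap, is easy to reproduce in isolation. The sketch below is not HBase code: MutableKey and its descriptorHash field are hypothetical stand-ins for a key whose compareTo() depends on mutable state.

```java
import java.util.TreeMap;

// Hypothetical key whose ordering depends on a mutable field; not HBase code.
class MutableKey implements Comparable<MutableKey> {
    final String name;
    int descriptorHash; // mutable, and part of the ordering

    MutableKey(String name, int descriptorHash) {
        this.name = name;
        this.descriptorHash = descriptorHash;
    }

    @Override
    public int compareTo(MutableKey other) {
        int c = Integer.compare(descriptorHash, other.descriptorHash);
        return c != 0 ? c : name.compareTo(other.name);
    }
}

class MutableKeyDemo {
    static boolean lookupAfterMutation() {
        TreeMap<MutableKey, String> regions = new TreeMap<>();
        MutableKey a = new MutableKey("a", 10);
        MutableKey b = new MutableKey("b", 20);
        MutableKey c = new MutableKey("c", 30);
        regions.put(b, "server2"); // becomes the root
        regions.put(a, "server1"); // left child of b
        regions.put(c, "server3"); // right child of b

        // The entry for 'a' stays in the left subtree, but it now compares
        // greater than both b and c, so a lookup walks right and misses it.
        a.descriptorHash = 40;
        return regions.containsKey(a);
    }
}
```

Here lookupAfterMutation() returns false even though the entry is still physically in the map, which is the kind of silent inconsistency that could make a lookup return the wrong region-to-server mapping or none at all.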