[jira] [Commented] (HBASE-14020) Unsafe based optimized write in ByteBufferOutputStream
[ https://issues.apache.org/jira/browse/HBASE-14020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614713#comment-14614713 ] Hudson commented on HBASE-14020: SUCCESS: Integrated in HBase-TRUNK #6629 (See [https://builds.apache.org/job/HBase-TRUNK/6629/]) HBASE-14020 Unsafe based optimized write in ByteBufferOutputStream. (anoopsamjohn: rev 7d3456d8fd027e252b1da7578e943f146626135d) * hbase-common/src/main/java/org/apache/hadoop/hbase/util/UnsafeAccess.java * hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java * hbase-common/src/main/java/org/apache/hadoop/hbase/io/ByteBufferOutputStream.java Unsafe based optimized write in ByteBufferOutputStream -- Key: HBASE-14020 URL: https://issues.apache.org/jira/browse/HBASE-14020 Project: HBase Issue Type: Sub-task Components: Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: HBASE-14020.patch, HBASE-14020_v2.patch, HBASE-14020_v3.patch, benchmark.zip We use this class to build the cellblock at the RPC layer. The write operation currently does puts into a java ByteBuffer, which carries a lot of overhead. Instead we can do an Unsafe-based copy-to-buffer operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
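The idea behind the patch can be illustrated with a minimal sketch. This is not the committed UnsafeAccess/ByteBufferUtils code; the class and method names are hypothetical. It copies a byte[] into a direct ByteBuffer with one bulk Unsafe.copyMemory call instead of going through ByteBuffer.put:

```java
import java.lang.reflect.Field;
import java.nio.Buffer;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

public class UnsafeCopySketch {
    private static final Unsafe UNSAFE;
    private static final long ADDRESS_OFFSET; // field offset of java.nio.Buffer.address

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            ADDRESS_OFFSET = UNSAFE.objectFieldOffset(Buffer.class.getDeclaredField("address"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Hypothetical helper: copy src[offset..offset+len) into the direct buffer 'dst'
    // at its current position with a single copyMemory call, then advance the position.
    public static void copyToDirect(byte[] src, int offset, int len, ByteBuffer dst) {
        long dstAddress = UNSAFE.getLong(dst, ADDRESS_OFFSET) + dst.position();
        UNSAFE.copyMemory(src, Unsafe.ARRAY_BYTE_BASE_OFFSET + offset, null, dstAddress, len);
        dst.position(dst.position() + len);
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocateDirect(16);
        copyToDirect(new byte[] {1, 2, 3, 4}, 0, 4, bb);
        bb.flip();
        for (byte expected = 1; expected <= 4; expected++) {
            if (bb.get() != expected) throw new AssertionError("copy mismatch");
        }
    }
}
```

The bulk copy amortizes the bounds checking that a loop of ByteBuffer puts would repeat per call, which is what the attached benchmark presumably measures.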
[jira] [Commented] (HBASE-12298) Support BB usage in PrefixTree
[ https://issues.apache.org/jira/browse/HBASE-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614782#comment-14614782 ] Hadoop QA commented on HBASE-12298: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12743678/HBASE-12298.patch against master branch at commit 7d3456d8fd027e252b1da7578e943f146626135d. ATTACHMENT ID: 12743678 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 12 new or modified tests. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0) {color:red}-1 javac{color}. The applied patch generated 17 javac compiler warnings (more than the master's current 16 warnings). {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:red}-1 checkstyle{color}. The applied patch generated 1901 checkstyle errors (more than the master's current 1898 errors). {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn post-site goal succeeds with this patch. {color:red}-1 core tests{color}. 
The patch failed these unit tests: Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14671//testReport/ Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14671//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14671//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14671//console This message is automatically generated. Support BB usage in PrefixTree -- Key: HBASE-12298 URL: https://issues.apache.org/jira/browse/HBASE-12298 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12298.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12213) HFileBlock backed by Array of ByteBuffers
[ https://issues.apache.org/jira/browse/HBASE-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614809#comment-14614809 ] Anoop Sam John commented on HBASE-12213: Please add the patch in RB. HFileBlock backed by Array of ByteBuffers - Key: HBASE-12213 URL: https://issues.apache.org/jira/browse/HBASE-12213 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12213_1.patch, HBASE-12213_2.patch, HBASE-12213_4.patch In the L2 cache (offheap) an HFile block might have been cached into multiple chunks of buffers. If HFileBlock needs a single BB, we end up recreating a bigger BB and copying. Instead we can make HFileBlock serve data from an array of BBs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
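The "array of BBs" idea can be sketched minimally as follows. This is an illustrative class, not the patch's actual implementation: it answers absolute reads over several ByteBuffer chunks without first copying them into one big buffer.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: a read-only view over an array of ByteBuffer chunks.
public class MultiBufferView {
    private final ByteBuffer[] items;
    private final int[] itemBeginPos; // itemBeginPos[i] = logical offset where items[i] starts

    public MultiBufferView(ByteBuffer[] items) {
        this.items = items;
        this.itemBeginPos = new int[items.length + 1];
        for (int i = 0; i < items.length; i++) {
            itemBeginPos[i + 1] = itemBeginPos[i] + items[i].remaining();
        }
    }

    public int capacity() {
        return itemBeginPos[items.length];
    }

    // Absolute get: locate the chunk holding 'index', then read inside it.
    public byte get(int index) {
        if (index < 0 || index >= capacity()) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        int i = 0;
        while (index >= itemBeginPos[i + 1]) {
            i++; // linear scan is fine for the handful of chunks a block spans
        }
        ByteBuffer item = items[i];
        return item.get(item.position() + (index - itemBeginPos[i]));
    }

    public static void main(String[] args) {
        MultiBufferView view = new MultiBufferView(new ByteBuffer[] {
            ByteBuffer.wrap(new byte[] {1, 2}),
            ByteBuffer.wrap(new byte[] {3, 4, 5})
        });
        if (view.capacity() != 5 || view.get(2) != 3) throw new AssertionError();
    }
}
```

The point of the design is that a cached block split across L2-cache chunks never has to be reassembled; only reads that straddle a chunk boundary need any special handling.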
[jira] [Updated] (HBASE-12213) HFileBlock backed by Array of ByteBuffers
[ https://issues.apache.org/jira/browse/HBASE-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-12213: --- Status: Patch Available (was: Open) HFileBlock backed by Array of ByteBuffers - Key: HBASE-12213 URL: https://issues.apache.org/jira/browse/HBASE-12213 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12213_1.patch, HBASE-12213_2.patch, HBASE-12213_4.patch In the L2 cache (offheap) an HFile block might have been cached into multiple chunks of buffers. If HFileBlock needs a single BB, we end up recreating a bigger BB and copying. Instead we can make HFileBlock serve data from an array of BBs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12213) HFileBlock backed by Array of ByteBuffers
[ https://issues.apache.org/jira/browse/HBASE-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-12213: --- Attachment: HBASE-12213_4.patch Updated patch correcting the javadoc and checkstyle comments. HFileBlock backed by Array of ByteBuffers - Key: HBASE-12213 URL: https://issues.apache.org/jira/browse/HBASE-12213 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12213_1.patch, HBASE-12213_2.patch, HBASE-12213_4.patch In the L2 cache (offheap) an HFile block might have been cached into multiple chunks of buffers. If HFileBlock needs a single BB, we end up recreating a bigger BB and copying. Instead we can make HFileBlock serve data from an array of BBs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12213) HFileBlock backed by Array of ByteBuffers
[ https://issues.apache.org/jira/browse/HBASE-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-12213: --- Status: Open (was: Patch Available) HFileBlock backed by Array of ByteBuffers - Key: HBASE-12213 URL: https://issues.apache.org/jira/browse/HBASE-12213 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12213_1.patch, HBASE-12213_2.patch In the L2 cache (offheap) an HFile block might have been cached into multiple chunks of buffers. If HFileBlock needs a single BB, we end up recreating a bigger BB and copying. Instead we can make HFileBlock serve data from an array of BBs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12298) Support BB usage in PrefixTree
[ https://issues.apache.org/jira/browse/HBASE-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-12298: --- Status: Open (was: Patch Available) Support BB usage in PrefixTree -- Key: HBASE-12298 URL: https://issues.apache.org/jira/browse/HBASE-12298 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12298.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12298) Support BB usage in PrefixTree
[ https://issues.apache.org/jira/browse/HBASE-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-12298: --- Attachment: HBASE-12298_1.patch Updated patch removing the unused imports. Support BB usage in PrefixTree -- Key: HBASE-12298 URL: https://issues.apache.org/jira/browse/HBASE-12298 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12298.patch, HBASE-12298_1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12298) Support BB usage in PrefixTree
[ https://issues.apache.org/jira/browse/HBASE-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-12298: --- Status: Patch Available (was: Open) Support BB usage in PrefixTree -- Key: HBASE-12298 URL: https://issues.apache.org/jira/browse/HBASE-12298 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12298.patch, HBASE-12298_1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-8642) [Snapshot] List and delete snapshot by table
[ https://issues.apache.org/jira/browse/HBASE-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashish Singhi updated HBASE-8642: - Attachment: HBASE-8642-0.98.patch Patch for 0.98 branch. [Snapshot] List and delete snapshot by table Key: HBASE-8642 URL: https://issues.apache.org/jira/browse/HBASE-8642 Project: HBase Issue Type: Improvement Components: snapshots Affects Versions: 0.98.0, 0.95.0, 0.95.1, 0.95.2 Reporter: Julian Zhou Assignee: Ashish Singhi Fix For: 2.0.0 Attachments: 8642-trunk-0.95-v0.patch, 8642-trunk-0.95-v1.patch, 8642-trunk-0.95-v2.patch, HBASE-8642-0.98.patch, HBASE-8642-v1.patch, HBASE-8642-v2.patch, HBASE-8642-v3.patch, HBASE-8642-v4.patch, HBASE-8642.patch Support listing and deleting snapshots by table name. User scenario: A user wants to delete all the snapshots taken in January for a table 't', where the snapshot names start with 'Jan'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13387) Add ByteBufferedCell an extension to Cell
[ https://issues.apache.org/jira/browse/HBASE-13387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anoop Sam John updated HBASE-13387: --- Fix Version/s: 2.0.0 Status: Patch Available (was: Open) Add ByteBufferedCell an extension to Cell - Key: HBASE-13387 URL: https://issues.apache.org/jira/browse/HBASE-13387 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: ByteBufferedCell.docx, HBASE-13387_v1.patch, WIP_HBASE-13387_V2.patch, WIP_ServerCell.patch, benchmark.zip This came up during the discussion on the parent Jira, and recently Stack added it as a comment on the E2E patch on the parent Jira. The idea is to add a new interface 'ByteBufferedCell' in which we can add new buffer based getter APIs and getters for the position of the components in the BB. We will keep this interface @InterfaceAudience.Private. When the Cell is backed by a DBB, we can create an object implementing this new interface. The comparators have to be aware of this new Cell extension and have to use the BB based APIs rather than getXXXArray(). Also provide util APIs in CellUtil to abstract the checks for the new Cell type (like matchingXXX APIs, getValueAsType APIs etc). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13387) Add ByteBufferedCell an extension to Cell
[ https://issues.apache.org/jira/browse/HBASE-13387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anoop Sam John updated HBASE-13387: --- Attachment: HBASE-13387_v1.patch Added ByteBufferedCell, an extension to Cell at the server side. Added the instance based check and usage of the proper APIs in CellComparator and CellUtil. Refactored some other core code to make use of the CellUtil/CellComparator APIs. The instance based checks are limited to these 2 classes. Some other parts of the code still use the getXXXArray() APIs without any checks. Will correct them with follow-on tasks. The areas are mainly 1. Filters 2. CPs 3. Tag area, as we have a byte[] backed tag impl alone 4. DBE area Add ByteBufferedCell an extension to Cell - Key: HBASE-13387 URL: https://issues.apache.org/jira/browse/HBASE-13387 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: ByteBufferedCell.docx, HBASE-13387_v1.patch, WIP_HBASE-13387_V2.patch, WIP_ServerCell.patch, benchmark.zip This came up during the discussion on the parent Jira, and recently Stack added it as a comment on the E2E patch on the parent Jira. The idea is to add a new interface 'ByteBufferedCell' in which we can add new buffer based getter APIs and getters for the position of the components in the BB. We will keep this interface @InterfaceAudience.Private. When the Cell is backed by a DBB, we can create an object implementing this new interface. The comparators have to be aware of this new Cell extension and have to use the BB based APIs rather than getXXXArray(). Also provide util APIs in CellUtil to abstract the checks for the new Cell type (like matchingXXX APIs, getValueAsType APIs etc). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
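The interface shape described above can be sketched as follows. The names (ByteBufferedCell, getValuePosition, the instance check) follow the Jira description rather than the committed HBase API, and the Cell interface is trimmed to the value accessors for illustration:

```java
import java.nio.ByteBuffer;

// Trimmed, illustrative Cell: only the value accessors relevant to the sketch.
interface Cell {
    byte[] getValueArray();
    int getValueOffset();
    int getValueLength();
}

// Cells backed by a (possibly off-heap) ByteBuffer expose BB based getters plus
// the position of the component inside the buffer.
interface ByteBufferedCell extends Cell {
    ByteBuffer getValueByteBuffer();
    int getValuePosition();
}

public class ByteBufferedCellSketch {
    // Comparator-style access: take the BB path when the cell is buffer backed,
    // so off-heap (DBB backed) cells never materialize a byte[] copy.
    public static byte firstValueByte(Cell c) {
        if (c instanceof ByteBufferedCell) {
            ByteBufferedCell bbc = (ByteBufferedCell) c;
            return bbc.getValueByteBuffer().get(bbc.getValuePosition());
        }
        return c.getValueArray()[c.getValueOffset()];
    }

    public static void main(String[] args) {
        Cell arrayCell = new Cell() {
            public byte[] getValueArray() { return new byte[] {42}; }
            public int getValueOffset() { return 0; }
            public int getValueLength() { return 1; }
        };
        if (firstValueByte(arrayCell) != 42) throw new AssertionError();
    }
}
```

Keeping the instance checks inside CellComparator/CellUtil, as the comment describes, means the rest of the code base calls one utility method and stays agnostic to which backing the cell has.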
[jira] [Updated] (HBASE-8642) [Snapshot] List and delete snapshot by table
[ https://issues.apache.org/jira/browse/HBASE-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashish Singhi updated HBASE-8642: - Fix Version/s: 1.3.0 0.98.14 [Snapshot] List and delete snapshot by table Key: HBASE-8642 URL: https://issues.apache.org/jira/browse/HBASE-8642 Project: HBase Issue Type: Improvement Components: snapshots Affects Versions: 0.98.0, 0.95.0, 0.95.1, 0.95.2 Reporter: Julian Zhou Assignee: Ashish Singhi Fix For: 2.0.0, 0.98.14, 1.3.0 Attachments: 8642-trunk-0.95-v0.patch, 8642-trunk-0.95-v1.patch, 8642-trunk-0.95-v2.patch, HBASE-8642-0.98.patch, HBASE-8642-v1.patch, HBASE-8642-v2.patch, HBASE-8642-v3.patch, HBASE-8642-v4.patch, HBASE-8642.patch Support listing and deleting snapshots by table name. User scenario: A user wants to delete all the snapshots taken in January for a table 't', where the snapshot names start with 'Jan'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13387) Add ByteBufferedCell an extension to Cell
[ https://issues.apache.org/jira/browse/HBASE-13387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614937#comment-14614937 ] Hadoop QA commented on HBASE-13387: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12743694/HBASE-13387_v1.patch against master branch at commit 7d3456d8fd027e252b1da7578e943f146626135d. ATTACHMENT ID: 12743694 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified tests. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0) {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 1 warning messages. {color:red}-1 checkstyle{color}. The applied patch generated 1899 checkstyle errors (more than the master's current 1898 errors). {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn post-site goal succeeds with this patch. {color:red}-1 core tests{color}. 
The patch failed these unit tests: Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14672//testReport/ Release Findbugs (version 2.0.3)warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14672//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14672//artifact/patchprocess/checkstyle-aggregate.html Javadoc warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14672//artifact/patchprocess/patchJavadocWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14672//console This message is automatically generated. Add ByteBufferedCell an extension to Cell - Key: HBASE-13387 URL: https://issues.apache.org/jira/browse/HBASE-13387 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: ByteBufferedCell.docx, HBASE-13387_v1.patch, WIP_HBASE-13387_V2.patch, WIP_ServerCell.patch, benchmark.zip This came in btw the discussion abt the parent Jira and recently Stack added as a comment on the E2E patch on the parent Jira. The idea is to add a new Interface 'ByteBufferedCell' in which we can add new buffer based getter APIs and getters for position in components in BB. We will keep this interface @InterfaceAudience.Private. When the Cell is backed by a DBB, we can create an Object implementing this new interface. The Comparators has to be aware abt this new Cell extension and has to use the BB based APIs rather than getXXXArray(). Also give util APIs in CellUtil to abstract the checks for new Cell type. (Like matchingXXX APIs, getValueAstype APIs etc) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12298) Support BB usage in PrefixTree
[ https://issues.apache.org/jira/browse/HBASE-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614959#comment-14614959 ] Hadoop QA commented on HBASE-12298: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12743696/HBASE-12298_1.patch against master branch at commit 7d3456d8fd027e252b1da7578e943f146626135d. ATTACHMENT ID: 12743696 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 12 new or modified tests. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0) {color:red}-1 javac{color}. The applied patch generated 17 javac compiler warnings (more than the master's current 16 warnings). {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn post-site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14674//testReport/ Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14674//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14674//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14674//console This message is automatically generated. Support BB usage in PrefixTree -- Key: HBASE-12298 URL: https://issues.apache.org/jira/browse/HBASE-12298 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12298.patch, HBASE-12298_1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614971#comment-14614971 ] ramkrishna.s.vasudevan commented on HBASE-13337: +1 on it. Someone needs to take a look at this. Table regions are not assigning back, after restarting all regionservers at once. - Key: HBASE-13337 URL: https://issues.apache.org/jira/browse/HBASE-13337 Project: HBase Issue Type: Bug Components: Region Assignment Affects Versions: 2.0.0 Reporter: Y. SREENIVASULU REDDY Assignee: Samir Ahmic Priority: Blocker Fix For: 2.0.0 Attachments: HBASE-13337-v2.patch, HBASE-13337-v3.patch, HBASE-13337.patch Regions of the table are continuously in state=FAILED_CLOSE.
{noformat}
RegionState RIT time (ms)
8f62e819b356736053e06240f7f7c6fd t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM1,16040,1427362531818 113929
caf59209ae65ea80fca6bdc6996a7d68 t1,,1427362431330.caf59209ae65ea80fca6bdc6996a7d68. state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM2,16040,1427362533691 113929
db52a74988f71e5cf257bbabf31f26f3 t1,,1427362431330.db52a74988f71e5cf257bbabf31f26f3. state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM3,16040,1427362533691 113920
43f3a65b9f9ff283f598c5450feab1f8 t1,,1427362431330.43f3a65b9f9ff283f598c5450feab1f8. state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM1,16040,1427362531818 113920
{noformat}
*Steps to reproduce:*
1. Start an HBase cluster with more than one regionserver.
2. Create a table with precreated regions (let's say 15 regions).
3. Make sure the regions are well balanced.
4. Restart all the regionserver processes at once across the cluster, leaving the HMaster process running.
5. After restarting, the regionservers successfully connect to the HMaster.
*Bug:* No regions are assigned back to the regionservers.
*Master log shows the following:*
{noformat}
2015-03-26 15:05:36,201 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStates: Transition {8f62e819b356736053e06240f7f7c6fd state=OFFLINE, ts=1427362536106, server=VM2,16040,1427362242602} to {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, server=VM1,16040,1427362531818}
2015-03-26 15:05:36,202 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStateStore: Updating row t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. with state=PENDING_OPEN&sn=VM1,16040,1427362531818
2015-03-26 15:05:36,244 DEBUG [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Force region state offline {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, server=VM1,16040,1427362531818}
2015-03-26 15:05:36,244 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStates: Transition {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, server=VM1,16040,1427362531818} to {8f62e819b356736053e06240f7f7c6fd state=PENDING_CLOSE, ts=1427362536244, server=VM1,16040,1427362531818}
2015-03-26 15:05:36,244 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStateStore: Updating row t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. with state=PENDING_CLOSE
2015-03-26 15:05:36,248 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=1 of 10
2015-03-26 15:05:36,248 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=2 of 10
2015-03-26 15:05:36,249 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=3 of 10
2015-03-26 15:05:36,249 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=4 of 10
{noformat}
[jira] [Commented] (HBASE-12213) HFileBlock backed by Array of ByteBuffers
[ https://issues.apache.org/jira/browse/HBASE-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614969#comment-14614969 ] Hadoop QA commented on HBASE-12213: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12743695/HBASE-12213_4.patch against master branch at commit 7d3456d8fd027e252b1da7578e943f146626135d. ATTACHMENT ID: 12743695 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 51 new or modified tests. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0) {color:red}-1 javac{color}. The applied patch generated 20 javac compiler warnings (more than the master's current 16 warnings). {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn post-site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14673//testReport/ Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14673//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14673//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14673//console This message is automatically generated. HFileBlock backed by Array of ByteBuffers - Key: HBASE-12213 URL: https://issues.apache.org/jira/browse/HBASE-12213 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12213_1.patch, HBASE-12213_2.patch, HBASE-12213_4.patch In the L2 cache (offheap) an HFile block might have been cached into multiple chunks of buffers. If HFileBlock needs a single BB, we end up recreating a bigger BB and copying. Instead we can make HFileBlock serve data from an array of BBs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12848) Utilize Flash storage for WAL
[ https://issues.apache.org/jira/browse/HBASE-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614983#comment-14614983 ] Anoop Sam John commented on HBASE-12848: We will be doing the archiving of the WAL files later. Ideally there is no need to keep the archived WALs on flash storage; that would be a waste of the resource. The archive op is done by a file rename. Had some discussion with [~tedyu] and [~jingcheng...@intel.com]. HDFS won't move the file content across different block volumes on rename. Only the Mover tool can do this, and it has to be invoked explicitly. We have to solve this in some way. Later on, if we try to place some HFiles also (depending on some stats) on SSDs, this might become more critical. Utilize Flash storage for WAL - Key: HBASE-12848 URL: https://issues.apache.org/jira/browse/HBASE-12848 Project: HBase Issue Type: Sub-task Reporter: Ted Yu Assignee: Ted Yu Fix For: 2.0.0, 1.1.0 Attachments: 12848-v1.patch, 12848-v2.patch, 12848-v3.patch, 12848-v4.patch, 12848-v4.patch One way to improve the data ingestion rate is to make use of flash storage. HDFS does the heavy lifting - see HDFS-7228. We assume an environment where: 1. Some servers have a mix of flash, e.g. 2 flash drives and 4 traditional drives. 2. Some servers have all traditional storage. 3. RegionServers are deployed on both profiles within one HBase cluster. This JIRA allows the WAL to be managed on flash in a mixed-profile environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HBASE-13792) Regionserver unable to report to master when master is restarted
[ https://issues.apache.org/jira/browse/HBASE-13792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anoop Sam John resolved HBASE-13792. Resolution: Duplicate Assignee: (was: Samir Ahmic) Closing as a duplicate of HBASE-13337. Regionserver unable to report to master when master is restarted Key: HBASE-13792 URL: https://issues.apache.org/jira/browse/HBASE-13792 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 2.0.0 Environment: x86_64 GNU/Linux Reporter: Samir Ahmic Priority: Critical Fix For: 2.0.0 I was testing the master branch on a distributed cluster and noticed that when the master is restarted on a running cluster, the regionservers are unable to report back once the master is up again. Things went back to normal after I restarted the regionservers. The logs show that the regionservers correctly detect the master znode. After some digging I noticed that we changed the client implementation in RpcClientFactory to AsyncRpcClient, so I tried running the cluster with the previous RpcClientImpl and the issue was gone. So the issue is probably caused by AsyncRpcClient, which is unable to reconnect to the master once the original connection is gone.
I was able to fix the issue by creating a new rpcClient object inside HRegionServer#createRegionServerStatusStub() and using it for channel creation; here is the diff:
{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
index fa56966..27e658c 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
@@ -2219,8 +2219,11 @@ public class HRegionServer extends HasThread implements
         break;
       }
       try {
+        LOG.info("***Creating new client connection");
+        rpcClient = RpcClientFactory.createClient(conf, clusterId, new InetSocketAddress(
+            rpcServices.isa.getAddress(), 0));
         BlockingRpcChannel channel =
-            this.rpcClient.createBlockingRpcChannel(sn, userProvider.getCurrent(),
+            rpcClient.createBlockingRpcChannel(sn, userProvider.getCurrent(),
                 shortOperationTimeout);
         intf = RegionServerStatusService.newBlockingStub(channel);
         break;
{code}
If this is an acceptable way to fix this issue, I will create and attach a patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12213) HFileBlock backed by Array of ByteBuffers
[ https://issues.apache.org/jira/browse/HBASE-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-12213: --- Attachment: HBASE-12213_jmh.zip Adding some JMH files that imitate the blockSeek behaviour in terms of mark, reset and skipping to read till the end of the block. HFileBlock backed by Array of ByteBuffers - Key: HBASE-12213 URL: https://issues.apache.org/jira/browse/HBASE-12213 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12213_1.patch, HBASE-12213_2.patch, HBASE-12213_4.patch, HBASE-12213_jmh.zip In L2 cache (offheap) an HFile block might have been cached into multiple chunks of buffers. If HFileBlock need single BB, we will end up in recreation of bigger BB and copying. Instead we can make HFileBlock to serve data from an array of BBs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
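For context, the mark/reset/skip pattern that blockSeek-style reading relies on is plain java.nio.ByteBuffer API. A minimal sketch, using a made-up cell layout (int key length, int value length, then the bytes; not the real HFile encoding):

```java
import java.nio.ByteBuffer;

public class BlockSeekSketch {
    /**
     * Walks key/value pairs laid out back to back in a buffer, the way a
     * blockSeek skips cell after cell until the block is exhausted.
     * mark()/reset() let us scan forward and then restore the start position.
     */
    static int countCells(ByteBuffer buf) {
        int cells = 0;
        buf.mark();                       // remember where we started
        while (buf.remaining() > 0) {
            int keyLen = buf.getInt();
            int valLen = buf.getInt();
            buf.position(buf.position() + keyLen + valLen); // skip the cell body
            cells++;
        }
        buf.reset();                      // rewind to the mark for the caller
        return cells;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        for (int i = 0; i < 3; i++) {     // write 3 toy cells of 2+2 payload bytes
            buf.putInt(2).putInt(2).put(new byte[] {1, 2, 3, 4});
        }
        buf.flip();
        System.out.println(countCells(buf)); // 3
        System.out.println(buf.position());  // 0, reset() restored the mark
    }
}
```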
[jira] [Commented] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614986#comment-14614986 ] Anoop Sam John commented on HBASE-13337: +1 Table regions are not assigning back, after restarting all regionservers at once. - Key: HBASE-13337 URL: https://issues.apache.org/jira/browse/HBASE-13337 Project: HBase Issue Type: Bug Components: Region Assignment Affects Versions: 2.0.0 Reporter: Y. SREENIVASULU REDDY Assignee: Samir Ahmic Priority: Blocker Fix For: 2.0.0 Attachments: HBASE-13337-v2.patch, HBASE-13337-v3.patch, HBASE-13337.patch Regions of the table are continuously in state=FAILED_CLOSE.
{noformat}
RegionState RIT time (ms)
8f62e819b356736053e06240f7f7c6fd t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM1,16040,1427362531818 113929
caf59209ae65ea80fca6bdc6996a7d68 t1,,1427362431330.caf59209ae65ea80fca6bdc6996a7d68. state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM2,16040,1427362533691 113929
db52a74988f71e5cf257bbabf31f26f3 t1,,1427362431330.db52a74988f71e5cf257bbabf31f26f3. state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM3,16040,1427362533691 113920
43f3a65b9f9ff283f598c5450feab1f8 t1,,1427362431330.43f3a65b9f9ff283f598c5450feab1f8. state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM1,16040,1427362531818 113920
{noformat}
*Steps to reproduce:*
1. Start an HBase cluster with more than one regionserver.
2. Create a table with precreated regions (let's say 15 regions).
3. Make sure the regions are well balanced.
4. Restart all the regionserver processes at once across the cluster, except the HMaster process.
5. After restarting, the regionservers successfully connect to the HMaster.
*Bug:* But no regions are assigned back to the regionservers.
*Master log shows as follows:*
{noformat}
2015-03-26 15:05:36,201 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStates: Transition {8f62e819b356736053e06240f7f7c6fd state=OFFLINE, ts=1427362536106, server=VM2,16040,1427362242602} to {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, server=VM1,16040,1427362531818}
2015-03-26 15:05:36,202 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStateStore: Updating row t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. with state=PENDING_OPEN sn=VM1,16040,1427362531818
2015-03-26 15:05:36,244 DEBUG [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Force region state offline {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, server=VM1,16040,1427362531818}
2015-03-26 15:05:36,244 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStates: Transition {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, server=VM1,16040,1427362531818} to {8f62e819b356736053e06240f7f7c6fd state=PENDING_CLOSE, ts=1427362536244, server=VM1,16040,1427362531818}
2015-03-26 15:05:36,244 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStateStore: Updating row t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. with state=PENDING_CLOSE
2015-03-26 15:05:36,248 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=1 of 10
2015-03-26 15:05:36,248 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=2 of 10
2015-03-26 15:05:36,249 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=3 of 10
2015-03-26 15:05:36,249 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=4 of 10
2015-03-26 15:05:36,249 INFO
{noformat}
[jira] [Commented] (HBASE-13849) Remove restore and clone snapshot from the WebUI
[ https://issues.apache.org/jira/browse/HBASE-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615538#comment-14615538 ] Hudson commented on HBASE-13849: FAILURE: Integrated in HBase-1.3 #37 (See [https://builds.apache.org/job/HBase-1.3/37/]) HBASE-13849 Remove restore and clone snapshot from the WebUI (busbey: rev d347c66c90dc727ac9fc059eba0db4064ee818b5) * hbase-server/src/main/resources/hbase-webapps/master/snapshot.jsp Remove restore and clone snapshot from the WebUI Key: HBASE-13849 URL: https://issues.apache.org/jira/browse/HBASE-13849 Project: HBase Issue Type: Bug Components: snapshots Affects Versions: 1.0.1, 1.1.0, 0.98.13, 1.2.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13849-v0.patch Remove the clone and restore snapshot buttons from the WebUI. The first reason is that the operation may take too long to have the user wait on the WebUI. The second reason is that an action from the WebUI does not play well with security, since it is going to be executed by the hbase user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low
[ https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615561#comment-14615561 ] Hadoop QA commented on HBASE-13832: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12743753/HBASE-13832-v5.patch against master branch at commit 608c3aa15c34b9014f99e857b374645db58cbbe3. ATTACHMENT ID: 12743753 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 12 new or modified tests. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0) {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn post-site goal succeeds with this patch. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.master.procedure.TestWALProcedureStoreOnHDFS {color:red}-1 core zombie tests{color}. 
There are 5 zombie test(s): at org.apache.hadoop.hbase.wal.TestWALSplit.testThreadingSlowWriterSmallBuffer(TestWALSplit.java:902) at org.apache.hadoop.hbase.wal.TestWALSplit.testSplitDeletedRegion(TestWALSplit.java:722) Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14679//testReport/ Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14679//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14679//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14679//console This message is automatically generated. Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low --- Key: HBASE-13832 URL: https://issues.apache.org/jira/browse/HBASE-13832 Project: HBase Issue Type: Sub-task Components: master, proc-v2 Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Stephen Yuan Jiang Assignee: Matteo Bertozzi Priority: Critical Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, hbase-13832-v3.patch When the data node count is 3, we got a failure in WALProcedureStore#syncLoop() during master start. The failure prevents the master from getting started. {noformat} 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Sync slot failed, abort. java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. 
(Nodes: current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]], original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration. at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951) {noformat} One proposal is to implement logic similar to FSHLog's: if an IOException is thrown during syncLoop in WALProcedureStore#start(), instead of aborting immediately, we could try to roll the log and see whether this resolves the issue; if the new log cannot be created, or rolling the log throws more exceptions, we then abort. -- This message was sent by Atlassian JIRA
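The proposal in the last paragraph can be sketched as a tiny decision routine. The Wal interface below is a stand-in, not the real WALProcedureStore API: sync the current log; on failure, roll to a fresh log and retry; abort only if the rolled log fails too.

```java
public class SyncRollSketch {
    /** Stand-in for something whose sync can hit a bad HDFS pipeline. */
    interface Wal {
        void sync() throws java.io.IOException;
    }

    /**
     * Proposed behaviour: on a sync failure, roll to a fresh log and retry
     * instead of aborting immediately; abort only if the roll did not help.
     */
    static boolean syncWithRoll(Wal current, Wal rolled) {
        try {
            current.sync();
            return true;
        } catch (java.io.IOException first) {
            try {
                rolled.sync();   // new log, new pipeline: may dodge the bad datanode
                return true;
            } catch (java.io.IOException second) {
                return false;    // rolling did not help either: abort
            }
        }
    }

    public static void main(String[] args) {
        Wal bad = () -> { throw new java.io.IOException("bad pipeline"); };
        Wal good = () -> { };
        System.out.println(syncWithRoll(bad, good));  // true: the roll saved us
        System.out.println(syncWithRoll(bad, bad));   // false: abort
    }
}
```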
[jira] [Commented] (HBASE-13646) HRegion#execService should not try to build incomplete messages
[ https://issues.apache.org/jira/browse/HBASE-13646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615595#comment-14615595 ] Hudson commented on HBASE-13646: FAILURE: Integrated in HBase-0.98 #1047 (See [https://builds.apache.org/job/HBase-0.98/1047/]) HBASE-13646 HRegion#execService should not try to build incomplete messages (busbey: rev 2353c1bcf7debcc90e1f6d47787404c42a9d53b0) * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorEndpoint.java * hbase-server/src/test/protobuf/DummyRegionServerEndpoint.proto * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/protobuf/generated/DummyRegionServerEndpointProtos.java * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestCoprocessorEndpoint.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ResponseConverter.java * hbase-server/pom.xml HRegion#execService should not try to build incomplete messages --- Key: HBASE-13646 URL: https://issues.apache.org/jira/browse/HBASE-13646 Project: HBase Issue Type: Bug Components: Coprocessors, regionserver Affects Versions: 2.0.0, 1.2.0, 1.1.1 Reporter: Andrey Stepachev Assignee: Andrey Stepachev Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13646-branch-1.patch, HBASE-13646.patch, HBASE-13646.v2.patch, HBASE-13646.v2.patch If some RPC service called on a region throws an exception, execService still tries to build a Message. In the case of complex messages with required fields, that complicates service code because the service needs to pass fake protobuf objects so that they are barely buildable. To mitigate that, I propose to check whether the controller has failed and return null from the call instead of failing with an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
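The proposed pattern, sketched with toy stand-ins for the controller and the response message (none of these are the real HBase or protobuf types): check the controller before attempting to build, and return null so no half-built message with unset required fields is ever constructed.

```java
public class ExecServiceSketch {
    /** Minimal stand-in for an RPC controller that carries a failure flag. */
    static class Controller {
        String error;
        boolean failed() { return error != null; }
    }

    /** Stand-in for a message with a required field: building without it throws. */
    static class Response {
        final String payload;
        Response(String payload) {
            if (payload == null) throw new IllegalStateException("required field missing");
            this.payload = payload;
        }
    }

    /**
     * The pattern the fix proposes: if the service call set an error on the
     * controller, return null instead of forcing an incomplete message
     * through its builder.
     */
    static Response execService(Controller controller, String result) {
        if (controller.failed()) {
            return null;          // the error is already carried by the controller
        }
        return new Response(result);
    }

    public static void main(String[] args) {
        Controller ok = new Controller();
        System.out.println(execService(ok, "data").payload); // data
        Controller bad = new Controller();
        bad.error = "service threw";
        System.out.println(execService(bad, null));          // null, no build attempted
    }
}
```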
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615625#comment-14615625 ] Hadoop QA commented on HBASE-14017: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12743786/HBASE-14017.as-pushed-master.patch against master branch at commit 1713f1fcaf9d721a97bc564faaf070f2e6b0b1d1. ATTACHMENT ID: 12743786 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14681//console This message is automatically generated. Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue delete where we don't have an exclusive lock before deleting the table
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: delete the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
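The fix's core idea can be sketched as a toy queue manager (hypothetical names, far simpler than the real MasterProcedureQueue): the "empty and unlocked" check and the queue removal happen under the same monitor as lock acquisition, so the interleaving above cannot occur.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class TableQueueSketch {
    static class Queue {
        final Deque<String> procs = new ArrayDeque<>();
        boolean wlock;
    }

    private final Map<String, Queue> queues = new HashMap<>();

    /** Lock acquisition and deletion are serialized on the same monitor,
     *  so "empty && !wlock" and the delete form one atomic step. */
    synchronized boolean tryExclusiveLock(String table) {
        Queue q = queues.computeIfAbsent(table, t -> new Queue());
        if (q.wlock) return false;
        q.wlock = true;
        return true;
    }

    synchronized void releaseExclusiveLock(String table) {
        Queue q = queues.get(table);
        if (q != null) q.wlock = false;   // the queue survived, so no NPE here
    }

    synchronized boolean markTableAsDeleted(String table) {
        Queue q = queues.get(table);
        if (q == null) return true;
        if (q.procs.isEmpty() && !q.wlock) {  // checked under the same lock as tryExclusiveLock
            queues.remove(table);
            return true;
        }
        return false;                          // still locked or still has procedures
    }

    public static void main(String[] args) {
        TableQueueSketch m = new TableQueueSketch();
        m.tryExclusiveLock("t1");
        System.out.println(m.markTableAsDeleted("t1")); // false: lock held, not deleted
        m.releaseExclusiveLock("t1");
        System.out.println(m.markTableAsDeleted("t1")); // true: safe to delete now
    }
}
```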
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615658#comment-14615658 ] Sean Busbey commented on HBASE-14017: - I've been pushing the changes [~mbertozzi] had ready to go in a local repo before the ASF git outage started. I just got to branch-1.1. I'll check to see if it's different. Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue delete where we don't have an exclusive lock before deleting the table
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: delete the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615661#comment-14615661 ] Hudson commented on HBASE-14017: SUCCESS: Integrated in HBase-1.3-IT #22 (See [https://builds.apache.org/job/HBase-1.3-IT/22/]) HBASE-14017 Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion (busbey: rev 80b0a3e914c8f7b2600de93a27cc5d050d36ebf7) * hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureQueue.java * hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterProcedureQueue.java Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue delete where we don't have an exclusive lock before deleting the table
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: delete the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13927) Allow hbase-daemon.sh to conditionally redirect the log or not
[ https://issues.apache.org/jira/browse/HBASE-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615660#comment-14615660 ] Hudson commented on HBASE-13927: SUCCESS: Integrated in HBase-1.3-IT #22 (See [https://builds.apache.org/job/HBase-1.3-IT/22/]) HBASE-13927 Allow hbase-daemon.sh to conditionally redirect the log or not (busbey: rev e28094fe4d1569a67131322e59090a0196889d05) * bin/hbase-daemon.sh Allow hbase-daemon.sh to conditionally redirect the log or not -- Key: HBASE-13927 URL: https://issues.apache.org/jira/browse/HBASE-13927 Project: HBase Issue Type: Improvement Components: shell Affects Versions: 2.0.0, 1.2.0 Reporter: Elliott Clark Assignee: Elliott Clark Labels: shell Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13927.patch, HBASE-13927.patch Kind of like HBASE_NOEXEC allow hbase-daemon to skip redirecting to a log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated HBASE-14017: Resolution: Fixed Status: Resolved (was: Patch Available) Pushed to branch-1.1+. branch-1 and branch-1.2 were both equivalent to the as-pushed-to-master patch. branch-1.1 was equivalent to the posted v1-branch-1 patch. Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue delete where we don't have an exclusive lock before deleting the table
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: delete the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Yuan Jiang updated HBASE-14017: --- Attachment: (was: HBASE-14017.v1-branch1.1.patch) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue delete where we don't have an exclusive lock before deleting the table
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: delete the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Yuan Jiang updated HBASE-14017: --- Attachment: HBASE-14017.v1-branch1.1.patch Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue delete where we don't have an exclusive lock before deleting the table
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: delete the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13927) Allow hbase-daemon.sh to conditionally redirect the log or not
[ https://issues.apache.org/jira/browse/HBASE-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615608#comment-14615608 ] Hudson commented on HBASE-13927: FAILURE: Integrated in HBase-TRUNK #6631 (See [https://builds.apache.org/job/HBase-TRUNK/6631/]) HBASE-13927 Allow hbase-daemon.sh to conditionally redirect the log or not (busbey: rev 608c3aa15c34b9014f99e857b374645db58cbbe3) * bin/hbase-daemon.sh Allow hbase-daemon.sh to conditionally redirect the log or not -- Key: HBASE-13927 URL: https://issues.apache.org/jira/browse/HBASE-13927 Project: HBase Issue Type: Improvement Components: shell Affects Versions: 2.0.0, 1.2.0 Reporter: Elliott Clark Assignee: Elliott Clark Labels: shell Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13927.patch, HBASE-13927.patch Kind of like HBASE_NOEXEC allow hbase-daemon to skip redirecting to a log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13849) Remove restore and clone snapshot from the WebUI
[ https://issues.apache.org/jira/browse/HBASE-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615676#comment-14615676 ] Hudson commented on HBASE-13849: FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #1001 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/1001/]) HBASE-13849 Remove restore and clone snapshot from the WebUI (busbey: rev b30d0311aade4e5c8cff23cb4a3dadc742f6e237) * hbase-server/src/main/resources/hbase-webapps/master/snapshot.jsp Remove restore and clone snapshot from the WebUI Key: HBASE-13849 URL: https://issues.apache.org/jira/browse/HBASE-13849 Project: HBase Issue Type: Bug Components: snapshots Affects Versions: 1.0.1, 1.1.0, 0.98.13, 1.2.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13849-v0.patch Remove the clone and restore snapshot buttons from the WebUI. The first reason is that the operation may take too long to have the user wait on the WebUI. The second reason is that an action from the WebUI does not play well with security, since it is going to be executed by the hbase user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12213) HFileBlock backed by Array of ByteBuffers
[ https://issues.apache.org/jira/browse/HBASE-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615267#comment-14615267 ] ramkrishna.s.vasudevan commented on HBASE-12213: https://reviews.apache.org/r/36206/ HFileBlock backed by Array of ByteBuffers - Key: HBASE-12213 URL: https://issues.apache.org/jira/browse/HBASE-12213 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12213_1.patch, HBASE-12213_2.patch, HBASE-12213_4.patch, HBASE-12213_jmh.zip In L2 cache (offheap) an HFile block might have been cached into multiple chunks of buffers. If HFileBlock need single BB, we will end up in recreation of bigger BB and copying. Instead we can make HFileBlock to serve data from an array of BBs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HBASE-12213) HFileBlock backed by Array of ByteBuffers
[ https://issues.apache.org/jira/browse/HBASE-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615267#comment-14615267 ] ramkrishna.s.vasudevan edited comment on HBASE-12213 at 7/6/15 4:22 PM: https://reviews.apache.org/r/36206/ - RB link was (Author: ram_krish): https://reviews.apache.org/r/36206/ HFileBlock backed by Array of ByteBuffers - Key: HBASE-12213 URL: https://issues.apache.org/jira/browse/HBASE-12213 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan Attachments: HBASE-12213_1.patch, HBASE-12213_2.patch, HBASE-12213_4.patch, HBASE-12213_jmh.zip In L2 cache (offheap) an HFile block might have been cached into multiple chunks of buffers. If HFileBlock need single BB, we will end up in recreation of bigger BB and copying. Instead we can make HFileBlock to serve data from an array of BBs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14012) Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover
[ https://issues.apache.org/jira/browse/HBASE-14012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615353#comment-14615353 ] Hudson commented on HBASE-14012: SUCCESS: Integrated in HBase-1.3 #36 (See [https://builds.apache.org/job/HBase-1.3/36/]) HBASE-14012 Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover (stack: rev 4e36815906e5fd175063b0700532e3dce88bcda6) * hbase-protocol/src/main/protobuf/MasterProcedure.proto * hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java * hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/MasterProcedureProtos.java * hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java * hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java * hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStateStore.java Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover Key: HBASE-14012 URL: https://issues.apache.org/jira/browse/HBASE-14012 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.3.0 Attachments: 14012.txt, 14012v2.txt (Rewrite to be more explicit about what the problem is) ITBLL. Master comes up (It is being killed every 1-5 minutes or so). It is joining a running cluster (all servers up except Master with most regions assigned out on cluster). ProcedureStore has two ServerCrashProcedures unfinished (RUNNABLE state) for two separate servers. One SCP is in the middle of the assign step when master crashes (SERVER_CRASH_ASSIGN). This SCP step has this comment on it: {code} // Assign may not be idempotent. SSH used to requeue the SSH if we got an IOE assigning // which is what we are mimicing here but it looks prone to double assignment if assign // fails midway. TODO: Test. 
{code}
This issue is 1.2+ only since it is ServerCrashProcedure (added in HBASE-13616, post hbase-1.1.x). Looking at ServerShutdownHandler, how we used to do crash processing before we moved over to the Pv2 framework, SSH may have (accidentally) avoided this issue since it does its processing in one big blob, starting over if killed mid-crash. In particular, post-crash, SSH scans hbase:meta to find the regions that were on the downed server. SCP scans meta in one step, saves off the regions it finds into the ProcedureStore, and then in the next step does the actual assign. In this case, we crashed post-meta-scan and during assign. Assign is a bulk assign. It mostly succeeded but got this:
{code}
809622 2015-06-09 20:05:28,576 INFO [ProcedureExecutorThread-9] master.GeneralBulkAssigner: Failed assigning 3 regions to server c2021.halxg.cloudera.com,16020,1433905510696, reassigning them
{code}
So, most regions actually made it to new locations except for a few stragglers. All of the successfully assigned regions are then reassigned on the other side of the master restart when we replay the SCP assign step. Let me put together the scan-meta and assign steps in SCP; this should do until we redo all of assign to run on Pv2. A few other things I noticed: in SCP, we only check for failover in the first step, not for every step, which means ServerCrashProcedure will run if on reload it is beyond the first step.
{code}
// Is master fully online? If not, yield. No processing of servers unless master is up
if (!services.getAssignmentManager().isFailoverCleanupDone()) {
  throwProcedureYieldException("Waiting on master failover to complete");
}
{code}
This means we are assigning while the Master is still coming up, a no-no (though it does not seem to have caused a problem here). Fix.
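The per-step guard suggested above can be sketched abstractly (hypothetical names; the real SCP is a Pv2 state machine, not a loop): re-evaluate the failover-cleanup guard before every step, so a procedure reloaded past step 0 still yields while the master is coming up.

```java
public class StepGuardSketch {
    /** Stand-in for AssignmentManager#isFailoverCleanupDone(). */
    interface Master {
        boolean failoverCleanupDone();
    }

    /**
     * Re-check the guard before EVERY step, not only the first, so a
     * procedure reloaded mid-way still yields until the master is online.
     */
    static int runSteps(Master master, int startStep, int totalSteps) {
        int executed = 0;
        for (int step = startStep; step < totalSteps; step++) {
            if (!master.failoverCleanupDone()) {
                return executed;      // yield: no assigning while master comes up
            }
            executed++;               // the real step logic would run here
        }
        return executed;
    }

    public static void main(String[] args) {
        // Reloaded at step 2 of 4 while failover cleanup is still pending:
        System.out.println(runSteps(() -> false, 2, 4)); // 0 steps run
        System.out.println(runSteps(() -> true, 2, 4));  // 2 steps run
    }
}
```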
Also, I see that over the 8 hours of this particular log, each time the master crashes and comes back up, we queue a ServerCrashProcedure for c2022 because an empty dir never gets cleaned up: {code} 39 2015-06-09 22:15:33,074 WARN [ProcedureExecutorThread-0] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2022.halxg.cloudera.com,16020,1433902151857-splitting {code} Fix this too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13849) Remove restore and clone snapshot from the WebUI
[ https://issues.apache.org/jira/browse/HBASE-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615395#comment-14615395 ] Hudson commented on HBASE-13849: SUCCESS: Integrated in HBase-1.3-IT #21 (See [https://builds.apache.org/job/HBase-1.3-IT/21/]) HBASE-13849 Remove restore and clone snapshot from the WebUI (busbey: rev d347c66c90dc727ac9fc059eba0db4064ee818b5) * hbase-server/src/main/resources/hbase-webapps/master/snapshot.jsp Remove restore and clone snapshot from the WebUI Key: HBASE-13849 URL: https://issues.apache.org/jira/browse/HBASE-13849 Project: HBase Issue Type: Bug Components: snapshots Affects Versions: 1.0.1, 1.1.0, 0.98.13, 1.2.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13849-v0.patch Remove the clone and restore snapshot buttons from the WebUI. The first reason is that the operation may take too long to have the user wait on the WebUI. The second reason is that an action from the WebUI does not play well with security, since it is going to be executed as the hbase user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14012) Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover
[ https://issues.apache.org/jira/browse/HBASE-14012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615397#comment-14615397 ] Hudson commented on HBASE-14012: SUCCESS: Integrated in HBase-1.3-IT #21 (See [https://builds.apache.org/job/HBase-1.3-IT/21/]) HBASE-14012 Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover (stack: rev 4e36815906e5fd175063b0700532e3dce88bcda6) * hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java * hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java * hbase-protocol/src/main/protobuf/MasterProcedure.proto * hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/MasterProcedureProtos.java * hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java * hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStateStore.java Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover Key: HBASE-14012 URL: https://issues.apache.org/jira/browse/HBASE-14012 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.3.0 Attachments: 14012.txt, 14012v2.txt (Rewrite to be more explicit about what the problem is) ITBLL. Master comes up (It is being killed every 1-5 minutes or so). It is joining a running cluster (all servers up except Master with most regions assigned out on cluster). ProcedureStore has two ServerCrashProcedures unfinished (RUNNABLE state) for two separate servers. One SCP is in the middle of the assign step when master crashes (SERVER_CRASH_ASSIGN). This SCP step has this comment on it: {code} // Assign may not be idempotent. SSH used to requeue the SSH if we got an IOE assigning // which is what we are mimicing here but it looks prone to double assignment if assign // fails midway. TODO: Test. 
{code} This issue is 1.2+ only since it involves ServerCrashProcedure (added in HBASE-13616, post hbase-1.1.x). Looking at ServerShutdownHandler, how we used to do crash processing before we moved over to the Pv2 framework, SSH may have (accidentally) avoided this issue since it does its processing in one big blob, starting over if killed mid-crash. In particular, post-crash, SSH scans hbase:meta to find the regions that were on the downed server. SCP scans hbase:meta in one step, saves off the regions it finds into the ProcedureStore, and then, in the next step, does the actual assign. In this case, we crashed post-meta-scan and during assign. Assign is a bulk assign. It mostly succeeded but got this:
{code}
809622 2015-06-09 20:05:28,576 INFO [ProcedureExecutorThread-9] master.GeneralBulkAssigner: Failed assigning 3 regions to server c2021.halxg.cloudera.com,16020,1433905510696, reassigning them
{code}
So, most regions actually made it to new locations, except for a few stragglers. All of the successfully assigned regions are then reassigned on the other side of the master restart when we replay the SCP assign step. Let me put the scan-meta and assign steps of SCP together into one step; this should do until we redo all of assign to run on Pv2. A few other things I noticed: in SCP, we only check for failover cleanup in the first step, not in every step, which means ServerCrashProcedure will run if, on reload, it is beyond the first step:
{code}
// Is master fully online? If not, yield. No processing of servers unless master is up
if (!services.getAssignmentManager().isFailoverCleanupDone()) {
  throwProcedureYieldException("Waiting on master failover to complete");
}
{code}
This means we are assigning while the Master is still coming up, a no-no (though it does not seem to have caused a problem here). Fix.
Also, I see that over the 8 hours of this particular log, each time the master crashes and comes back up, we queue a ServerCrashProcedure for c2022 because an empty dir never gets cleaned up:
{code}
39 2015-06-09 22:15:33,074 WARN [ProcedureExecutorThread-0] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2022.halxg.cloudera.com,16020,1433902151857-splitting
{code}
Fix this too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
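The failover-cleanup fix described above can be sketched as follows. This is a hedged simplification, not the actual ServerCrashProcedure code: `failoverCleanupDone`, `executeStep`, and `yieldIfMasterNotUp` are illustrative stand-ins for the real HBase internals. The point is that the guard runs at the top of every step, so a procedure reloaded beyond its first step still yields until master failover cleanup completes.

```java
// Sketch only: moving the "is master fully online?" guard from the first
// step to every step of a multi-step crash procedure. Names here are
// hypothetical stand-ins, not the real HBase APIs.
public class ServerCrashGuardSketch {
    static class ProcedureYieldException extends Exception {
        ProcedureYieldException(String msg) { super(msg); }
    }

    // Stand-in for AssignmentManager#isFailoverCleanupDone().
    static boolean failoverCleanupDone = false;

    // Called at the top of EVERY step, including the assign step.
    static void yieldIfMasterNotUp() throws ProcedureYieldException {
        if (!failoverCleanupDone) {
            throw new ProcedureYieldException("Waiting on master failover to complete");
        }
    }

    static String executeStep(String step) throws ProcedureYieldException {
        yieldIfMasterNotUp(); // guard applies to all steps, not just the first
        return "ran " + step;
    }
}
```

With this shape, a ServerCrashProcedure reloaded mid-way (e.g. at SERVER_CRASH_ASSIGN) yields and is retried later instead of assigning regions while the Master is still coming up.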
[jira] [Commented] (HBASE-13646) HRegion#execService should not try to build incomplete messages
[ https://issues.apache.org/jira/browse/HBASE-13646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615396#comment-14615396 ] Hudson commented on HBASE-13646: SUCCESS: Integrated in HBase-1.3-IT #21 (See [https://builds.apache.org/job/HBase-1.3-IT/21/]) HBASE-13646 HRegion#execService should not try to build incomplete messages (busbey: rev 8d71d283b92501baf9ba681c8cb63e234e4c2ece) * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestCoprocessorEndpoint.java * hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ResponseConverter.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/protobuf/generated/DummyRegionServerEndpointProtos.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorEndpoint.java * hbase-server/src/test/protobuf/DummyRegionServerEndpoint.proto * hbase-server/pom.xml HRegion#execService should not try to build incomplete messages --- Key: HBASE-13646 URL: https://issues.apache.org/jira/browse/HBASE-13646 Project: HBase Issue Type: Bug Components: Coprocessors, regionserver Affects Versions: 2.0.0, 1.2.0, 1.1.1 Reporter: Andrey Stepachev Assignee: Andrey Stepachev Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13646-branch-1.patch, HBASE-13646.patch, HBASE-13646.v2.patch, HBASE-13646.v2.patch If an RPC service called on a region throws an exception, execService still tries to build the response Message. With complex messages that have required fields, this complicates service code, because the service needs to pass fake protobuf objects just so the messages are buildable at all. To mitigate that, I propose checking whether the controller was failed and returning null from the call instead of failing with an exception.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
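The proposal above can be sketched as follows. This is a hedged illustration with hypothetical types, not the real HBase or protobuf API: `RpcController`, `Service`, and `execService` here are simplified stand-ins (a `String` stands in for a protobuf `Message`). The idea is simply to check the controller before building the response and short-circuit with null on failure.

```java
// Sketch only: if the coprocessor call marked the controller failed,
// return null rather than building a response message whose required
// fields were never set. All names are illustrative stand-ins.
public class ExecServiceSketch {
    static class RpcController {
        private String errorText;
        void setFailed(String why) { this.errorText = why; }
        boolean failed() { return errorText != null; }
        String getFailedOn() { return errorText; }
    }

    interface Service {
        String call(RpcController controller); // String stands in for a protobuf Message
    }

    // Before the fix: always tried to build the message, even after a failure.
    // After: check the controller first and short-circuit with null.
    static String execService(Service service, RpcController controller) {
        String response = service.call(controller);
        if (controller.failed()) {
            return null; // caller surfaces controller.getFailedOn() as the error instead
        }
        return response;
    }
}
```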
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615417#comment-14615417 ] Sean Busbey commented on HBASE-14017: - current patches do not apply to master or branch-1. Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue deletion, where we do not take an exclusive lock before deleting the table queue:
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: deletes the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
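The interleaving reported above is a classic check-then-act race: "queue empty and wlock false" is observed in one step and the delete happens in another, so a concurrent `tryWrite()` can slip in between. A minimal sketch of a fix, with illustrative names rather than the real MasterProcedureQueue API, makes the check and the removal one atomic action by holding the same monitor that the lock acquisition uses:

```java
// Sketch only (hypothetical names, not the real MasterProcedureQueue API):
// markTableAsDeleted must observe the lock state and remove the queue
// atomically, otherwise a concurrent tryExclusiveLock() can acquire the
// lock on a queue that is about to be deleted.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TableQueueSketch {
    static class TableQueue {
        private boolean wlock = false;
        synchronized boolean tryExclusiveLock() {
            if (wlock) return false;
            wlock = true;
            return true;
        }
        synchronized void releaseExclusiveLock() { wlock = false; }
        synchronized boolean isLocked() { return wlock; }
    }

    final Map<String, TableQueue> queues = new ConcurrentHashMap<>();

    // Atomic check-and-delete: the monitor held here is the same one
    // tryExclusiveLock() synchronizes on, so the interleaving from the
    // bug report (lock acquired between check and delete) cannot happen.
    boolean markTableAsDeleted(String table) {
        TableQueue q = queues.get(table);
        if (q == null) return true; // already gone
        synchronized (q) {
            if (!q.isLocked()) {
                queues.remove(table);
                return true;
            }
        }
        return false; // someone holds the lock; caller retries later
    }
}
```

This is one way to close the window; the patch attached to the JIRA may well take a different approach (e.g. requiring the exclusive lock before deletion).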
[jira] [Updated] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated HBASE-14017: Attachment: HBASE-14017.as-pushed-master.patch attaching patch as pushed to master for [~mbertozzi] Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue deletion, where we do not take an exclusive lock before deleting the table queue:
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: deletes the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13849) Remove restore and clone snapshot from the WebUI
[ https://issues.apache.org/jira/browse/HBASE-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615594#comment-14615594 ] Hudson commented on HBASE-13849: FAILURE: Integrated in HBase-0.98 #1047 (See [https://builds.apache.org/job/HBase-0.98/1047/]) HBASE-13849 Remove restore and clone snapshot from the WebUI (busbey: rev b30d0311aade4e5c8cff23cb4a3dadc742f6e237) * hbase-server/src/main/resources/hbase-webapps/master/snapshot.jsp Remove restore and clone snapshot from the WebUI Key: HBASE-13849 URL: https://issues.apache.org/jira/browse/HBASE-13849 Project: HBase Issue Type: Bug Components: snapshots Affects Versions: 1.0.1, 1.1.0, 0.98.13, 1.2.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13849-v0.patch Remove the clone and restore snapshot buttons from the WebUI. The first reason is that the operation may take too long to have the user wait on the WebUI. The second reason is that an action from the WebUI does not play well with security, since it is going to be executed as the hbase user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13849) Remove restore and clone snapshot from the WebUI
[ https://issues.apache.org/jira/browse/HBASE-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615609#comment-14615609 ] Hudson commented on HBASE-13849: FAILURE: Integrated in HBase-TRUNK #6631 (See [https://builds.apache.org/job/HBase-TRUNK/6631/]) HBASE-13849 Remove restore and clone snapshot from the WebUI (busbey: rev d26978a2ff26a918ccbb52fe002294d567e5d8f9) * hbase-server/src/main/resources/hbase-webapps/master/snapshot.jsp Remove restore and clone snapshot from the WebUI Key: HBASE-13849 URL: https://issues.apache.org/jira/browse/HBASE-13849 Project: HBase Issue Type: Bug Components: snapshots Affects Versions: 1.0.1, 1.1.0, 0.98.13, 1.2.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13849-v0.patch Remove the clone and restore snapshot buttons from the WebUI. The first reason is that the operation may take too long to have the user wait on the WebUI. The second reason is that an action from the WebUI does not play well with security, since it is going to be executed as the hbase user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13927) Allow hbase-daemon.sh to conditionally redirect the log or not
[ https://issues.apache.org/jira/browse/HBASE-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615537#comment-14615537 ] Hudson commented on HBASE-13927: FAILURE: Integrated in HBase-1.3 #37 (See [https://builds.apache.org/job/HBase-1.3/37/]) HBASE-13927 Allow hbase-daemon.sh to conditionally redirect the log or not (busbey: rev e28094fe4d1569a67131322e59090a0196889d05) * bin/hbase-daemon.sh Allow hbase-daemon.sh to conditionally redirect the log or not -- Key: HBASE-13927 URL: https://issues.apache.org/jira/browse/HBASE-13927 Project: HBase Issue Type: Improvement Components: shell Affects Versions: 2.0.0, 1.2.0 Reporter: Elliott Clark Assignee: Elliott Clark Labels: shell Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13927.patch, HBASE-13927.patch Kind of like HBASE_NOEXEC allow hbase-daemon to skip redirecting to a log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13646) HRegion#execService should not try to build incomplete messages
[ https://issues.apache.org/jira/browse/HBASE-13646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615539#comment-14615539 ] Hudson commented on HBASE-13646: FAILURE: Integrated in HBase-1.3 #37 (See [https://builds.apache.org/job/HBase-1.3/37/]) HBASE-13646 HRegion#execService should not try to build incomplete messages (busbey: rev 8d71d283b92501baf9ba681c8cb63e234e4c2ece) * hbase-server/src/test/protobuf/DummyRegionServerEndpoint.proto * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestCoprocessorEndpoint.java * hbase-server/pom.xml * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorEndpoint.java * hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ResponseConverter.java * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/protobuf/generated/DummyRegionServerEndpointProtos.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java HRegion#execService should not try to build incomplete messages --- Key: HBASE-13646 URL: https://issues.apache.org/jira/browse/HBASE-13646 Project: HBase Issue Type: Bug Components: Coprocessors, regionserver Affects Versions: 2.0.0, 1.2.0, 1.1.1 Reporter: Andrey Stepachev Assignee: Andrey Stepachev Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13646-branch-1.patch, HBASE-13646.patch, HBASE-13646.v2.patch, HBASE-13646.v2.patch If an RPC service called on a region throws an exception, execService still tries to build the response Message. With complex messages that have required fields, this complicates service code, because the service needs to pass fake protobuf objects just so the messages are buildable at all. To mitigate that, I propose checking whether the controller was failed and returning null from the call instead of failing with an exception.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13927) Allow hbase-daemon.sh to conditionally redirect the log or not
[ https://issues.apache.org/jira/browse/HBASE-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615583#comment-14615583 ] Hudson commented on HBASE-13927: FAILURE: Integrated in HBase-1.2 #53 (See [https://builds.apache.org/job/HBase-1.2/53/]) HBASE-13927 Allow hbase-daemon.sh to conditionally redirect the log or not (busbey: rev 3e2636b4b314072ace25aba809c1631332a4076e) * bin/hbase-daemon.sh Allow hbase-daemon.sh to conditionally redirect the log or not -- Key: HBASE-13927 URL: https://issues.apache.org/jira/browse/HBASE-13927 Project: HBase Issue Type: Improvement Components: shell Affects Versions: 2.0.0, 1.2.0 Reporter: Elliott Clark Assignee: Elliott Clark Labels: shell Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13927.patch, HBASE-13927.patch Kind of like HBASE_NOEXEC allow hbase-daemon to skip redirecting to a log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13849) Remove restore and clone snapshot from the WebUI
[ https://issues.apache.org/jira/browse/HBASE-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615584#comment-14615584 ] Hudson commented on HBASE-13849: FAILURE: Integrated in HBase-1.2 #53 (See [https://builds.apache.org/job/HBase-1.2/53/]) HBASE-13849 Remove restore and clone snapshot from the WebUI (busbey: rev 2bc55875cb533af1a4bbe3bb02482b4e9d4bf4c8) * hbase-server/src/main/resources/hbase-webapps/master/snapshot.jsp Remove restore and clone snapshot from the WebUI Key: HBASE-13849 URL: https://issues.apache.org/jira/browse/HBASE-13849 Project: HBase Issue Type: Bug Components: snapshots Affects Versions: 1.0.1, 1.1.0, 0.98.13, 1.2.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13849-v0.patch Remove the clone and restore snapshot buttons from the WebUI. The first reason is that the operation may take too long to have the user wait on the WebUI. The second reason is that an action from the WebUI does not play well with security, since it is going to be executed as the hbase user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13927) Allow hbase-daemon.sh to conditionally redirect the log or not
[ https://issues.apache.org/jira/browse/HBASE-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615613#comment-14615613 ] Hudson commented on HBASE-13927: FAILURE: Integrated in HBase-1.2-IT #39 (See [https://builds.apache.org/job/HBase-1.2-IT/39/]) HBASE-13927 Allow hbase-daemon.sh to conditionally redirect the log or not (busbey: rev 3e2636b4b314072ace25aba809c1631332a4076e) * bin/hbase-daemon.sh Allow hbase-daemon.sh to conditionally redirect the log or not -- Key: HBASE-13927 URL: https://issues.apache.org/jira/browse/HBASE-13927 Project: HBase Issue Type: Improvement Components: shell Affects Versions: 2.0.0, 1.2.0 Reporter: Elliott Clark Assignee: Elliott Clark Labels: shell Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13927.patch, HBASE-13927.patch Kind of like HBASE_NOEXEC allow hbase-daemon to skip redirecting to a log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615647#comment-14615647 ] Stephen Yuan Jiang commented on HBASE-14017: [~busbey] Sean, could you also push the same master patch to branch-1.2? The code in branch-1.1 is a little different; you need to use the branch-1.1 patch in this JIRA to push the change to branch-1.1. Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue deletion, where we do not take an exclusive lock before deleting the table queue:
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: deletes the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Yuan Jiang updated HBASE-14017: --- Fix Version/s: 1.3.0 Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue deletion, where we do not take an exclusive lock before deleting the table queue:
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: deletes the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615705#comment-14615705 ] Stephen Yuan Jiang commented on HBASE-14017: Great! Thanks, Sean. Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue deletion, where we do not take an exclusive lock before deleting the table queue:
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: deletes the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13849) Remove restore and clone snapshot from the WebUI
[ https://issues.apache.org/jira/browse/HBASE-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated HBASE-13849: Resolution: Fixed Hadoop Flags: Incompatible change Release Note: The HBase master status web page no longer allows operators to clone snapshots or restore snapshots. Status: Resolved (was: Patch Available) Remove restore and clone snapshot from the WebUI Key: HBASE-13849 URL: https://issues.apache.org/jira/browse/HBASE-13849 Project: HBase Issue Type: Bug Components: snapshots Affects Versions: 1.0.1, 1.1.0, 0.98.13, 1.2.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13849-v0.patch Remove the clone and restore snapshot buttons from the WebUI. The first reason is that the operation may take too long to have the user wait on the WebUI. The second reason is that an action from the WebUI does not play well with security, since it is going to be executed as the hbase user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13646) HRegion#execService should not try to build incomplete messages
[ https://issues.apache.org/jira/browse/HBASE-13646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615332#comment-14615332 ] Hudson commented on HBASE-13646: SUCCESS: Integrated in HBase-1.2-IT #38 (See [https://builds.apache.org/job/HBase-1.2-IT/38/]) HBASE-13646 HRegion#execService should not try to build incomplete messages (busbey: rev 042f53b2f50b7c57fcf2eec62f8c67be57b0d850) * hbase-server/pom.xml * hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ResponseConverter.java * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/protobuf/generated/DummyRegionServerEndpointProtos.java * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorEndpoint.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * hbase-server/src/test/protobuf/DummyRegionServerEndpoint.proto * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestCoprocessorEndpoint.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java HRegion#execService should not try to build incomplete messages --- Key: HBASE-13646 URL: https://issues.apache.org/jira/browse/HBASE-13646 Project: HBase Issue Type: Bug Components: Coprocessors, regionserver Affects Versions: 2.0.0, 1.2.0, 1.1.1 Reporter: Andrey Stepachev Assignee: Andrey Stepachev Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13646-branch-1.patch, HBASE-13646.patch, HBASE-13646.v2.patch, HBASE-13646.v2.patch If an RPC service called on a region throws an exception, execService still tries to build the response Message. With complex messages that have required fields, this complicates service code, because the service needs to pass fake protobuf objects just so the messages are buildable at all. To mitigate that, I propose checking whether the controller was failed and returning null from the call instead of failing with an exception.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Yuan Jiang updated HBASE-14017: --- Status: Patch Available (was: Open) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 1.1.1, 2.0.0, 1.2.0, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue deletion, where we do not take an exclusive lock before deleting the table queue:
{noformat}
Thread 1: Create table is running - the queue is empty and wlock is false
Thread 2: markTableAsDeleted sees the queue empty and wlock=false
Thread 1: tryWrite() sets wlock=true; too late
Thread 2: deletes the queue
Thread 1: never able to release the lock - NPE when trying to get the queue
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14012) Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover
[ https://issues.apache.org/jira/browse/HBASE-14012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615405#comment-14615405 ] Hudson commented on HBASE-14012: SUCCESS: Integrated in HBase-1.2 #52 (See [https://builds.apache.org/job/HBase-1.2/52/]) HBASE-14012 Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover (stack: rev 8660a6004c7bc500536e43c0d35498cfc16c9867) * hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java * hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java * hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/MasterProcedureProtos.java * hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java * hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStateStore.java * hbase-protocol/src/main/protobuf/MasterProcedure.proto Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover Key: HBASE-14012 URL: https://issues.apache.org/jira/browse/HBASE-14012 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.3.0 Attachments: 14012.txt, 14012v2.txt (Rewrite to be more explicit about what the problem is) ITBLL. Master comes up (It is being killed every 1-5 minutes or so). It is joining a running cluster (all servers up except Master with most regions assigned out on cluster). ProcedureStore has two ServerCrashProcedures unfinished (RUNNABLE state) for two separate servers. One SCP is in the middle of the assign step when master crashes (SERVER_CRASH_ASSIGN). This SCP step has this comment on it: {code} // Assign may not be idempotent. SSH used to requeue the SSH if we got an IOE assigning // which is what we are mimicing here but it looks prone to double assignment if assign // fails midway. TODO: Test. 
{code} This issue is 1.2+ only since it is ServerCrashProcedure (Added in HBASE-13616, post hbase-1.1.x). Looking at ServerShutdownHandler, how we used to do crash processing before we moved over to the Pv2 framework, SSH may have (accidentally) avoided this issue since it does its processing in a big blob starting over if killed mid-crash. In particular, post-crash, SSH scans hbase:meta to find servers that were on the downed server. SCP scanneds Meta in one step, saves off the regions it finds into the ProcedureStore, and then in the next step, does actual assign. In this case, we crashed post-meta scan and during assign. Assign is a bulk assign. It mostly succeeded but got this: {code} 809622 2015-06-09 20:05:28,576 INFO [ProcedureExecutorThread-9] master.GeneralBulkAssigner: Failed assigning 3 regions to server c2021.halxg.cloudera.com,16020,1433905510696, reassigning them {code} So, most regions actually made it to new locations except for a few stragglers. All of the successfully assigned regions then are reassigned on other side of master restart when we replay the SCP assign step. Let me put together the scan meta and assign steps in SCP; this should do until we redo all of assign to run on Pv2. A few other things I noticed: In SCP, we only check if failover in first step, not for every step, which means ServerCrashProcedure will run if on reload it is beyond the first step. {code} // Is master fully online? If not, yield. No processing of servers unless master is up if (!services.getAssignmentManager().isFailoverCleanupDone()) { throwProcedureYieldException(Waiting on master failover to complete); } {code} This means we are assigning while Master is still coming up, a no-no (though it does not seem to have caused problem here). Fix. 
Also, I see that over the 8 hours of this particular log, each time the master crashes and comes back up, we queue a ServerCrashProcedure for c2022 because an empty dir never gets cleaned up: {code} 39 2015-06-09 22:15:33,074 WARN [ProcedureExecutorThread-0] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2022.halxg.cloudera.com,16020,1433902151857-splitting {code} Fix this too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
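The per-step yield check suggested in the comment above can be sketched as follows. This is a hypothetical simplification, not the actual ServerCrashProcedure code: isFailoverCleanupDone() echoes the quoted snippet, while the real state-machine dispatch is reduced to an abstract runStep().

```java
// Hypothetical sketch: yield before EVERY step until master failover cleanup
// is done, not only before the first step.
abstract class ScpYieldSketch {
  abstract boolean isFailoverCleanupDone(); // stands in for the AssignmentManager check
  abstract void runStep(int state);         // stands in for the real SCP state handler

  /** @return true if the step ran, false if the procedure should yield. */
  boolean executeStep(int state) {
    if (!isFailoverCleanupDone()) {
      return false; // yield: no assigning while the master is still coming up
    }
    runStep(state);
    return true;
  }
}
```

With this shape, a procedure reloaded beyond its first step still yields until failover cleanup completes, instead of running steps against a master that is not fully online.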
[jira] [Commented] (HBASE-13646) HRegion#execService should not try to build incomplete messages
[ https://issues.apache.org/jira/browse/HBASE-13646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615404#comment-14615404 ] Hudson commented on HBASE-13646: SUCCESS: Integrated in HBase-1.2 #52 (See [https://builds.apache.org/job/HBase-1.2/52/]) HBASE-13646 HRegion#execService should not try to build incomplete messages (busbey: rev 042f53b2f50b7c57fcf2eec62f8c67be57b0d850) * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestCoprocessorEndpoint.java * hbase-server/src/test/protobuf/DummyRegionServerEndpoint.proto * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/protobuf/generated/DummyRegionServerEndpointProtos.java * hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ResponseConverter.java * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorEndpoint.java * hbase-server/pom.xml HRegion#execService should not try to build incomplete messages --- Key: HBASE-13646 URL: https://issues.apache.org/jira/browse/HBASE-13646 Project: HBase Issue Type: Bug Components: Coprocessors, regionserver Affects Versions: 2.0.0, 1.2.0, 1.1.1 Reporter: Andrey Stepachev Assignee: Andrey Stepachev Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13646-branch-1.patch, HBASE-13646.patch, HBASE-13646.v2.patch, HBASE-13646.v2.patch If some RPC service called on a region throws an exception, execService still tries to build a Message. In the case of complex messages with required fields, this complicates service code, because the service needs to pass fake protobuf objects just so the response is buildable at all. To mitigate that, I propose to check whether the controller has failed and return null from the call instead of failing with an exception. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
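The mitigation described in the issue above can be sketched as follows. This is a hypothetical simplification: the real code deals with ServerRpcController and protobuf Message types, which are reduced here to a plain failure flag and Object.

```java
// Hypothetical sketch of the proposed fix direction: if the coprocessor call
// failed the controller, return null rather than force-building a response
// message that may have unset required fields.
class ExecServiceSketch {
  static final class Controller {
    private String error;
    void setFailed(String msg) { this.error = msg; }
    boolean failed() { return error != null; }
    String errorText() { return error; }
  }

  /** @return the built response, or null if the call failed the controller. */
  static Object buildResponseOrNull(Controller controller, Object response) {
    if (controller.failed()) {
      return null; // caller propagates the controller's failure instead
    }
    return response;
  }
}
```

The point of the design is that the error travels through the controller, so there is never a need to construct a barely-buildable fake message just to satisfy the return type.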
[jira] [Commented] (HBASE-7912) HBase Backup/Restore Based on HBase Snapshot
[ https://issues.apache.org/jira/browse/HBASE-7912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615385#comment-14615385 ] Vladimir Rodionov commented on HBASE-7912: -- So, the deduplication can be implemented either during read (the original approach mentioned in the doc), during restore (see above), or even during backup. Deduplication during backup will require support for WAL filtering during copy, but this is a feature which is on the roadmap. There are pros and cons for each of them. READ: very simple to implement, but we will have duplication of some data in HBase's file system after restore. RESTORE: not as simple, with no data duplication in the HBase cluster after restore, but there will be some data duplication in the backup location. BACKUP: not as simple either, but no duplication at all ... though it relies on support for WAL filtering during copy. HBase Backup/Restore Based on HBase Snapshot Key: HBASE-7912 URL: https://issues.apache.org/jira/browse/HBASE-7912 Project: HBase Issue Type: Sub-task Reporter: Richard Ding Assignee: Vladimir Rodionov Labels: backup Fix For: 2.0.0 Attachments: HBaseBackupRestore-Jira-7912-DesignDoc-v1.pdf, HBaseBackupRestore-Jira-7912-DesignDoc-v2.pdf, HBaseBackupRestore-Jira-7912-v4.pdf, HBaseBackupRestore-Jira-7912-v5.pdf, HBase_BackupRestore-Jira-7912-CLI-v1.pdf Finally, we completed the implementation of our backup/restore solution, and would like to share it with the community through this jira. We are leveraging the existing hbase snapshot feature, and provide a general solution to common users. Our full backup uses snapshots to capture metadata locally and exportsnapshot to move data to another cluster; the incremental backup uses offline-WALPlayer to back up HLogs; we also leverage globally distributed log roll and flush to improve performance; other add-on features include convert, merge, progress report, and CLI commands. 
So that a common user can back up hbase data without in-depth knowledge of hbase. Our solution also contains some usability features for enterprise users. The detailed design document and CLI commands will be attached in this jira. We plan to use 10~12 subtasks to share each of the following features, and document the detailed implementation in the subtasks: * *Full Backup* : provide local and remote backup/restore for a list of tables * *offline-WALPlayer* to convert HLogs to HFiles offline (for incremental backup) * *distributed* Logroll and distributed flush * Backup *Manifest* and history * *Incremental* backup: to build on top of full backup as daily/weekly backup * *Convert* incremental backup WAL files into HFiles * *Merge* several backup images into one (like merging weekly into monthly) * *add and remove* tables to and from a Backup image * *Cancel* a backup process * backup progress *status* * full backup based on an *existing snapshot* *-* *Below is the original description, kept here as the history of the design and discussion back in 2013* There have been attempts in the past to come up with a viable HBase backup/restore solution (e.g., HBASE-4618). Recently, there have been many advancements and new features in HBase, for example, FileLink, Snapshot, and Distributed Barrier Procedure. This is a proposal for a backup/restore solution that utilizes these new features to achieve better performance and consistency. A common practice of backup and restore in databases is to first take a full baseline backup, and then periodically take incremental backups that capture the changes since the full baseline backup. An HBase cluster can store massive amounts of data. Combining full backups with incremental backups has tremendous benefits for HBase as well. The following is a typical scenario for full and incremental backup. # The user takes a full backup of a table or a set of tables in HBase. 
# The user schedules periodical incremental backups to capture the changes from the full backup, or from the last incremental backup. # The user needs to restore table data to a past point in time. # The full backup is restored to the table(s) or to different table name(s). Then the incremental backups that are up to the desired point in time are applied on top of the full backup. We would support the following key features and capabilities. * Full backup uses HBase snapshot to capture HFiles. * Use HBase WALs to capture incremental changes, but use bulk load of HFiles for fast incremental restore. * Support single table or a set of tables, and column-family-level backup and restore. * Restore to different
[jira] [Commented] (HBASE-14012) Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover
[ https://issues.apache.org/jira/browse/HBASE-14012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615333#comment-14615333 ] Hudson commented on HBASE-14012: SUCCESS: Integrated in HBase-1.2-IT #38 (See [https://builds.apache.org/job/HBase-1.2-IT/38/]) HBASE-14012 Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover (stack: rev 8660a6004c7bc500536e43c0d35498cfc16c9867) * hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStateStore.java * hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java * hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/MasterProcedureProtos.java * hbase-protocol/src/main/protobuf/MasterProcedure.proto * hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java * hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover Key: HBASE-14012 URL: https://issues.apache.org/jira/browse/HBASE-14012 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.3.0 Attachments: 14012.txt, 14012v2.txt (Rewrite to be more explicit about what the problem is) ITBLL. Master comes up (It is being killed every 1-5 minutes or so). It is joining a running cluster (all servers up except Master with most regions assigned out on cluster). ProcedureStore has two ServerCrashProcedures unfinished (RUNNABLE state) for two separate servers. One SCP is in the middle of the assign step when master crashes (SERVER_CRASH_ASSIGN). This SCP step has this comment on it: {code} // Assign may not be idempotent. SSH used to requeue the SSH if we got an IOE assigning // which is what we are mimicing here but it looks prone to double assignment if assign // fails midway. TODO: Test. 
{code} This issue is 1.2+ only since it is ServerCrashProcedure (added in HBASE-13616, post hbase-1.1.x). Looking at ServerShutdownHandler, how we used to do crash processing before we moved over to the Pv2 framework, SSH may have (accidentally) avoided this issue since it does its processing in one big blob, starting over if killed mid-crash. In particular, post-crash, SSH scans hbase:meta to find the regions that were on the downed server. SCP scans meta in one step, saves off the regions it finds into the ProcedureStore, and then, in the next step, does the actual assign. In this case, we crashed post-meta-scan and during assign. Assign is a bulk assign. It mostly succeeded but got this: {code} 809622 2015-06-09 20:05:28,576 INFO [ProcedureExecutorThread-9] master.GeneralBulkAssigner: Failed assigning 3 regions to server c2021.halxg.cloudera.com,16020,1433905510696, reassigning them {code} So, most regions actually made it to new locations except for a few stragglers. All of the successfully assigned regions are then reassigned on the other side of the master restart when we replay the SCP assign step. Let me put together the scan-meta and assign steps in SCP; this should do until we redo all of assign to run on Pv2. A few other things I noticed: in SCP, we only check for failover in the first step, not for every step, which means ServerCrashProcedure will run if on reload it is beyond the first step. {code} // Is master fully online? If not, yield. No processing of servers unless master is up if (!services.getAssignmentManager().isFailoverCleanupDone()) { throwProcedureYieldException("Waiting on master failover to complete"); } {code} This means we are assigning while the Master is still coming up, a no-no (though it does not seem to have caused a problem here). Fix. 
Also, I see that over the 8 hours of this particular log, each time the master crashes and comes back up, we queue a ServerCrashProcedure for c2022 because an empty dir never gets cleaned up: {code} 39 2015-06-09 22:15:33,074 WARN [ProcedureExecutorThread-0] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2022.halxg.cloudera.com,16020,1433902151857-splitting {code} Fix this too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13849) Remove restore and clone snapshot from the WebUI
[ https://issues.apache.org/jira/browse/HBASE-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615331#comment-14615331 ] Hudson commented on HBASE-13849: SUCCESS: Integrated in HBase-1.2-IT #38 (See [https://builds.apache.org/job/HBase-1.2-IT/38/]) HBASE-13849 Remove restore and clone snapshot from the WebUI (busbey: rev 2bc55875cb533af1a4bbe3bb02482b4e9d4bf4c8) * hbase-server/src/main/resources/hbase-webapps/master/snapshot.jsp Remove restore and clone snapshot from the WebUI Key: HBASE-13849 URL: https://issues.apache.org/jira/browse/HBASE-13849 Project: HBase Issue Type: Bug Components: snapshots Affects Versions: 1.0.1, 1.1.0, 0.98.13, 1.2.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13849-v0.patch Remove the clone and restore snapshot buttons from the WebUI. The first reason is that the operation may take too long to have the user wait on the WebUI. The second reason is that an action from the WebUI does not play well with security, since it is going to be executed as the hbase user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-7912) HBase Backup/Restore Based on HBase Snapshot
[ https://issues.apache.org/jira/browse/HBASE-7912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615352#comment-14615352 ] Vladimir Rodionov commented on HBASE-7912: -- Sequence ID is per Region and the WAL is per RegionServer. We will need to store the maximum sequence id per region after a full backup. When we finish the snapshot, we can collect all maximum sequence ids from the store files and store them as a Map<Region, long>. During the first incremental restore, we will need to check the sequence id of every WAL Entry. If it is below or equal to the maximum seq id for this region, skip this entry. The logic remains unchanged for all other incremental restores after the first one. HBase Backup/Restore Based on HBase Snapshot Key: HBASE-7912 URL: https://issues.apache.org/jira/browse/HBASE-7912 Project: HBase Issue Type: Sub-task Reporter: Richard Ding Assignee: Vladimir Rodionov Labels: backup Fix For: 2.0.0 Attachments: HBaseBackupRestore-Jira-7912-DesignDoc-v1.pdf, HBaseBackupRestore-Jira-7912-DesignDoc-v2.pdf, HBaseBackupRestore-Jira-7912-v4.pdf, HBaseBackupRestore-Jira-7912-v5.pdf, HBase_BackupRestore-Jira-7912-CLI-v1.pdf Finally, we completed the implementation of our backup/restore solution, and would like to share it with the community through this jira. We are leveraging the existing hbase snapshot feature, and provide a general solution to common users. Our full backup uses snapshots to capture metadata locally and exportsnapshot to move data to another cluster; the incremental backup uses offline-WALPlayer to back up HLogs; we also leverage globally distributed log roll and flush to improve performance; other add-on features include convert, merge, progress report, and CLI commands. So that a common user can back up hbase data without in-depth knowledge of hbase. Our solution also contains some usability features for enterprise users. The detailed design document and CLI commands will be attached in this jira. 
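The sequence-id dedup rule described in the comment above can be sketched as follows. This is a hypothetical simplification: regions are keyed by String rather than the real Region type, and the method names are illustrative, not the backup feature's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the dedup rule: after a full backup, remember the
// max sequence id per region; during the first incremental restore, skip any
// WAL entry at or below that region's recorded maximum.
class WalDedupFilterSketch {
  private final Map<String, Long> maxSeqIdPerRegion = new HashMap<>();

  void recordFullBackupMaxSeqId(String region, long maxSeqId) {
    maxSeqIdPerRegion.put(region, maxSeqId);
  }

  /**
   * @return true if the entry should be replayed, false if its data is
   *         already contained in the full backup's store files.
   */
  boolean shouldReplay(String region, long entrySeqId) {
    Long max = maxSeqIdPerRegion.get(region);
    return max == null || entrySeqId > max;
  }
}
```

As the comment notes, only the first incremental restore needs this filter; later incrementals start past the recorded maximums, so the logic remains unchanged for them.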
We plan to use 10~12 subtasks to share each of the following features, and document the detailed implementation in the subtasks: * *Full Backup* : provide local and remote backup/restore for a list of tables * *offline-WALPlayer* to convert HLogs to HFiles offline (for incremental backup) * *distributed* Logroll and distributed flush * Backup *Manifest* and history * *Incremental* backup: to build on top of full backup as daily/weekly backup * *Convert* incremental backup WAL files into HFiles * *Merge* several backup images into one (like merging weekly into monthly) * *add and remove* tables to and from a Backup image * *Cancel* a backup process * backup progress *status* * full backup based on an *existing snapshot* *-* *Below is the original description, kept here as the history of the design and discussion back in 2013* There have been attempts in the past to come up with a viable HBase backup/restore solution (e.g., HBASE-4618). Recently, there have been many advancements and new features in HBase, for example, FileLink, Snapshot, and Distributed Barrier Procedure. This is a proposal for a backup/restore solution that utilizes these new features to achieve better performance and consistency. A common practice of backup and restore in databases is to first take a full baseline backup, and then periodically take incremental backups that capture the changes since the full baseline backup. An HBase cluster can store massive amounts of data. Combining full backups with incremental backups has tremendous benefits for HBase as well. The following is a typical scenario for full and incremental backup. # The user takes a full backup of a table or a set of tables in HBase. # The user schedules periodical incremental backups to capture the changes from the full backup, or from the last incremental backup. # The user needs to restore table data to a past point in time. # The full backup is restored to the table(s) or to different table name(s). 
Then the incremental backups that are up to the desired point in time are applied on top of the full backup. We would support the following key features and capabilities. * Full backup uses HBase snapshot to capture HFiles. * Use HBase WALs to capture incremental changes, but use bulk load of HFiles for fast incremental restore. * Support single table or a set of tables, and column-family-level backup and restore. * Restore to different table names. * Support adding additional tables or CFs to the backup set without interrupting the incremental backup schedule. * Support rollup/combining of incremental backups into a longer period and
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615388#comment-14615388 ] Srikanth Srungarapu commented on HBASE-14017: - +1 lgtm. Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue delete where we don't take an exclusive lock before deleting the table {noformat} Thread 1: Create table is running - the queue is empty and wlock is false Thread 2: markTableAsDeleted sees the queue empty and wlock=false Thread 1: tryWrite() sets wlock=true; too late Thread 2: deletes the queue Thread 1: never able to release the lock - NPE when trying to get the queue {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
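The race in the trace above comes from checking "queue empty and unlocked" without actually holding the exclusive lock. A hypothetical sketch of the fix direction, with illustrative names rather than the real MasterProcedureQueue API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: markTableAsDeleted-style deletion must take the
// exclusive (write) lock before inspecting the queue, so it cannot race with
// a create-table procedure that is about to acquire the same lock.
class TableQueueSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Deque<Runnable> queue = new ArrayDeque<>();
  private boolean deleted = false;

  void add(Runnable proc) { queue.add(proc); }

  /** Delete only while holding the exclusive lock and only if empty. */
  boolean markAsDeletedIfEmpty() {
    if (!lock.writeLock().tryLock()) {
      return false; // another procedure holds the lock; do not delete
    }
    try {
      if (queue.isEmpty()) {
        deleted = true;
      }
      return deleted;
    } finally {
      lock.writeLock().unlock();
    }
  }

  boolean isDeleted() { return deleted; }
}
```

With the check and the deletion under one lock, Thread 2 in the trace either sees the lock held (and backs off) or deletes before Thread 1 can acquire it; the torn interleaving that caused the NPE cannot occur.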
[jira] [Updated] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low
[ https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matteo Bertozzi updated HBASE-13832: Attachment: HBASE-13832-v5.patch Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low --- Key: HBASE-13832 URL: https://issues.apache.org/jira/browse/HBASE-13832 Project: HBase Issue Type: Sub-task Components: master, proc-v2 Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Stephen Yuan Jiang Assignee: Matteo Bertozzi Priority: Critical Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, hbase-13832-v3.patch When the data node count is 3, we got a failure in WALProcedureStore#syncLoop() during master start. The failure prevents the master from getting started. {noformat} 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Sync slot failed, abort. java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]], original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration. 
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951) {noformat} One proposal is to implement logic similar to FSHLog: if an IOException is thrown during syncLoop in WALProcedureStore#start(), instead of aborting immediately, we could try to roll the log and see whether this resolves the issue; if the new log cannot be created, or rolling the log throws more exceptions, we then abort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
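The roll-before-abort proposal above can be sketched as follows. This is a hypothetical simplification: syncSlots() and rollWriter() stand in for the real WALProcedureStore internals, and the retry bound is an assumption added for the sketch.

```java
import java.io.IOException;

// Hypothetical sketch: on a sync IOException, try rolling to a fresh log
// (as FSHLog does) before giving up and aborting the master.
abstract class ProcedureStoreSyncSketch {
  abstract void syncSlots() throws IOException; // stands in for the real sync
  abstract boolean rollWriter();                // true if a new log was created

  /** @return true if the slots were synced, possibly after a log roll. */
  boolean syncWithRollRetry(int maxRollRetries) {
    for (int attempt = 0; ; attempt++) {
      try {
        syncSlots();
        return true;
      } catch (IOException e) {
        if (attempt >= maxRollRetries || !rollWriter()) {
          return false; // cannot create a new log either: caller aborts
        }
        // rolled to a new log; retry the sync on the fresh pipeline
      }
    }
  }
}
```

The design mirrors the proposal: a transient bad-datanode pipeline is survived by rolling, and only a failure to roll (or repeated failures) escalates to abort.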
[jira] [Updated] (HBASE-13927) Allow hbase-daemon.sh to conditionally redirect the log or not
[ https://issues.apache.org/jira/browse/HBASE-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated HBASE-13927: Resolution: Fixed Status: Resolved (was: Patch Available) Allow hbase-daemon.sh to conditionally redirect the log or not -- Key: HBASE-13927 URL: https://issues.apache.org/jira/browse/HBASE-13927 Project: HBase Issue Type: Improvement Components: shell Affects Versions: 2.0.0, 1.2.0 Reporter: Elliott Clark Assignee: Elliott Clark Labels: shell Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13927.patch, HBASE-13927.patch Kind of like HBASE_NOEXEC, allow hbase-daemon.sh to skip redirecting to a log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13927) Allow hbase-daemon.sh to conditionally redirect the log or not
[ https://issues.apache.org/jira/browse/HBASE-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated HBASE-13927: Issue Type: Improvement (was: Bug) Allow hbase-daemon.sh to conditionally redirect the log or not -- Key: HBASE-13927 URL: https://issues.apache.org/jira/browse/HBASE-13927 Project: HBase Issue Type: Improvement Components: shell Affects Versions: 2.0.0, 1.2.0 Reporter: Elliott Clark Assignee: Elliott Clark Labels: shell Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13927.patch, HBASE-13927.patch Kind of like HBASE_NOEXEC, allow hbase-daemon.sh to skip redirecting to a log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14012) Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover
[ https://issues.apache.org/jira/browse/HBASE-14012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615400#comment-14615400 ] Hudson commented on HBASE-14012: SUCCESS: Integrated in HBase-TRUNK #6630 (See [https://builds.apache.org/job/HBase-TRUNK/6630/]) HBASE-14012 Double Assignment and Dataloss when ServerCrashProcedure (stack: rev cff1a5f1f5cc8f5e1e99c6aecb39e2e69f86bf7e) * hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java * hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java * hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStateStore.java * hbase-protocol/src/main/protobuf/MasterProcedure.proto * hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/MasterProcedureProtos.java * hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover Key: HBASE-14012 URL: https://issues.apache.org/jira/browse/HBASE-14012 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.3.0 Attachments: 14012.txt, 14012v2.txt (Rewrite to be more explicit about what the problem is) ITBLL. Master comes up (It is being killed every 1-5 minutes or so). It is joining a running cluster (all servers up except Master with most regions assigned out on cluster). ProcedureStore has two ServerCrashProcedures unfinished (RUNNABLE state) for two separate servers. One SCP is in the middle of the assign step when master crashes (SERVER_CRASH_ASSIGN). This SCP step has this comment on it: {code} // Assign may not be idempotent. SSH used to requeue the SSH if we got an IOE assigning // which is what we are mimicing here but it looks prone to double assignment if assign // fails midway. TODO: Test. 
{code} This issue is 1.2+ only since it is ServerCrashProcedure (added in HBASE-13616, post hbase-1.1.x). Looking at ServerShutdownHandler, how we used to do crash processing before we moved over to the Pv2 framework, SSH may have (accidentally) avoided this issue since it does its processing in one big blob, starting over if killed mid-crash. In particular, post-crash, SSH scans hbase:meta to find the regions that were on the downed server. SCP scans meta in one step, saves off the regions it finds into the ProcedureStore, and then, in the next step, does the actual assign. In this case, we crashed post-meta-scan and during assign. Assign is a bulk assign. It mostly succeeded but got this: {code} 809622 2015-06-09 20:05:28,576 INFO [ProcedureExecutorThread-9] master.GeneralBulkAssigner: Failed assigning 3 regions to server c2021.halxg.cloudera.com,16020,1433905510696, reassigning them {code} So, most regions actually made it to new locations except for a few stragglers. All of the successfully assigned regions are then reassigned on the other side of the master restart when we replay the SCP assign step. Let me put together the scan-meta and assign steps in SCP; this should do until we redo all of assign to run on Pv2. A few other things I noticed: in SCP, we only check for failover in the first step, not for every step, which means ServerCrashProcedure will run if on reload it is beyond the first step. {code} // Is master fully online? If not, yield. No processing of servers unless master is up if (!services.getAssignmentManager().isFailoverCleanupDone()) { throwProcedureYieldException("Waiting on master failover to complete"); } {code} This means we are assigning while the Master is still coming up, a no-no (though it does not seem to have caused a problem here). Fix. 
Also, I see that over the 8 hours of this particular log, each time the master crashes and comes back up, we queue a ServerCrashProcedure for c2022 because an empty dir never gets cleaned up: {code} 39 2015-06-09 22:15:33,074 WARN [ProcedureExecutorThread-0] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2022.halxg.cloudera.com,16020,1433902151857-splitting {code} Fix this too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13965) Stochastic Load Balancer JMX Metrics
[ https://issues.apache.org/jira/browse/HBASE-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615265#comment-14615265 ] Ted Yu commented on HBASE-13965: {code} 32 public void updateStochasticCost(String tableName, String costFunctionName, {code} Should the tableName parameter be of type TableName? For MetricsStochasticBalancerSourceImpl.java, it should be annotated with @InterfaceAudience.Private {code} 56 * The function that report stochastic load balancer costs to JMX {code} 'that report' -> 'that reports' {code} 58 public void updateStochasticCost(String tableName, String costFunctionName, 59 String costFunctionDesc, Double value) { {code} costFunctionDesc isn't used in the method. {code} 82 String attrName = tableName + ((tableName.length() <= 0) ? "" : TABLE_FUNCTION_SEP) + key; {code} When would tableName.length() be <= 0? There is a check at the beginning of updateStochasticCost(). {code} 124 private Double[] lastSubcosts; {code} nit: uppercase the 'c' of 'costs' {code} 470 this.lastSubcosts[i] = multiplier * cost; 471 total += multiplier * cost; {code} There is no need to perform the same multiplication twice. Stochastic Load Balancer JMX Metrics Key: HBASE-13965 URL: https://issues.apache.org/jira/browse/HBASE-13965 Project: HBase Issue Type: Improvement Components: Balancer, metrics Reporter: Lei Chen Assignee: Lei Chen Attachments: HBASE-13965-v3.patch, HBASE-13965_v2.patch, HBase-13965-v1.patch, stochasticloadbalancerclasses_v2.png Today’s default HBase load balancer (the Stochastic load balancer) is cost-function based. The cost function weights are tunable, but no visibility into those cost function results is directly provided. A driving example is a cluster we have been tuning which has skewed rack sizes (one rack has half the nodes of the other few racks). We are tuning the cluster for uniform response time from all region servers with the ability to tolerate a rack failure. 
Balancing LocalityCost, RegionReplicaRackCost and RegionCountSkewCost is difficult without a way to attribute each cost function’s contribution to the overall cost. What this jira proposes is to provide visibility via JMX into each cost function of the stochastic load balancer, as well as the overall cost of the balancing plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
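The double-multiplication nit from the review above (lines 470-471 of the patch) can be sketched as follows. This is a hypothetical simplification of the patch's aggregation loop, not the actual StochasticLoadBalancer code: names like lastSubCosts echo the review, and the array shapes are assumptions.

```java
// Hypothetical sketch: compute multiplier * cost once per cost function,
// record it for JMX reporting, and accumulate it into the total.
class CostTotalsSketch {
  static double totalCost(double[] multipliers, double[] costs, double[] lastSubCosts) {
    double total = 0;
    for (int i = 0; i < costs.length; i++) {
      double weighted = multipliers[i] * costs[i]; // single multiplication
      lastSubCosts[i] = weighted;                  // later exposed via JMX
      total += weighted;
    }
    return total;
  }
}
```

Recording the weighted sub-cost is exactly what makes per-function attribution possible: the JMX metric for each cost function and the overall plan cost come from the same computed values.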
[jira] [Commented] (HBASE-7912) HBase Backup/Restore Based on HBase Snapshot
[ https://issues.apache.org/jira/browse/HBASE-7912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615326#comment-14615326 ] Vladimir Rodionov commented on HBASE-7912: -- [~jerryhe] wrote: {quote} Regarding the section 'First incremental after full backup restore': yes, there could be data duplicated in two backups (the full and the incr). It is better to fix it during the backup. {quote} Doing this during backup requires support for backup/restore up to a sequence Id (mvcc number). Doing this in the read path is trivial (several lines of code in a StoreScanner). HBase Backup/Restore Based on HBase Snapshot Key: HBASE-7912 URL: https://issues.apache.org/jira/browse/HBASE-7912 Project: HBase Issue Type: Sub-task Reporter: Richard Ding Assignee: Vladimir Rodionov Labels: backup Fix For: 2.0.0 Attachments: HBaseBackupRestore-Jira-7912-DesignDoc-v1.pdf, HBaseBackupRestore-Jira-7912-DesignDoc-v2.pdf, HBaseBackupRestore-Jira-7912-v4.pdf, HBaseBackupRestore-Jira-7912-v5.pdf, HBase_BackupRestore-Jira-7912-CLI-v1.pdf Finally, we completed the implementation of our backup/restore solution, and would like to share it with the community through this jira. We are leveraging the existing hbase snapshot feature, and provide a general solution to common users. Our full backup uses snapshots to capture metadata locally and exportsnapshot to move data to another cluster; the incremental backup uses offline-WALPlayer to back up HLogs; we also leverage globally distributed log roll and flush to improve performance; other add-on features include convert, merge, progress report, and CLI commands. So that a common user can back up hbase data without in-depth knowledge of hbase. Our solution also contains some usability features for enterprise users. The detailed design document and CLI commands will be attached in this jira. 
We plan to use 10~12 subtasks to share each of the following features, and document the detailed implementation in the subtasks: * *Full Backup* : provide local and remote backup/restore for a list of tables * *offline-WALPlayer* to convert HLogs to HFiles offline (for incremental backup) * *distributed* Logroll and distributed flush * Backup *Manifest* and history * *Incremental* backup: to build on top of full backup as daily/weekly backup * *Convert* incremental backup WAL files into HFiles * *Merge* several backup images into one (like merging weekly into monthly) * *add and remove* tables to and from a Backup image * *Cancel* a backup process * backup progress *status* * full backup based on an *existing snapshot* *-* *Below is the original description, kept here as the history of the design and discussion back in 2013* There have been attempts in the past to come up with a viable HBase backup/restore solution (e.g., HBASE-4618). Recently, there have been many advancements and new features in HBase, for example, FileLink, Snapshot, and Distributed Barrier Procedure. This is a proposal for a backup/restore solution that utilizes these new features to achieve better performance and consistency. A common practice of backup and restore in databases is to first take a full baseline backup, and then periodically take incremental backups that capture the changes since the full baseline backup. An HBase cluster can store massive amounts of data. Combining full backups with incremental backups has tremendous benefits for HBase as well. The following is a typical scenario for full and incremental backup. # The user takes a full backup of a table or a set of tables in HBase. # The user schedules periodical incremental backups to capture the changes from the full backup, or from the last incremental backup. # The user needs to restore table data to a past point in time. # The full backup is restored to the table(s) or to different table name(s). 
Then the incremental backups that are up to the desired point in time are applied on top of the full backup. We would support the following key features and capabilities. * Full backup uses HBase snapshots to capture HFiles. * Use HBase WALs to capture incremental changes, but use bulk load of HFiles for fast incremental restore. * Support single table or a set of tables, and column family level backup and restore. * Restore to different table names. * Support adding additional tables or CFs to the backup set without interruption of the incremental backup schedule. * Support rollup/combining of incremental backups into longer-period, bigger incremental backups. * Unified command line interface for all the above. The solution will support HBase
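The read-path deduplication Vladimir mentions above (skipping, during restore of an incremental backup, any edit whose sequence Id the full backup already covers) can be pictured with a small standalone sketch. This is hypothetical illustration code, not the StoreScanner change itself; the method and class names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of read-path dedup: edits whose sequence id (mvcc number)
// falls at or below the full backup's boundary already exist in the full image,
// so a scanner replaying the incremental backup can skip them.
public class SeqIdFilterSketch {

    // Keep only edits strictly newer than the full backup's max sequence id.
    static List<Long> filterDuplicates(List<Long> incrementalSeqIds, long fullBackupMaxSeqId) {
        List<Long> result = new ArrayList<>();
        for (long seqId : incrementalSeqIds) {
            if (seqId > fullBackupMaxSeqId) {
                result.add(seqId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Full backup captured everything up to seq id 100; the incremental
        // re-contains 90 and 100 plus genuinely new edits 101 and 250.
        List<Long> incr = List.of(90L, 100L, 101L, 250L);
        System.out.println(filterDuplicates(incr, 100L)); // prints [101, 250]
    }
}
```

The point of the comment thread is that this boundary check is a few lines in the read path, whereas enforcing it at backup time would require backup/restore up to an exact sequence Id.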
[jira] [Commented] (HBASE-12596) bulkload needs to follow locality
[ https://issues.apache.org/jira/browse/HBASE-12596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615452#comment-14615452 ] Hadoop QA commented on HBASE-12596: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12743625/HBASE-12596-master-v3.patch against master branch at commit cff1a5f1f5cc8f5e1e99c6aecb39e2e69f86bf7e. ATTACHMENT ID: 12743625 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified tests. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0) {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:red}-1 checkstyle{color}. The applied patch generated 1899 checkstyle errors (more than the master's current 1898 errors). {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn post-site goal succeeds with this patch. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.rest.TestTableResource org.apache.hadoop.hbase.rest.TestTableScan org.apache.hadoop.hbase.rest.TestScannerResource org.apache.hadoop.hbase.rest.TestDeleteRow {color:red}-1 core zombie tests{color}. 
There are 5 zombie test(s): Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14677//testReport/ Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14677//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14677//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14677//console This message is automatically generated. bulkload needs to follow locality - Key: HBASE-12596 URL: https://issues.apache.org/jira/browse/HBASE-12596 Project: HBase Issue Type: Improvement Components: HFile, regionserver Affects Versions: 0.98.8 Environment: hadoop-2.3.0, hbase-0.98.8, jdk1.7 Reporter: Victor Xu Assignee: Victor Xu Fix For: 0.98.14 Attachments: HBASE-12596-0.98-v1.patch, HBASE-12596-0.98-v2.patch, HBASE-12596-0.98-v3.patch, HBASE-12596-master-v1.patch, HBASE-12596-master-v2.patch, HBASE-12596-master-v3.patch, HBASE-12596.patch Normally, we have 2 steps to perform a bulkload: 1. Use a job to write the HFiles to be loaded; 2. Move these HFiles to the right hdfs directory. However, locality could be lost during the first step. Why not just write the HFiles directly into the right place? We can do this easily because StoreFile.WriterBuilder has the withFavoredNodes method, and we just need to call it in HFileOutputFormat's getNewWriter(). This feature is enabled by default, and we can use 'hbase.bulkload.locality.sensitive.enabled=false' to disable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
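The locality idea in HBASE-12596 boils down to one lookup: when the bulkload job opens a writer for a row range, find which region server hosts that range and pass it as a favored node, so HFile blocks land local to the server that will serve them. The sketch below is not the real HBase API (no StoreFile.WriterBuilder here); it only illustrates the region-location lookup with invented names.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal sketch of the favored-node lookup a locality-sensitive bulkload
// performs before opening a writer. Region locations are modeled as a sorted
// map of region start key -> hosting region server.
public class FavoredNodeSketch {

    // Stand-in for the region location lookup HFileOutputFormat would do:
    // the region owning rowKey is the one with the greatest start key <= rowKey.
    static String favoredNodeFor(NavigableMap<String, String> regionLocations, String rowKey) {
        Map.Entry<String, String> region = regionLocations.floorEntry(rowKey);
        return region == null ? null : region.getValue();
    }

    public static void main(String[] args) {
        NavigableMap<String, String> locations = new TreeMap<>();
        locations.put("", "rs1.example.com");   // region [ "", "m" )
        locations.put("m", "rs2.example.com");  // region [ "m", +inf )
        System.out.println(favoredNodeFor(locations, "apple")); // rs1.example.com
        System.out.println(favoredNodeFor(locations, "zebra")); // rs2.example.com
    }
}
```

In the actual patch, the hostname found this way would be handed to the writer builder's withFavoredNodes call so HDFS places a block replica on that node.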
[jira] [Commented] (HBASE-13965) Stochastic Load Balancer JMX Metrics
[ https://issues.apache.org/jira/browse/HBASE-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615518#comment-14615518 ] Lei Chen commented on HBASE-13965: -- Thanks for your review and great feedback. I will upload an updated patch. Stochastic Load Balancer JMX Metrics Key: HBASE-13965 URL: https://issues.apache.org/jira/browse/HBASE-13965 Project: HBase Issue Type: Improvement Components: Balancer, metrics Reporter: Lei Chen Assignee: Lei Chen Attachments: HBASE-13965-v3.patch, HBASE-13965_v2.patch, HBase-13965-v1.patch, stochasticloadbalancerclasses_v2.png Today’s default HBase load balancer (the Stochastic load balancer) is cost function based. The cost function weights are tunable but no visibility into those cost function results is directly provided. A driving example is a cluster we have been tuning which has skewed rack size (one rack has half the nodes of the other few racks). We are tuning the cluster for uniform response time from all region servers with the ability to tolerate a rack failure. Balancing LocalityCost, RegionReplicaRack Cost and RegionCountSkew Cost is difficult without a way to attribute each cost function’s contribution to overall cost. What this jira proposes is to provide visibility via JMX into each cost function of the stochastic load balancer, as well as the overall cost of the balancing plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615437#comment-14615437 ] Stephen Yuan Jiang commented on HBASE-14017: I tried the branch-1 patch locally and had no problem applying it. Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue deletion where we don't have an exclusive lock before deleting the table {noformat} Thread 1: Create table is running - the queue is empty and wlock is false Thread 2: markTableAsDeleted sees the queue empty and wlock= false Thread 1: tryWrite() set wlock=true; too late Thread 2: delete the queue Thread 1: never able to release the lock - NPE when trying to get the queue {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
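The race in the {noformat} block above exists because the emptiness check in markTableAsDeleted and the lock acquisition in tryWrite are not decided under one monitor. A minimal standalone sketch (hypothetical code, not the actual MasterProcedureQueue) of one way to close it:

```java
// Sketch of the table-queue deletion race fix: both the exclusive-lock attempt
// and the delete decision run under the same monitor, so a deleter can never
// observe "queue empty, wlock false" in between another thread's check and
// its lock acquisition.
public class TableQueueSketch {
    private boolean wlock = false;   // exclusive write lock flag
    private int queueSize = 0;       // pending procedures for this table
    private boolean deleted = false; // queue removed from the master map

    synchronized boolean tryExclusiveLock() {
        if (wlock || deleted) return false;
        wlock = true;
        return true;
    }

    synchronized void releaseExclusiveLock() {
        wlock = false;
    }

    // Safe only because the emptiness+lock check and the delete are atomic.
    synchronized boolean markTableAsDeleted() {
        if (queueSize == 0 && !wlock) {
            deleted = true;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TableQueueSketch q = new TableQueueSketch();
        System.out.println(q.tryExclusiveLock());   // true: creator holds the lock
        System.out.println(q.markTableAsDeleted()); // false: cannot delete while locked
        q.releaseExclusiveLock();
        System.out.println(q.markTableAsDeleted()); // true: now safe to delete
    }
}
```

With this shape, Thread 2 from the trace either sees wlock already true and backs off, or deletes before Thread 1 can lock, in which case Thread 1's tryExclusiveLock fails cleanly instead of NPE-ing on a vanished queue.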
[jira] [Commented] (HBASE-13965) Stochastic Load Balancer JMX Metrics
[ https://issues.apache.org/jira/browse/HBASE-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615467#comment-14615467 ] Hadoop QA commented on HBASE-13965: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12743741/HBASE-13965-v3.patch against master branch at commit cff1a5f1f5cc8f5e1e99c6aecb39e2e69f86bf7e. ATTACHMENT ID: 12743741 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0) {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn post-site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14678//testReport/ Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14678//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14678//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14678//console This message is automatically generated. Stochastic Load Balancer JMX Metrics Key: HBASE-13965 URL: https://issues.apache.org/jira/browse/HBASE-13965 Project: HBase Issue Type: Improvement Components: Balancer, metrics Reporter: Lei Chen Assignee: Lei Chen Attachments: HBASE-13965-v3.patch, HBASE-13965_v2.patch, HBase-13965-v1.patch, stochasticloadbalancerclasses_v2.png Today’s default HBase load balancer (the Stochastic load balancer) is cost function based. The cost function weights are tunable but no visibility into those cost function results is directly provided. A driving example is a cluster we have been tuning which has skewed rack size (one rack has half the nodes of the other few racks). We are tuning the cluster for uniform response time from all region servers with the ability to tolerate a rack failure. Balancing LocalityCost, RegionReplicaRack Cost and RegionCountSkew Cost is difficult without a way to attribute each cost function’s contribution to overall cost. What this jira proposes is to provide visibility via JMX into each cost function of the stochastic load balancer, as well as the overall cost of the balancing plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13646) HRegion#execService should not try to build incomplete messages
[ https://issues.apache.org/jira/browse/HBASE-13646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615468#comment-14615468 ] Hudson commented on HBASE-13646: FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #1000 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/1000/]) HBASE-13646 HRegion#execService should not try to build incomplete messages (busbey: rev 2353c1bcf7debcc90e1f6d47787404c42a9d53b0) * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorEndpoint.java * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestCoprocessorEndpoint.java * hbase-server/pom.xml * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/protobuf/generated/DummyRegionServerEndpointProtos.java * hbase-server/src/test/protobuf/DummyRegionServerEndpoint.proto * hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ResponseConverter.java HRegion#execService should not try to build incomplete messages --- Key: HBASE-13646 URL: https://issues.apache.org/jira/browse/HBASE-13646 Project: HBase Issue Type: Bug Components: Coprocessors, regionserver Affects Versions: 2.0.0, 1.2.0, 1.1.1 Reporter: Andrey Stepachev Assignee: Andrey Stepachev Fix For: 2.0.0, 0.98.14, 1.2.0 Attachments: HBASE-13646-branch-1.patch, HBASE-13646.patch, HBASE-13646.v2.patch, HBASE-13646.v2.patch If an RPC service called on a region throws an exception, execService still tries to build a Message. In case of complex messages with required fields, this complicates service code because the service needs to pass fake protobuf objects so that they can be barely buildable. To mitigate that, I propose to check whether the controller failed and return null from the call instead of failing with an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
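The fix direction in HBASE-13646 (short-circuit to null when the controller was failed, instead of forcing a half-built protobuf message) can be sketched without the real HBase/protobuf classes. Everything below is stand-in code with invented names; the real change touches HRegion#execService and the RpcController plumbing.

```java
// Hypothetical sketch: endpoint dispatch checks whether the service failed
// the controller, and returns null rather than building an incomplete message.
public class ExecServiceSketch {

    // Stand-in for com.google.protobuf.RpcController's failure flag.
    static class Controller {
        private String errorText;
        void setFailed(String reason) { this.errorText = reason; }
        boolean failed() { return errorText != null; }
        String errorText() { return errorText; }
    }

    interface Service {
        String call(Controller controller);
    }

    // Before the fix: the caller had to hand back a buildable response even on
    // error, forcing fake values into required fields. After: a failed
    // controller short-circuits to null and no fake message is needed.
    static String execService(Controller controller, Service service) {
        String result = service.call(controller);
        if (controller.failed()) {
            return null;
        }
        return result;
    }

    public static void main(String[] args) {
        Controller c = new Controller();
        String resp = execService(c, ctrl -> {
            ctrl.setFailed("table not found");
            return "partial-message"; // previously had to be "barely buildable"
        });
        System.out.println(resp);          // null
        System.out.println(c.errorText()); // table not found
    }
}
```

The caller then reports the error from the controller rather than from the (absent) response object.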
[jira] [Commented] (HBASE-13879) Add hbase.hstore.compactionThreshold to HConstants
[ https://issues.apache.org/jira/browse/HBASE-13879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615444#comment-14615444 ] Anoop Sam John commented on HBASE-13879: CompactionConfiguration.java seems the best place for this config constant, no? Why do we need the change? I would say it is better not to go with this jira Add hbase.hstore.compactionThreshold to HConstants -- Key: HBASE-13879 URL: https://issues.apache.org/jira/browse/HBASE-13879 Project: HBase Issue Type: Improvement Reporter: Gabor Liptak Priority: Minor Attachments: HBASE-13879.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13965) Stochastic Load Balancer JMX Metrics
[ https://issues.apache.org/jira/browse/HBASE-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei Chen updated HBASE-13965: - Attachment: HBASE-13965-v4.patch Updates: 1. report - reports 2. costFunctionDesc added to JMX 3. Unnecessary table name length check is removed. 4. lastSubcosts - lastSubCosts 5. total += this.lastSubCosts[i]; TODO: 1. Make hard-coded map size configurable? Stochastic Load Balancer JMX Metrics Key: HBASE-13965 URL: https://issues.apache.org/jira/browse/HBASE-13965 Project: HBase Issue Type: Improvement Components: Balancer, metrics Reporter: Lei Chen Assignee: Lei Chen Attachments: HBASE-13965-v3.patch, HBASE-13965-v4.patch, HBASE-13965_v2.patch, HBase-13965-v1.patch, stochasticloadbalancerclasses_v2.png Today’s default HBase load balancer (the Stochastic load balancer) is cost function based. The cost function weights are tunable but no visibility into those cost function results is directly provided. A driving example is a cluster we have been tuning which has skewed rack size (one rack has half the nodes of the other few racks). We are tuning the cluster for uniform response time from all region servers with the ability to tolerate a rack failure. Balancing LocalityCost, RegionReplicaRack Cost and RegionCountSkew Cost is difficult without a way to attribute each cost function’s contribution to overall cost. What this jira proposes is to provide visibility via JMX into each cost function of the stochastic load balancer, as well as the overall cost of the balancing plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13965) Stochastic Load Balancer JMX Metrics
[ https://issues.apache.org/jira/browse/HBASE-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615805#comment-14615805 ] Ted Yu commented on HBASE-13965: {code} costs.put(costFunctionName, descAndValue); stochasticCosts.put(tableName, costs); {code} For any specific cost function, its description should be fixed. In the above model, the description for the same cost function is stored once per table. Can we reduce the number of times the description for the same function is stored? Stochastic Load Balancer JMX Metrics Key: HBASE-13965 URL: https://issues.apache.org/jira/browse/HBASE-13965 Project: HBase Issue Type: Improvement Components: Balancer, metrics Reporter: Lei Chen Assignee: Lei Chen Attachments: HBASE-13965-v3.patch, HBASE-13965-v4.patch, HBASE-13965_v2.patch, HBase-13965-v1.patch, stochasticloadbalancerclasses_v2.png Today’s default HBase load balancer (the Stochastic load balancer) is cost function based. The cost function weights are tunable but no visibility into those cost function results is directly provided. A driving example is a cluster we have been tuning which has skewed rack size (one rack has half the nodes of the other few racks). We are tuning the cluster for uniform response time from all region servers with the ability to tolerate a rack failure. Balancing LocalityCost, RegionReplicaRack Cost and RegionCountSkew Cost is difficult without a way to attribute each cost function’s contribution to overall cost. What this jira proposes is to provide visibility via JMX into each cost function of the stochastic load balancer, as well as the overall cost of the balancing plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
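Ted's suggestion above amounts to normalizing the data model: hold one global costFunctionName-to-description map, and let the per-table map carry only the numeric cost. A minimal sketch (hypothetical class and method names, not the patch code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of storing each cost function's fixed description exactly once,
// instead of once per table alongside every cost value.
public class BalancerMetricsSketch {
    // Stored once, regardless of how many tables report costs.
    private final Map<String, String> descriptions = new HashMap<>();
    // tableName -> (costFunctionName -> cost value only)
    private final Map<String, Map<String, Double>> stochasticCosts = new HashMap<>();

    void report(String table, String costFunction, String description, double cost) {
        descriptions.putIfAbsent(costFunction, description);
        stochasticCosts.computeIfAbsent(table, t -> new HashMap<>()).put(costFunction, cost);
    }

    int descriptionCount() { return descriptions.size(); }

    Double costOf(String table, String costFunction) {
        Map<String, Double> costs = stochasticCosts.get(table);
        return costs == null ? null : costs.get(costFunction);
    }

    public static void main(String[] args) {
        BalancerMetricsSketch m = new BalancerMetricsSketch();
        m.report("t1", "LocalityCost", "data locality of regions", 0.42);
        m.report("t2", "LocalityCost", "data locality of regions", 0.17);
        // Two tables reported, but the description is held only once.
        System.out.println(m.descriptionCount()); // 1
    }
}
```

The JMX bean would then expose descriptions from the single shared map and values from the per-table one.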
[jira] [Updated] (HBASE-13330) Region left unassigned due to AM SSH each thinking the assignment would be done by the other
[ https://issues.apache.org/jira/browse/HBASE-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated HBASE-13330: -- Fix Version/s: (was: 1.0.2) 1.0.3 Region left unassigned due to AM SSH each thinking the assignment would be done by the other -- Key: HBASE-13330 URL: https://issues.apache.org/jira/browse/HBASE-13330 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 1.0.0 Reporter: Devaraj Das Assignee: Devaraj Das Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1, 1.0.3 Attachments: 13330-branch-1.txt Here is what I found during analysis of an issue. Raising this jira and a fix will follow. The TL;DR of this is that the AssignmentManager thinks the ServerShutdownHandler would assign the region and the ServerShutdownHandler thinks that the AssignmentManager would assign the region. The region (0d6cf37c18c54c6f4744750c6a7be837) ultimately never gets assigned. Below is an analysis from the logs that captures the flow of events. 1. The AssignmentManager had initially assigned this region to dnj1-bcpc-r3n8.example.com,60020,1425598187703 2. When the master restarted it did a scan of the meta to learn about the regions in the cluster. It found this region being assigned to dnj1-bcpc-r3n8.example.com,60020,1425598187703 from the meta record. 3. However, this server (dnj1-bcpc-r3n8.example.com,60020,1425598187703) was not alive anymore. So, the AssignmentManager queued up a ServerShutdownHandling task for this (that asynchronously executes): {noformat} 2015-03-06 14:09:31,355 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=dnj1-bcpc-r3n8.example.com,60020,1425598187703 to dead servers, submitted shutdown handler to be executed meta=false {noformat} 4. The AssignmentManager proceeded to read the RIT nodes from ZK. 
It found this region as well: {noformat} 2015-03-06 14:09:31,527 INFO org.apache.hadoop.hbase.master.AssignmentManager: Processing 0d6cf37c18c54c6f4744750c6a7be837 in state: RS_ZK_REGION_FAILED_OPEN {noformat} 5. The region was moved to CLOSED state: {noformat} 2015-03-06 14:09:31,527 WARN org.apache.hadoop.hbase.master.RegionStates: 0d6cf37c18c54c6f4744750c6a7be837 moved to CLOSED on dnj1-bcpc-r3n2.example.com,60020,1425603618259, expected dnj1-bcpc-r3n8.example.com,60020,1425598187703 {noformat} Note the reference to dnj1-bcpc-r3n2.example.com,60020,1425603618259. This means that the region was assigned to dnj1-bcpc-r3n2.example.com,60020,1425603618259 but that regionserver couldn't open the region for some reason, and it changed the state to RS_ZK_REGION_FAILED_OPEN in RIT znode on ZK. 6. After that the AssignmentManager tried to assign it again. However, the assignment didn't happen because the ServerShutdownHandling task queued earlier didn't yet execute: {noformat} 2015-03-06 14:09:31,527 INFO org.apache.hadoop.hbase.master.AssignmentManager: Skip assigning phMonthlyVersion,\x89\x80\x00\x00,1423149098980.0d6cf37c18c54c6f4744750c6a7be837., it's host dnj1-bcpc-r3n8.example.com,60020,1425598187703 is dead but not processed yet {noformat} 7. Eventually the ServerShutdownHandling task executed. {noformat} 2015-03-06 14:09:35,188 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for dnj1-bcpc-r3n8.example.com,60020,1425598187703 before assignment. 2015-03-06 14:09:35,209 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 19 region(s) that dnj1-bcpc-r3n8.example.com,60020,1425598187703 was carrying (and 0 regions(s) that were opening on this server) 2015-03-06 14:09:35,211 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of dnj1-bcpc-r3n8.example.com,60020,1425598187703 {noformat} 8. However, the ServerShutdownHandling task skipped the region in question. 
This was because this region was in RIT, and the ServerShutdownHandling task thinks that the AssignmentManager would assign it as part of handling the RIT nodes: {noformat} 2015-03-06 14:09:35,210 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Skip assigning region in transition on other server{0d6cf37c18c54c6f4744750c6a7be837 state=CLOSED, ts=1425668971527, server=dnj1-bcpc-r3n2.example.com,60020,1425603618259} {noformat} 9. At some point in the future, when the server dnj1-bcpc-r3n2.example.com,60020,1425603618259 dies, the ServerShutdownHandling for it gets queued up (from the log hbase-hbase-master-dnj1-bcpc-r3n1.log): {noformat} 2015-03-09 11:35:10,607 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615864#comment-14615864 ] Hudson commented on HBASE-14017: SUCCESS: Integrated in HBase-1.3 #38 (See [https://builds.apache.org/job/HBase-1.3/38/]) HBASE-14017 Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion (busbey: rev 80b0a3e914c8f7b2600de93a27cc5d050d36ebf7) * hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterProcedureQueue.java * hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureQueue.java Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue deletion where we don't have an exclusive lock before deleting the table {noformat} Thread 1: Create table is running - the queue is empty and wlock is false Thread 2: markTableAsDeleted sees the queue empty and wlock= false Thread 1: tryWrite() set wlock=true; too late Thread 2: delete the queue Thread 1: never able to release the lock - NPE when trying to get the queue {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615925#comment-14615925 ] Hudson commented on HBASE-14017: SUCCESS: Integrated in HBase-1.1 #574 (See [https://builds.apache.org/job/HBase-1.1/574/]) HBASE-14017 Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion (busbey: rev 38014398eda48891529258111b0f8c1491a0e9fa) * hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureQueue.java * hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterProcedureQueue.java Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue deletion where we don't have an exclusive lock before deleting the table {noformat} Thread 1: Create table is running - the queue is empty and wlock is false Thread 2: markTableAsDeleted sees the queue empty and wlock= false Thread 1: tryWrite() set wlock=true; too late Thread 2: delete the queue Thread 1: never able to release the lock - NPE when trying to get the queue {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13981) Fix ImportTsv spelling and usage issues
[ https://issues.apache.org/jira/browse/HBASE-13981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615940#comment-14615940 ] Apekshit Sharma commented on HBASE-13981: - [~gliptak] not sure why ATTRIBUTE_SEPERATOR_CONF_KEY or "attributes.seperator" are there. Since ImportTsv is @InterfaceAudience.Public, we cannot simply delete it but can mark it @Deprecated. Patch looks good. Just two last things. 1. {code} - public final static String ATTRIBUTE_SEPERATOR_CONF_KEY = "attributes.seperator"; + public final static String ATTRIBUTE_SEPARATOR_CONF_KEY = "attributes.separator"; ... - final static String DEFAULT_ATTRIBUTES_SEPERATOR = "="; - final static String DEFAULT_MULTIPLE_ATTRIBUTES_SEPERATOR = ","; + final static String DEFAULT_ATTRIBUTES_SEPARATOR = "="; + final static String DEFAULT_MULTIPLE_ATTRIBUTES_SEPARATOR = ","; {code} Again, since ImportTsv is @InterfaceAudience.Public (read more about this [here|http://hbase.apache.org/book.html#hbase.client.api.surface]), we cannot simply change the name. The right thing to do here would be {code} + @Deprecated public final static String ATTRIBUTE_SEPERATOR_CONF_KEY = "attributes.seperator"; + public final static String ATTRIBUTE_SEPARATOR_CONF_KEY = "attributes.separator"; {code} and replacing all uses of ATTRIBUTE_SEPERATOR_CONF_KEY with ATTRIBUTE_SEPARATOR_CONF_KEY. Later in 2.0, the deprecated constant will be deleted. 2. Readability: there is no need for indentation here. {code} + "The column names of the TSV data must be specified using the option:\n" + + "-D" + COLUMNS_CONF_KEY + " option. This option takes the form of" + + " comma-separated column names, where each column name is either" + {code} Please align like this {code} + "The column names of the TSV data must be specified using the option:\n" + + "-D" + COLUMNS_CONF_KEY + " option. This option takes the form of" + + " comma-separated column names, where each column name is either" + {code} here and everywhere else. 
Fix ImportTsv spelling and usage issues --- Key: HBASE-13981 URL: https://issues.apache.org/jira/browse/HBASE-13981 Project: HBase Issue Type: Bug Components: mapreduce Affects Versions: 1.1.0.1 Reporter: Lars George Assignee: Gabor Liptak Labels: beginner Fix For: 2.0.0, 1.3.0 Attachments: HBASE-13981.1.patch, HBASE-13981.2.patch, HBASE-13981.3.patch The {{ImportTsv}} tool has various spelling and formatting issues. Fix those. In code: {noformat} public final static String ATTRIBUTE_SEPERATOR_CONF_KEY = attributes.seperator; {noformat} It is separator. In usage text: {noformat} input data. Another special columnHBASE_TS_KEY designates that this column should be {noformat} Space missing. {noformat} Record with invalid timestamps (blank, non-numeric) will be treated as bad record. {noformat} Records ... as bad records - plural missing twice. {noformat} HBASE_ATTRIBUTES_KEY can be used to specify Operation Attributes per record. Should be specified as key=value where -1 is used as the seperator. Note that more than one OperationAttributes can be specified. {noformat} - Remove line wraps and indentation. - Fix separator. - Fix wrong separator being output, it is not -1 (wrong constant use in code) - General wording/style could be better (eg. last sentence now uses OperationAttributes without a space). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
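The compatibility pattern Apekshit asks for in the comment above can be shown as a tiny standalone class. This is a sketch following the discussion, not the ImportTsv source: the misspelled public constant survives with its old value as a deprecated alias beside the corrected one, so existing callers keep compiling until the deprecated name is removed in 2.0.

```java
// Sketch of the @Deprecated-alias pattern for renaming a public API constant.
public class ImportTsvConstantsSketch {

    /** @deprecated Misspelled key kept only for API compatibility; slated for removal in 2.0. */
    @Deprecated
    public static final String ATTRIBUTE_SEPERATOR_CONF_KEY = "attributes.seperator";

    /** Correctly spelled replacement; all internal uses switch to this. */
    public static final String ATTRIBUTE_SEPARATOR_CONF_KEY = "attributes.separator";

    public static void main(String[] args) {
        // New code reads the corrected key; old callers referencing the
        // misspelled constant still compile, with a deprecation warning.
        System.out.println(ATTRIBUTE_SEPARATOR_CONF_KEY); // attributes.separator
    }
}
```

Note that, as in the snippet quoted in the comment, the deprecated constant keeps its old value; whether the code should honor both configuration keys during the deprecation window is a separate decision for the patch.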
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615964#comment-14615964 ] Hudson commented on HBASE-14017: SUCCESS: Integrated in HBase-1.2 #54 (See [https://builds.apache.org/job/HBase-1.2/54/]) HBASE-14017 Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion (busbey: rev 8e65b9f86d63b61177170658f0e1a86ef7b2d51f) * hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureQueue.java * hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterProcedureQueue.java Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrency issue in the procedure queue deletion where we don't have an exclusive lock before deleting the table {noformat} Thread 1: Create table is running - the queue is empty and wlock is false Thread 2: markTableAsDeleted sees the queue empty and wlock= false Thread 1: tryWrite() set wlock=true; too late Thread 2: delete the queue Thread 1: never able to release the lock - NPE when trying to get the queue {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13965) Stochastic Load Balancer JMX Metrics
[ https://issues.apache.org/jira/browse/HBASE-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615800#comment-14615800 ] Hadoop QA commented on HBASE-13965: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12743774/HBASE-13965-v4.patch against master branch at commit 608c3aa15c34b9014f99e857b374645db58cbbe3. ATTACHMENT ID: 12743774 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0) {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn post-site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14680//testReport/ Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14680//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14680//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14680//console This message is automatically generated. Stochastic Load Balancer JMX Metrics Key: HBASE-13965 URL: https://issues.apache.org/jira/browse/HBASE-13965 Project: HBase Issue Type: Improvement Components: Balancer, metrics Reporter: Lei Chen Assignee: Lei Chen Attachments: HBASE-13965-v3.patch, HBASE-13965-v4.patch, HBASE-13965_v2.patch, HBase-13965-v1.patch, stochasticloadbalancerclasses_v2.png Today’s default HBase load balancer (the Stochastic load balancer) is cost function based. The cost function weights are tunable but no visibility into those cost function results is directly provided. A driving example is a cluster we have been tuning which has skewed rack size (one rack has half the nodes of the other few racks). We are tuning the cluster for uniform response time from all region servers with the ability to tolerate a rack failure. Balancing LocalityCost, RegionReplicaRack Cost and RegionCountSkew Cost is difficult without a way to attribute each cost function’s contribution to overall cost. What this jira proposes is to provide visibility via JMX into each cost function of the stochastic load balancer, as well as the overall cost of the balancing plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13991) Hierarchical Layout for Humongous Tables
[ https://issues.apache.org/jira/browse/HBASE-13991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615813#comment-14615813 ] Ben Lau commented on HBASE-13991: - Hi guys, hope you had a happy 4th of July. We would like to do something akin to Lars’ last idea. That is, we will have code to support both the old layout and the new layout, but it will be on a per HBase cluster basis. You will be able to migrate a cluster entirely to the hierarchical layout or leave it on the old layout. This approach has the following pros: - If HBase users do not need/want the new layout, they will not have to do an offline upgrade in order to use new HBase code. The alternative is to make an online upgrade for the hierarchical layout, but this would require some very messy changes to the codebase and also be tricky to test fully. - HBase code will not have to ‘detect’ whether tables/paths/regions are hierarchical or not. The master or region server can simply look at the root table at startup and use that to determine if the cluster has migrated to the hierarchical layout. This single source of truth would make code less ugly since you don’t need to do in-context per-region/path checks in different parts of the codebase. What do you guys think about this approach? Hierarchical Layout for Humongous Tables Key: HBASE-13991 URL: https://issues.apache.org/jira/browse/HBASE-13991 Project: HBase Issue Type: Sub-task Reporter: Ben Lau Assignee: Ben Lau Attachments: HBASE-13991-master.patch, HumongousTableDoc.pdf Add support for humongous tables via a hierarchical layout for regions on filesystem. Credit for most of this code goes to Huaiyu Zhu. Latest version of the patch is available on the review board: https://reviews.apache.org/r/36029/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
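To make the hierarchical-layout idea concrete: one common scheme (an assumption for this sketch — the actual layout is specified in the attached design doc and review-board patch) is to bucket region directories under short prefixes of the encoded region name, turning a single flat table directory into a shallow tree.

```java
public class HierarchicalPathSketch {
    // Hypothetical scheme: bucket by two 2-character prefixes of the
    // encoded region name. The bucket width is an assumption, not the
    // layout from the patch.
    static String hierarchicalPath(String tableDir, String encodedRegionName) {
        String b1 = encodedRegionName.substring(0, 2);
        String b2 = encodedRegionName.substring(2, 4);
        return tableDir + "/" + b1 + "/" + b2 + "/" + encodedRegionName;
    }

    public static void main(String[] args) {
        // Flat layout would be /hbase/data/default/t1/abcd1234; hierarchical:
        System.out.println(hierarchicalPath("/hbase/data/default/t1", "abcd1234"));
    }
}
```

The per-cluster flag discussed above would decide, once at startup, whether path resolution uses the flat or the hierarchical mapping.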
[jira] [Commented] (HBASE-14028) DistributedLogReplay drops edits when ITBLL 125M
[ https://issues.apache.org/jira/browse/HBASE-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615868#comment-14615868 ] Vladimir Rodionov commented on HBASE-14028: --- This -recovery-from-failure-during-recovery-from-failure thing looks quite complicated to me. I am working on HBASE-7912 and one of the improvements which is on the list is WALPlayer into HFiles followed by a bulk load. Pounding HBase with millions of puts is not the right approach. DistributedLogReplay drops edits when ITBLL 125M Key: HBASE-14028 URL: https://issues.apache.org/jira/browse/HBASE-14028 Project: HBase Issue Type: Bug Components: Recovery Affects Versions: 1.2.0 Reporter: stack Testing DLR before 1.2.0RC gets cut, we are dropping edits. Issue seems to be around replay into a deployed region that is on a server that dies before all edits have finished replaying. Logging is sparse on sequenceid accounting so can't tell for sure how it is happening (and if our now accounting by Store is messing up DLR). Digging. I notice also that DLR does not refresh its cache of region location on error -- it just keeps trying till whole WAL fails 8 retries...about 30 seconds. We could do a bit of refactor and have the replay find region in new location if moved during DLR replay. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13890) Get/Scan from MemStore only (Client API)
[ https://issues.apache.org/jira/browse/HBASE-13890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Rodionov updated HBASE-13890: -- Attachment: HBASE-13890-v1.patch First cut. Get/Scan from MemStore only (Client API) Key: HBASE-13890 URL: https://issues.apache.org/jira/browse/HBASE-13890 Project: HBase Issue Type: New Feature Components: API, Client, Scanners Reporter: Vladimir Rodionov Assignee: Vladimir Rodionov Attachments: HBASE-13890-v1.patch This is short-circuit read for get/scan when recent data (version) of a cell can be found only in MemStore (with very high probability). Good examples are: Atomic counters and appends. This feature will allow to bypass completely store file scanners and improve performance and latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13890) Get/Scan from MemStore only (Client API)
[ https://issues.apache.org/jira/browse/HBASE-13890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Rodionov updated HBASE-13890: -- Status: Patch Available (was: Open) Get/Scan from MemStore only (Client API) Key: HBASE-13890 URL: https://issues.apache.org/jira/browse/HBASE-13890 Project: HBase Issue Type: New Feature Components: API, Client, Scanners Reporter: Vladimir Rodionov Assignee: Vladimir Rodionov Attachments: HBASE-13890-v1.patch This is short-circuit read for get/scan when recent data (version) of a cell can be found only in MemStore (with very high probability). Good examples are: Atomic counters and appends. This feature will allow to bypass completely store file scanners and improve performance and latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13890) Get/Scan from MemStore only (Client API)
[ https://issues.apache.org/jira/browse/HBASE-13890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615894#comment-14615894 ] Vladimir Rodionov commented on HBASE-13890: --- The new API was added to *OperationWithAttributes*:
{code}
/**
 * This method allows you to set in-memory-only operation mode.
 * Queries (Get and Scan, as well as Increment and Append) will
 * work only on data in RAM (MemStore).
 * If data is missing in RAM, an empty Result is returned for
 * Get, Increment, Append and Scan.
 */
public OperationWithAttributes setMemstoreOnly(boolean v) {
  setAttribute(MEMSTORE_ONLY_ATTRIBUTE, Bytes.toBytes(v));
  return this;
}

/**
 * Checks if we are in in-memory-only mode
 * @return true, if yes
 */
public boolean isMemstoreOnly() {
  byte[] attr = getAttribute(MEMSTORE_ONLY_ATTRIBUTE);
  return attr != null && Bytes.toBoolean(attr);
}
{code}
Get/Scan from MemStore only (Client API) Key: HBASE-13890 URL: https://issues.apache.org/jira/browse/HBASE-13890 Project: HBase Issue Type: New Feature Components: API, Client, Scanners Reporter: Vladimir Rodionov Assignee: Vladimir Rodionov Attachments: HBASE-13890-v1.patch This is short-circuit read for get/scan when recent data (version) of a cell can be found only in MemStore (with very high probability). Good examples are: Atomic counters and appends. This feature will allow to bypass completely store file scanners and improve performance and latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)

[jira] [Commented] (HBASE-14025) Update CHANGES.txt for 1.2
[ https://issues.apache.org/jira/browse/HBASE-14025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615908#comment-14615908 ] Enis Soztutar commented on HBASE-14025: --- i think the discrepancy is due to the fact that we mark all active branches that the fix went in, for example, today we are marking an issue with 1.2.0, 1.3.0 and 2.0.0 if it was committed to branches branch-1.2, branch-1 and master. By the time, 1.3.0 release comes up, the jiras with fixVersion = 1.3.0 will include 1.3 exclusive patches + patches in 1.2.0 and earlier. [~busbey] I have noticed that you have been unmarking some of the fixVersions in jira (for example HBASE-13895). Is this for clean up for CHANGES.txt for 1.3.0 or some general cleanup? I am asking to understand whether we need to refine the process. At the time of the HBASE-13895 commit, there was already branch-1, branch-1.1, branch-1.2 and master branches. Thus, the jira used to bear {{2.0.0, 1.2.0, 1.1.2, 1.3.0}}. Your process above (correct me if I am wrong) removes 1.3.0 from fixVersions since 1.2.0 is the first release that will have that patch? How do we differentiate the fact that the issue has actually been committed after the branch-1.2 is forked, and committed to both branch-1.2 and branch-1? Update CHANGES.txt for 1.2 -- Key: HBASE-14025 URL: https://issues.apache.org/jira/browse/HBASE-14025 Project: HBase Issue Type: Sub-task Components: documentation Affects Versions: 1.2.0 Reporter: Sean Busbey Assignee: Sean Busbey Fix For: 1.2.0 Since it's more effort than I expected, making a ticket to track actually updating CHANGES.txt so that new RMs have an idea what to expect. Maybe will make doc changes if there's enough here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13352) Add hbase.import.version to Import usage.
[ https://issues.apache.org/jira/browse/HBASE-13352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615929#comment-14615929 ] Hudson commented on HBASE-13352: FAILURE: Integrated in HBase-TRUNK #6633 (See [https://builds.apache.org/job/HBase-TRUNK/6633/]) HBASE-13352 Add hbase.import.version to Import usage (Lars Hofhansl) (enis: rev c220635c7893c96db675cb2b80af6ade4a44e3d4) * hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/Import.java Add hbase.import.version to Import usage. - Key: HBASE-13352 URL: https://issues.apache.org/jira/browse/HBASE-13352 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 2.0.0, 0.98.14, 1.0.2, 1.2.0, 1.1.2, 1.3.0 Attachments: 13352-v2.txt, 13352.txt, hbase-13352_v3.patch We just tried to export some (small amount of) data out of an 0.94 cluster to 0.98 cluster. We used Export/Import for that. By default we found that the import M/R job correctly reports the number of records seen, but _silently_ does not import anything. After looking at the 0.98 it's obvious there's an hbase.import.version (-Dhbase.import.version=0.94) to make this work. Two issues: # -Dhbase.import.version=0.94 should be show with the the Import.usage # If not given it should not just silently not import anything In this issue I'll just a trivially add this option to the Import tool's usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13988) Add exception handler for lease thread
[ https://issues.apache.org/jira/browse/HBASE-13988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615944#comment-14615944 ] stack commented on HBASE-13988: --- No other comments. Just a suggestion. I like the way the patch makes use of the existing exception handling mechanism. Add exception handler for lease thread -- Key: HBASE-13988 URL: https://issues.apache.org/jira/browse/HBASE-13988 Project: HBase Issue Type: Bug Affects Versions: 2.0.0 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0, 1.0.2, 1.1.2, 0.98.15 Attachments: HBASE-13988-v001.diff In a prod cluster, a region server exited for some important threads were not alive. After excluding other threads from the log, we doubted the lease thread was the root. So we need to add an exception handler to the lease thread to debug why it exited in future. {quote} 2015-06-29,12:46:09,222 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: One or more threads are no longer alive -- stop 2015-06-29,12:46:09,223 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 21600 ... 2015-06-29,12:46:09,330 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting. 2015-06-29,12:46:09,330 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Thread-37 exiting 2015-06-29,12:46:09,330 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$CompactionChecker: regionserver21600.compactionChecker exiting 2015-06-29,12:46:12,403 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$PeriodicMemstoreFlusher: regionserver21600.periodicFlusher exiting {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
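The patch builds on Java's standard uncaught-exception mechanism. As a generic, standalone sketch (not the HBase code itself): installing an `UncaughtExceptionHandler` on the lease thread means that if it dies unexpectedly, the reason is captured instead of the thread vanishing silently, which is exactly the debugging gap described above.

```java
import java.util.concurrent.atomic.AtomicReference;

public class LeaseHandlerSketch {
    // Runs a thread that dies with an exception; the handler records why.
    static String dieAndReport() {
        AtomicReference<String> reason = new AtomicReference<>();
        Thread lease = new Thread(() -> {
            throw new IllegalStateException("lease loop failed");
        }, "lease-thread");
        lease.setUncaughtExceptionHandler((t, e) ->
            // A real server would log here and trigger an orderly stop.
            reason.set(t.getName() + ": " + e.getMessage()));
        lease.start();
        try {
            lease.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        return reason.get();
    }

    public static void main(String[] args) {
        System.out.println(dieAndReport());
    }
}
```

HBase's existing per-thread exception handling that the comment praises is this same mechanism routed into the server's abort/log path.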
[jira] [Created] (HBASE-14028) DistributedLogReplay drops edits when ITBLL 125M
stack created HBASE-14028: - Summary: DistributedLogReplay drops edits when ITBLL 125M Key: HBASE-14028 URL: https://issues.apache.org/jira/browse/HBASE-14028 Project: HBase Issue Type: Bug Components: Recovery Affects Versions: 1.2.0 Reporter: stack Testing DLR before 1.2.0RC gets cut, we are dropping edits. Issue seems to be around replay into a deployed region that is on a server that dies before all edits have finished replaying. Logging is sparse on sequenceid accounting so can't tell for sure how it is happening (and if our now accounting by Store is messing up DLR). Digging. I notice also that DLR does not refresh its cache of region location on error -- it just keeps trying till whole WAL fails 8 retries...about 30 seconds. We could do a bit of refactor and have the replay find region in new location if moved during DLR replay. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13387) Add ByteBufferedCell an extension to Cell
[ https://issues.apache.org/jira/browse/HBASE-13387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615843#comment-14615843 ] stack commented on HBASE-13387: --- Should ByteBufferedCell be in hbase-server module? Can it be in the RegionServer package? (It doesn't look like it given we look for it in comparators. I suppose also BBCell doesn't have to have anything to do with Server when in common module. Someone else might want to use it for some other purpose? Patch LGTM but how about tests of the new methods added especially when backed by a BBCell? Nice work. Add ByteBufferedCell an extension to Cell - Key: HBASE-13387 URL: https://issues.apache.org/jira/browse/HBASE-13387 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: ByteBufferedCell.docx, HBASE-13387_v1.patch, WIP_HBASE-13387_V2.patch, WIP_ServerCell.patch, benchmark.zip This came in btw the discussion abt the parent Jira and recently Stack added as a comment on the E2E patch on the parent Jira. The idea is to add a new Interface 'ByteBufferedCell' in which we can add new buffer based getter APIs and getters for position in components in BB. We will keep this interface @InterfaceAudience.Private. When the Cell is backed by a DBB, we can create an Object implementing this new interface. The Comparators has to be aware abt this new Cell extension and has to use the BB based APIs rather than getXXXArray(). Also give util APIs in CellUtil to abstract the checks for new Cell type. (Like matchingXXX APIs, getValueAstype APIs etc) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
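The motivation for buffer-based getters is avoiding the copy that `getXXXArray()` forces when a cell's bytes live in a (possibly off-heap) ByteBuffer. A standalone sketch of the access pattern (not the actual `ByteBufferedCell` interface — field and method names here are made up):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BBCellSketch {
    final ByteBuffer buf;
    final int valuePos;
    final int valueLen;

    BBCellSketch(ByteBuffer buf, int valuePos, int valueLen) {
        this.buf = buf;
        this.valuePos = valuePos;
        this.valueLen = valueLen;
    }

    // Buffer-based access: absolute get, no byte[] copy, no position change.
    byte valueByteAt(int i) {
        return buf.get(valuePos + i);
    }

    // The array-style getter a comparator would otherwise need: allocates
    // and copies, which is what the BB-aware comparators avoid.
    byte[] getValueArray() {
        byte[] out = new byte[valueLen];
        for (int i = 0; i < valueLen; i++) {
            out[i] = valueByteAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.wrap("rowkey:value".getBytes(StandardCharsets.UTF_8));
        BBCellSketch cell = new BBCellSketch(bb, 7, 5); // value bytes at offset 7
        System.out.println(new String(cell.getValueArray(), StandardCharsets.UTF_8));
    }
}
```

A comparator aware of the buffer-backed type can compare byte-by-byte via `valueByteAt`-style access and never materialize the array.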
[jira] [Resolved] (HBASE-13352) Add hbase.import.version to Import usage.
[ https://issues.apache.org/jira/browse/HBASE-13352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar resolved HBASE-13352. --- Resolution: Fixed Assignee: Lars Hofhansl Hadoop Flags: Reviewed Fix Version/s: (was: 1.2.1) 1.2.0 Committed this to 0.98+. Thanks Lars. Add hbase.import.version to Import usage. - Key: HBASE-13352 URL: https://issues.apache.org/jira/browse/HBASE-13352 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 2.0.0, 0.98.14, 1.0.2, 1.2.0, 1.1.2, 1.3.0 Attachments: 13352-v2.txt, 13352.txt, hbase-13352_v3.patch We just tried to export some (small amount of) data out of an 0.94 cluster to 0.98 cluster. We used Export/Import for that. By default we found that the import M/R job correctly reports the number of records seen, but _silently_ does not import anything. After looking at the 0.98 it's obvious there's an hbase.import.version (-Dhbase.import.version=0.94) to make this work. Two issues: # -Dhbase.import.version=0.94 should be show with the the Import.usage # If not given it should not just silently not import anything In this issue I'll just a trivially add this option to the Import tool's usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13988) Add exception handler for lease thread
[ https://issues.apache.org/jira/browse/HBASE-13988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615855#comment-14615855 ] Enis Soztutar commented on HBASE-13988: --- [~saint@gmail.com] any further comments? Otherwise, I'll commit this. Add exception handler for lease thread -- Key: HBASE-13988 URL: https://issues.apache.org/jira/browse/HBASE-13988 Project: HBase Issue Type: Bug Affects Versions: 2.0.0 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0, 1.0.2, 1.1.2, 0.98.15 Attachments: HBASE-13988-v001.diff In a prod cluster, a region server exited for some important threads were not alive. After excluding other threads from the log, we doubted the lease thread was the root. So we need to add an exception handler to the lease thread to debug why it exited in future. {quote} 2015-06-29,12:46:09,222 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: One or more threads are no longer alive -- stop 2015-06-29,12:46:09,223 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 21600 ... 2015-06-29,12:46:09,330 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting. 2015-06-29,12:46:09,330 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Thread-37 exiting 2015-06-29,12:46:09,330 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$CompactionChecker: regionserver21600.compactionChecker exiting 2015-06-29,12:46:12,403 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$PeriodicMemstoreFlusher: regionserver21600.periodicFlusher exiting {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low
[ https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615871#comment-14615871 ] Enis Soztutar commented on HBASE-13832: --- bq. org.apache.hadoop.hbase.master.procedure.TestWALProcedureStoreOnHDFS This is the new test from the patch. Seems related. bq. before with the while (isRunning()) we were spinning after the signal, to make clear that there were no other run of the syncLoop(). in this case we may do another round of the loop and execute stuff which in theory is not what you expect after sending the abort signal. I guess we can rethrow the exception after here: {code} +} catch (Throwable t) { + syncException.compareAndSet(null, t); {code} Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low --- Key: HBASE-13832 URL: https://issues.apache.org/jira/browse/HBASE-13832 Project: HBase Issue Type: Sub-task Components: master, proc-v2 Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Stephen Yuan Jiang Assignee: Matteo Bertozzi Priority: Critical Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1 Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, HBASE-13832-v2.patch, HBASE-13832-v4.patch, HBASE-13832-v5.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, hbase-13832-v3.patch when the data node 3, we got failure in WALProcedureStore#syncLoop() during master start. The failure prevents master to get started. {noformat} 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Sync slot failed, abort. java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. 
(Nodes: current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]], original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3ced-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983- 490ece56c772,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration. at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951) {noformat} One proposal is to implement some similar logic as FSHLog: if IOException is thrown during syncLoop in WALProcedureStore#start(), instead of immediate abort, we could try to roll the log and see whether this resolve the issue; if the new log cannot be created or more exception from rolling the log, we then abort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
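The proposed recovery policy can be sketched generically (interface and method names here are hypothetical, not the `WALProcedureStore` API): on a sync `IOException`, roll to a fresh log file — which forces a new HDFS write pipeline — and retry once; abort only if the roll or the retried sync also fails.

```java
import java.io.IOException;

public class SyncRollSketch {
    interface Store {
        void sync() throws IOException;
        void rollWriter() throws IOException;
    }

    /** @return true if the sync ultimately succeeded; false means abort. */
    static boolean syncWithRoll(Store store) {
        try {
            store.sync();
            return true;
        } catch (IOException first) {
            try {
                store.rollWriter(); // new file => new datanode pipeline
                store.sync();       // retry once on the fresh log
                return true;
            } catch (IOException second) {
                return false;       // rolling did not help: abort the master
            }
        }
    }
}
```

This mirrors the FSHLog-style behavior the proposal refers to: a transient pipeline failure is survived by rolling, while a persistent one still aborts.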
[jira] [Commented] (HBASE-13890) Get/Scan from MemStore only (Client API)
[ https://issues.apache.org/jira/browse/HBASE-13890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615905#comment-14615905 ] Vladimir Rodionov commented on HBASE-13890: --- The new API contract is the following: * No guarantee on data (if it is not in MemStore, you get nothing) * Data can be partial * Always check the result of the operation; issue a regular operation if the result is empty or partial. Get/Scan from MemStore only (Client API) Key: HBASE-13890 URL: https://issues.apache.org/jira/browse/HBASE-13890 Project: HBase Issue Type: New Feature Components: API, Client, Scanners Reporter: Vladimir Rodionov Assignee: Vladimir Rodionov Attachments: HBASE-13890-v1.patch This is short-circuit read for get/scan when recent data (version) of a cell can be found only in MemStore (with very high probability). Good examples are: Atomic counters and appends. This feature will allow to bypass completely store file scanners and improve performance and latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
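Because the contract gives no completeness guarantee, callers are expected to fall back. A toy standalone sketch of that client-side pattern (the types here are made up; the real fast path goes through the attribute on the HBase Get/Scan):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class MemstoreOnlyFallback {
    final Map<String, String> memstore = new HashMap<>();
    final Map<String, String> storeFiles = new HashMap<>();

    // Fast path: consults only in-memory data, may come back empty.
    Optional<String> getMemstoreOnly(String row) {
        return Optional.ofNullable(memstore.get(row));
    }

    // Contract-following read: check the fast-path result, and issue the
    // regular (store-file-reading) operation when it is empty.
    String get(String row) {
        return getMemstoreOnly(row)
            .orElseGet(() -> storeFiles.get(row));
    }
}
```

For the JIRA's motivating workloads (atomic counters, appends) the fast path almost always hits, so the fallback is rare.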
[jira] [Commented] (HBASE-13965) Stochastic Load Balancer JMX Metrics
[ https://issues.apache.org/jira/browse/HBASE-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615909#comment-14615909 ] Lei Chen commented on HBASE-13965: -- Good point. It can be more memory efficient if description is stored only once for each cost function. Patch will be updated. Stochastic Load Balancer JMX Metrics Key: HBASE-13965 URL: https://issues.apache.org/jira/browse/HBASE-13965 Project: HBase Issue Type: Improvement Components: Balancer, metrics Reporter: Lei Chen Assignee: Lei Chen Attachments: HBASE-13965-v3.patch, HBASE-13965-v4.patch, HBASE-13965_v2.patch, HBase-13965-v1.patch, stochasticloadbalancerclasses_v2.png Today’s default HBase load balancer (the Stochastic load balancer) is cost function based. The cost function weights are tunable but no visibility into those cost function results is directly provided. A driving example is a cluster we have been tuning which has skewed rack size (one rack has half the nodes of the other few racks). We are tuning the cluster for uniform response time from all region servers with the ability to tolerate a rack failure. Balancing LocalityCost, RegionReplicaRack Cost and RegionCountSkew Cost is difficult without a way to attribute each cost function’s contribution to overall cost. What this jira proposes is to provide visibility via JMX into each cost function of the stochastic load balancer, as well as the overall cost of the balancing plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13352) Add hbase.import.version to Import usage.
[ https://issues.apache.org/jira/browse/HBASE-13352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615912#comment-14615912 ] Hudson commented on HBASE-13352: FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #1002 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/1002/]) HBASE-13352 Add hbase.import.version to Import usage (Lars Hofhansl) (enis: rev deba1d0a1075c62647eb63be09b87b20329fe44a) * hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/Import.java Add hbase.import.version to Import usage. - Key: HBASE-13352 URL: https://issues.apache.org/jira/browse/HBASE-13352 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 2.0.0, 0.98.14, 1.0.2, 1.2.0, 1.1.2, 1.3.0 Attachments: 13352-v2.txt, 13352.txt, hbase-13352_v3.patch We just tried to export some (small amount of) data out of an 0.94 cluster to 0.98 cluster. We used Export/Import for that. By default we found that the import M/R job correctly reports the number of records seen, but _silently_ does not import anything. After looking at the 0.98 it's obvious there's an hbase.import.version (-Dhbase.import.version=0.94) to make this work. Two issues: # -Dhbase.import.version=0.94 should be show with the the Import.usage # If not given it should not just silently not import anything In this issue I'll just a trivially add this option to the Import tool's usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13965) Stochastic Load Balancer JMX Metrics
[ https://issues.apache.org/jira/browse/HBASE-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei Chen updated HBASE-13965: - Attachment: HBASE-13965-v5.patch Updates: 1. One copy of description is saved for each cost function, in a separate map TODO: 1. Make hard-coded map size configurable? Stochastic Load Balancer JMX Metrics Key: HBASE-13965 URL: https://issues.apache.org/jira/browse/HBASE-13965 Project: HBase Issue Type: Improvement Components: Balancer, metrics Reporter: Lei Chen Assignee: Lei Chen Attachments: HBASE-13965-v3.patch, HBASE-13965-v4.patch, HBASE-13965-v5.patch, HBASE-13965_v2.patch, HBase-13965-v1.patch, stochasticloadbalancerclasses_v2.png Today’s default HBase load balancer (the Stochastic load balancer) is cost function based. The cost function weights are tunable but no visibility into those cost function results is directly provided. A driving example is a cluster we have been tuning which has skewed rack size (one rack has half the nodes of the other few racks). We are tuning the cluster for uniform response time from all region servers with the ability to tolerate a rack failure. Balancing LocalityCost, RegionReplicaRack Cost and RegionCountSkew Cost is difficult without a way to attribute each cost function’s contribution to overall cost. What this jira proposes is to provide visibility via JMX into each cost function of the stochastic load balancer, as well as the overall cost of the balancing plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14017) Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion
[ https://issues.apache.org/jira/browse/HBASE-14017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615734#comment-14615734 ] Hudson commented on HBASE-14017: FAILURE: Integrated in HBase-TRUNK #6632 (See [https://builds.apache.org/job/HBase-TRUNK/6632/]) HBASE-14017 Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion (busbey: rev 1713f1fcaf9d721a97bc564faaf070f2e6b0b1d1) * hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterProcedureQueue.java * hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureQueue.java Procedure v2 - MasterProcedureQueue fix concurrency issue on table queue deletion - Key: HBASE-14017 URL: https://issues.apache.org/jira/browse/HBASE-14017 Project: HBase Issue Type: Sub-task Components: proc-v2 Affects Versions: 2.0.0, 1.2.0, 1.1.1, 1.3.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.2 Attachments: HBASE-14017-v0.patch, HBASE-14017-v0.patch, HBASE-14017.as-pushed-master.patch, HBASE-14017.v1-branch1.1.patch, HBASE-14017.v1-branch1.1.patch [~syuanjiang] found a concurrecy issue in the procedure queue delete where we don't have an exclusive lock before deleting the table {noformat} Thread 1: Create table is running - the queue is empty and wlock is false Thread 2: markTableAsDeleted see the queue empty and wlock= false Thread 1: tryWrite() set wlock=true; too late Thread 2: delete the queue Thread 1: never able to release the lock - NPE when trying to get the queue {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14027) Clean up netty dependencies
Sean Busbey created HBASE-14027: --- Summary: Clean up netty dependencies Key: HBASE-14027 URL: https://issues.apache.org/jira/browse/HBASE-14027 Project: HBase Issue Type: Improvement Components: build Reporter: Sean Busbey Assignee: Sean Busbey Fix For: 2.0.0, 1.2.0 We have multiple copies of Netty (3?) getting shipped around. clean some up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)