[jira] Commented: (HBASE-50) Snapshot of table
[ https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898127#action_12898127 ]

HBase Review Board commented on HBASE-50:
-----------------------------------------

Message from: Chongxin Li <lichong...@zju.edu.cn>

bq. On 2010-08-12 10:33:25, Ted Yu wrote:
bq. src/main/java/org/apache/hadoop/hbase/master/DeleteSnapshot.java, line 98
bq. http://review.cloudera.org/r/467/diff/4/?file=6589#file6589line98
bq.
bq. Is there more to be done here? Deleting the region dir?

bq. On 2010-08-12 10:33:25, Ted Yu wrote:
bq. src/main/java/org/apache/hadoop/hbase/master/DeleteSnapshot.java, line 94
bq. http://review.cloudera.org/r/467/diff/4/?file=6589#file6589line94
bq.
bq. Should the return value be checked?

Deleting the snapshot directory at the end would delete all snapshot files anyway. Do we still have to check the return value? What if the return value is false -- just log it?

- Chongxin

---
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/467/#review874
---

Snapshot of table
-----------------

Key: HBASE-50
URL: https://issues.apache.org/jira/browse/HBASE-50
Project: HBase
Issue Type: New Feature
Reporter: Billy Pearson
Assignee: Li Chongxin
Priority: Minor
Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot Design Report V3.pdf, HBase Snapshot Implementation Plan.pdf, Snapshot Class Diagram.png

Having an option to take a snapshot of a table would be very useful in production. What I would like to see this option do is merge all the data into one or more files stored in the same folder on the DFS. This way we could save data in case of a software bug in Hadoop or user code. The other advantage would be the ability to export a table to multiple locations. Say I had a read-only table that must be online. I could take a snapshot of it when needed, export it to a separate data center, and have it loaded there; then I would have it online at multiple data centers for load balancing and failover. I understand that Hadoop removes the need for backups to protect from failed servers, but this does not protect us from software bugs that might delete or alter data in ways we did not plan. We should have a way to roll back a dataset.
[jira] Commented: (HBASE-50) Snapshot of table
[ https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898203#action_12898203 ]

HBase Review Board commented on HBASE-50:
-----------------------------------------

Message from: Ted Yu <ted...@yahoo.com>

---
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/467/#review897
---

src/main/java/org/apache/hadoop/hbase/master/DeleteSnapshot.java
http://review.cloudera.org/r/467/#comment2925
We should log if we fail to delete.

src/main/java/org/apache/hadoop/hbase/master/DeleteSnapshot.java
http://review.cloudera.org/r/467/#comment2924
Yes.

- Ted

Snapshot of table
-----------------

Key: HBASE-50
URL: https://issues.apache.org/jira/browse/HBASE-50
Project: HBase
Issue Type: New Feature
Reporter: Billy Pearson
Assignee: Li Chongxin
Priority: Minor
Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot Design Report V3.pdf, HBase Snapshot Implementation Plan.pdf, Snapshot Class Diagram.png

Having an option to take a snapshot of a table would be very useful in production. What I would like to see this option do is merge all the data into one or more files stored in the same folder on the DFS. This way we could save data in case of a software bug in Hadoop or user code. The other advantage would be the ability to export a table to multiple locations. Say I had a read-only table that must be online. I could take a snapshot of it when needed, export it to a separate data center, and have it loaded there; then I would have it online at multiple data centers for load balancing and failover. I understand that Hadoop removes the need for backups to protect from failed servers, but this does not protect us from software bugs that might delete or alter data in ways we did not plan. We should have a way to roll back a dataset.
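The pattern both reviewers converge on is to check the boolean that Hadoop's FileSystem.delete returns and log when it is false. A minimal sketch of that pattern (hypothetical class and method names, not the actual DeleteSnapshot code):

{code}
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteSnapshotSketch {
  private static final Log LOG = LogFactory.getLog(DeleteSnapshotSketch.class);

  // FileSystem.delete returns false (rather than throwing) when the delete
  // does not happen, so the result must be checked explicitly.
  static void deleteAndLog(FileSystem fs, Path path) throws IOException {
    if (!fs.delete(path, true)) { // true = recursive
      LOG.warn("Failed to delete snapshot path: " + path);
    }
  }
}
{code}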
[jira] Assigned: (HBASE-2907) [rest/stargate] Improve error response when trying to create a scanner on a nonexistent table
[ https://issues.apache.org/jira/browse/HBASE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell reassigned HBASE-2907:
-------------------------------------

Assignee: Andrew Purtell

[rest/stargate] Improve error response when trying to create a scanner on a nonexistent table
----------------------------------------------------------------------------------------------

Key: HBASE-2907
URL: https://issues.apache.org/jira/browse/HBASE-2907
Project: HBase
Issue Type: Improvement
Components: rest
Reporter: Kieron Briggs
Assignee: Andrew Purtell
Priority: Minor

Since 0.20.4, an attempt to create a scanner for a nonexistent table receives a 400 Bad Request response with no further information. Prior to 0.20.4 it would receive a 500 "org.apache.hadoop.hbase.TableNotFoundException: table" response with a stack trace in the body. Neither of these is ideal: the 400 fails to identify what aspect of the request was bad, and the 500 incorrectly suggests that the error was internal. Ideally the error should be a 400 with information in the body identifying the nature of the problem.
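Stargate's resources are JAX-RS based, so the fix the report asks for amounts to returning a 400 whose entity names the missing table. A hedged sketch of that response shape (an illustrative helper, not the actual Stargate handler):

{code}
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

public class ScannerErrorSketch {
  // Build a 400 whose body identifies what was wrong with the request,
  // rather than an empty 400 or a misleading 500.
  static Response tableNotFound(String table) {
    return Response.status(Response.Status.BAD_REQUEST)
        .type(MediaType.TEXT_PLAIN)
        .entity("Bad request: table '" + table + "' does not exist\n")
        .build();
  }
}
{code}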
[jira] Assigned: (HBASE-2911) [stargate] Fix JSON handling of META and ROOT
[ https://issues.apache.org/jira/browse/HBASE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell reassigned HBASE-2911:
-------------------------------------

Assignee: Andrew Purtell

[stargate] Fix JSON handling of META and ROOT
---------------------------------------------

Key: HBASE-2911
URL: https://issues.apache.org/jira/browse/HBASE-2911
Project: HBase
Issue Type: Bug
Components: rest
Reporter: Lars George
Assignee: Andrew Purtell

While working on the HBase Explorer front end in Hue, I found a few inconsistencies between the plain text version of values and the JSON representation. From an email conversation:

Plain Text
----------

$ curl -H "Accept: text/plain" localhost:/status/cluster
1 live servers, 0 dead servers, 5.0 average load

1 live servers
de1-app-mbp-2.fritz.box:62884 1280924907616
requests=0, regions=5 heapSizeMB=27 maxHeapSizeMB=995
t2,,1280917558997 stores=3 storefiles=0 storefileSizeMB=0 memstoreSizeMB=0 storefileIndexSizeMB=0
usertable,,1280917566604 stores=3 storefiles=2 storefileSizeMB=224 memstoreSizeMB=0 storefileIndexSizeMB=0
.META.,,1 stores=2 storefiles=1 storefileSizeMB=0 memstoreSizeMB=0 storefileIndexSizeMB=0
t1,,1280917554475 stores=3 storefiles=0 storefileSizeMB=0 memstoreSizeMB=0 storefileIndexSizeMB=0
-ROOT-,,0 stores=1 storefiles=1 storefileSizeMB=0 memstoreSizeMB=0 storefileIndexSizeMB=0

JSON
----

And curling the JSON yields:

$ curl -H "Accept: application/json" localhost:/status/cluster
{"requests":0,"regions":5,"averageLoad":5.0,"DeadNodes":[null],"LiveNodes":[{"Node":{"startCode":1280924907616,"requests":0,"name":"de1-app-mbp-2.fritz.box:62884","maxHeapSizeMB":995,"heapSizeMB":27,"Region":[{"stores":3,"storefiles":0,"storefileSizeMB":0,"storefileIndexSizeMB":0,"name":"dDIsLDEyODA5MTc1NTg5OTc=","memstoreSizeMB":0},{"stores":3,"storefiles":2,"storefileSizeMB":224,"storefileIndexSizeMB":0,"name":"dXNlcnRhYmxlLCwxMjgwOTE3NTY2NjA0","memstoreSizeMB":0},{"stores":2,"storefiles":1,"storefileSizeMB":0,"storefileIndexSizeMB":0,"name":"Lk1FVEEuLCwx","memstoreSizeMB":0},{"stores":3,"storefiles":0,"storefileSizeMB":0,"storefileIndexSizeMB":0,"name":"dDEsLDEyODA5MTc1NTQ0NzU=","memstoreSizeMB":0},{"stores":1,"storefiles":1,"storefileSizeMB":0,"storefileIndexSizeMB":0,"name":"LVJPT1QtLCww","memstoreSizeMB":0}]}}]}

And another one: I have another one with .META. and -ROOT-. In my small sample setup (all local, /tmp etc.) I see this in the master UI:

Name       Region Server    Encoded Name  Start Key  End Key
.META.,,1  10.0.0.43:60030  -

But running the same against Stargate I get:

$ curl -H "Accept: application/json" http://localhost:/.META./regions
{"name":".META."}

while a normal user table with a single row has:

Name           Region Server    Encoded Name  Start Key  End Key
t1,,128615489  10.0.0.43:60030  1127696125

and through Stargate:

$ curl -H "Accept: application/json" http://localhost:/t1/regions
{"name":"t1","Region":[{"location":"10.0.0.43:54988","endKey":"","startKey":"","id":128615489,"name":"t1,,128615489"}]}

So the internal tables are not reported right.
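For anyone comparing the two outputs: the JSON region names are the raw region-name bytes run through Base64, which is why .META.,,1 shows up as Lk1FVEEuLCwx. A quick way to check, using the commons-codec Base64 that HBase already ships:

{code}
import org.apache.commons.codec.binary.Base64;

public class DecodeRegionName {
  public static void main(String[] args) {
    // "Lk1FVEEuLCwx" is the Base64 form of the region name ".META.,,1"
    String encoded = "Lk1FVEEuLCwx";
    System.out.println(new String(Base64.decodeBase64(encoded.getBytes())));
  }
}
{code}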
[jira] Created: (HBASE-2914) Profiling indicates that ThriftUtilities.rowResultFromHBase is quite inefficient
Profiling indicates that ThriftUtilities.rowResultFromHBase is quite inefficient
---------------------------------------------------------------------------------

Key: HBASE-2914
URL: https://issues.apache.org/jira/browse/HBASE-2914
Project: HBase
Issue Type: Bug
Affects Versions: 0.89.20100621, 0.20.6
Reporter: ryan rawson
Fix For: 0.90.0
Attachments: HBASE-2914.patch

Profiling of the ThriftServer here at SU has indicated that the call ThriftUtilities.rowResultFromHBase() is quite inefficient. It first calls Result.getRowResult(), which is inefficient and slow. By instead reimplementing it to create the TRowResult (the Thrift return type) straight from the KeyValue[] array, the performance boost is substantial, reducing the time spent serializing results. In my profiling, the time spent in scannerGetList() went from 1100ms to 108ms on similar test runs.
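The reimplementation described is, in outline, a single pass over Result.raw(). A sketch of the idea against the 0.89-era APIs (the generated Thrift field shapes are assumed here, so treat this as illustrative rather than the committed patch):

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.thrift.generated.TCell;
import org.apache.hadoop.hbase.thrift.generated.TRowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class RowResultSketch {
  static List<TRowResult> rowResultFromHBase(Result[] in) {
    List<TRowResult> results = new ArrayList<TRowResult>(in.length);
    for (Result result : in) {
      if (result == null || result.isEmpty()) {
        continue;
      }
      TRowResult rr = new TRowResult();
      rr.row = result.getRow();
      rr.columns = new TreeMap<byte[], TCell>(Bytes.BYTES_COMPARATOR);
      // Walk the KeyValue[] directly instead of materializing the old
      // RowResult structure via Result.getRowResult().
      for (KeyValue kv : result.raw()) {
        rr.columns.put(KeyValue.makeColumn(kv.getFamily(), kv.getQualifier()),
            new TCell(kv.getValue(), kv.getTimestamp()));
      }
      results.add(rr);
    }
    return results;
  }
}
{code}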
[jira] Assigned: (HBASE-2914) Profiling indicates that ThriftUtilities.rowResultFromHBase is quite inefficient
[ https://issues.apache.org/jira/browse/HBASE-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson reassigned HBASE-2914:
----------------------------------

Assignee: ryan rawson

Profiling indicates that ThriftUtilities.rowResultFromHBase is quite inefficient
---------------------------------------------------------------------------------

Key: HBASE-2914
URL: https://issues.apache.org/jira/browse/HBASE-2914
Project: HBase
Issue Type: Bug
Affects Versions: 0.20.6, 0.89.20100621
Reporter: ryan rawson
Assignee: ryan rawson
Fix For: 0.90.0
Attachments: HBASE-2914.patch

Profiling of the ThriftServer here at SU has indicated that the call ThriftUtilities.rowResultFromHBase() is quite inefficient. It first calls Result.getRowResult(), which is inefficient and slow. By instead reimplementing it to create the TRowResult (the Thrift return type) straight from the KeyValue[] array, the performance boost is substantial, reducing the time spent serializing results. In my profiling, the time spent in scannerGetList() went from 1100ms to 108ms on similar test runs.
[jira] Created: (HBASE-2915) Deadlock between HRegion.ICV and HRegion.close
Deadlock between HRegion.ICV and HRegion.close
----------------------------------------------

Key: HBASE-2915
URL: https://issues.apache.org/jira/browse/HBASE-2915
Project: HBase
Issue Type: Bug
Reporter: Jean-Daniel Cryans
Priority: Blocker
Fix For: 0.90.0

HRegion.ICV gets a row lock, then takes the newScanner lock. HRegion.close takes the newScanner lock, then splitCloseLock, and finally waits for all row locks to be released. If the ICV got the row lock and then close got the newScannerLock, both end up waiting on the other. This was introduced when Get became a Scan. Stack thinks we can get rid of the newScannerLock in close since we set closing to true.
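The cycle is a textbook lock-ordering inversion. A standalone reduction (hypothetical locks standing in for the HRegion internals, not actual HBase code; run it and the two threads usually block each other):

{code}
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderDeadlock {
  static final ReentrantLock rowLock = new ReentrantLock();
  static final ReentrantLock newScannerLock = new ReentrantLock();

  public static void main(String[] args) {
    Thread icv = new Thread(new Runnable() {   // stands in for HRegion.incrementColumnValue
      public void run() {
        rowLock.lock();
        try {
          pause();                             // widen the race window
          newScannerLock.lock();               // blocks while close() holds it
          newScannerLock.unlock();
        } finally {
          rowLock.unlock();
        }
      }
    });
    Thread close = new Thread(new Runnable() { // stands in for HRegion.close
      public void run() {
        newScannerLock.lock();
        try {
          pause();
          rowLock.lock();                      // stands in for "wait for all row locks"
          rowLock.unlock();
        } finally {
          newScannerLock.unlock();
        }
      }
    });
    icv.start();
    close.start();
  }

  static void pause() {
    try { Thread.sleep(100); } catch (InterruptedException e) { }
  }
}
{code}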
[jira] Created: (HBASE-2916) Reseek directly to next column
Reseek directly to next column
------------------------------

Key: HBASE-2916
URL: https://issues.apache.org/jira/browse/HBASE-2916
Project: HBase
Issue Type: Improvement
Components: regionserver
Reporter: Pranav Khaitan

When done with the current column, reseek directly to the next column rather than spending time reading more keys of the current row-column that are not required.
[jira] Created: (HBASE-2917) Reseek directly to next row
Reseek directly to next row
---------------------------

Key: HBASE-2917
URL: https://issues.apache.org/jira/browse/HBASE-2917
Project: HBase
Issue Type: Improvement
Reporter: Pranav Khaitan

When done with the current row, reseek directly to the next row rather than spending time reading more keys of the current row that are not required.
[jira] Commented: (HBASE-2910) [stargate] Add /config/cluster endpoint to retrieve the current configuration
[ https://issues.apache.org/jira/browse/HBASE-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898395#action_12898395 ]

Lars George commented on HBASE-2910:
------------------------------------

Mainly HBase, but the more the better. The idea is to be able to tell the UI user what the current cluster configuration looks like.

[stargate] Add /config/cluster endpoint to retrieve the current configuration
------------------------------------------------------------------------------

Key: HBASE-2910
URL: https://issues.apache.org/jira/browse/HBASE-2910
Project: HBase
Issue Type: Improvement
Components: rest
Reporter: Lars George
Attachments: Hue HBase Explorer.jpg

I am working on the Hue-based front end called the HBase Explorer. It would be good to be able to also display the current cluster configuration.
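Since this is only a proposal, here is one hypothetical shape the resource could take: dump the server's Configuration key/value pairs as plain text. The path, class, and output format are assumptions, not existing Stargate code:

{code}
import java.util.Map;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

@Path("/config/cluster")
public class ConfigResource {
  @GET
  @Produces("text/plain")
  public String get() {
    // Configuration is iterable as key/value pairs.
    Configuration conf = HBaseConfiguration.create();
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : conf) {
      sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
    }
    return sb.toString();
  }
}
{code}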
[jira] Commented: (HBASE-2917) Reseek directly to next row
[ https://issues.apache.org/jira/browse/HBASE-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898396#action_12898396 ]

Pranav Khaitan commented on HBASE-2917:
---------------------------------------

Ryan says: we should be doing an optimized reseek here, by using KeyValue.createLastOnRow(), which will take us to the next row.

Reseek directly to next row
---------------------------

Key: HBASE-2917
URL: https://issues.apache.org/jira/browse/HBASE-2917
Project: HBase
Issue Type: Improvement
Reporter: Pranav Khaitan

When done with the current row, reseek directly to the next row rather than spending time reading more keys of the current row that are not required.
[jira] Commented: (HBASE-2916) Reseek directly to next column
[ https://issues.apache.org/jira/browse/HBASE-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898397#action_12898397 ]

Pranav Khaitan commented on HBASE-2916:
---------------------------------------

Ryan says: I think it is also possible to create a 'last on column' value as well for reseek optimization here.

Reseek directly to next column
------------------------------

Key: HBASE-2916
URL: https://issues.apache.org/jira/browse/HBASE-2916
Project: HBase
Issue Type: Improvement
Components: regionserver
Reporter: Pranav Khaitan

When done with the current column, reseek directly to the next column rather than spending time reading more keys of the current row-column that are not required.
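Both comments point at the same trick: synthesize a KeyValue that sorts after everything left in the current row (or row-column) and reseek to it. A sketch for the next-row case, assuming the KeyValue.createLastOnRow factory Ryan names; the 'last on column' counterpart for HBASE-2916 would be analogous:

{code}
import org.apache.hadoop.hbase.KeyValue;

public class ReseekSketch {
  // Build a key that sorts after every KeyValue of the current row, so a
  // reseek lands the scanner directly on the first KeyValue of the next
  // row instead of iterating the rest of the current one.
  static KeyValue nextRowSeekKey(KeyValue current) {
    return KeyValue.createLastOnRow(current.getRow());
  }
}
{code}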
[jira] Commented: (HBASE-2315) BookKeeper for write-ahead logging
[ https://issues.apache.org/jira/browse/HBASE-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898424#action_12898424 ]

Benjamin Reed commented on HBASE-2315:
--------------------------------------

We looked into the problem of figuring out the path to use for the WAL and found the following: it appears that the assumption that the WAL is stored in HDFS is embedded in HBase. When looking up a WAL, for example, the FileSystem object is used to check existence. Deletion of logs also happens outside of the WAL interfaces. To be truly pluggable, a WAL interface should be used to enumerate and delete logs. Have you guys thought about doing this?

BookKeeper for write-ahead logging
----------------------------------

Key: HBASE-2315
URL: https://issues.apache.org/jira/browse/HBASE-2315
Project: HBase
Issue Type: New Feature
Components: regionserver
Reporter: Flavio Junqueira
Attachments: bookkeeperOverview.pdf, HBASE-2315.patch, zookeeper-dev-bookkeeper.jar

BookKeeper, a contrib of the ZooKeeper project, is a fault-tolerant, high-throughput write-ahead logging service. This issue provides an implementation of write-ahead logging for HBase using BookKeeper. Apart from expected throughput improvements, BookKeeper also has stronger durability guarantees compared to the implementation currently used by HBase.
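To make the suggestion concrete, the pluggability Benjamin asks for would look something like the following hypothetical interface (not an existing HBase type), where enumeration and deletion belong to the log implementation rather than to a FileSystem the caller reaches around it:

{code}
import java.io.IOException;
import java.util.List;

// Hypothetical sketch: a WAL provider that owns the full life cycle of
// its logs, so a BookKeeper-backed implementation never needs HDFS paths
// to exist.
public interface WriteAheadLog {
  void append(byte[] regionName, byte[] edit) throws IOException; // write an edit
  void sync() throws IOException;                                 // make edits durable
  List<String> listLogs() throws IOException;                     // enumerate existing logs
  void deleteLog(String logName) throws IOException;              // retire a log
}
{code}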
[jira] Commented: (HBASE-2315) BookKeeper for write-ahead logging
[ https://issues.apache.org/jira/browse/HBASE-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898435#action_12898435 ]

stack commented on HBASE-2315:
------------------------------

No. If you want us to switch to an interface, just say so (it will happen faster if you put up a patch).

BookKeeper for write-ahead logging
----------------------------------

Key: HBASE-2315
URL: https://issues.apache.org/jira/browse/HBASE-2315
Project: HBase
Issue Type: New Feature
Components: regionserver
Reporter: Flavio Junqueira
Attachments: bookkeeperOverview.pdf, HBASE-2315.patch, zookeeper-dev-bookkeeper.jar

BookKeeper, a contrib of the ZooKeeper project, is a fault-tolerant, high-throughput write-ahead logging service. This issue provides an implementation of write-ahead logging for HBase using BookKeeper. Apart from expected throughput improvements, BookKeeper also has stronger durability guarantees compared to the implementation currently used by HBase.
[jira] Resolved: (HBASE-2909) SoftValueSortedMap is broken, can generate NPEs
[ https://issues.apache.org/jira/browse/HBASE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-2909.
---------------------------------------

Hadoop Flags: [Reviewed]
Assignee: Jean-Daniel Cryans
Resolution: Fixed

Committed to branch and trunk, thanks for checking it out guys.

SoftValueSortedMap is broken, can generate NPEs
-----------------------------------------------

Key: HBASE-2909
URL: https://issues.apache.org/jira/browse/HBASE-2909
Project: HBase
Issue Type: Bug
Components: client
Affects Versions: 0.20.6, 0.89.20100621
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
Priority: Blocker
Fix For: 0.20.7, 0.90.0
Attachments: hbase-2909.patch

The way SoftValueSortedMap is using SoftValues, it looks like it's able to get its keys garbage collected along with the values themselves. We got this issue in production, but I was also able to randomly generate it using YCSB with 300 threads. Here's an example on 0.20 with JDK 1.6u14:

{noformat}
java.lang.NullPointerException
  at org.apache.hadoop.hbase.util.Bytes.compareTo(Bytes.java:1036)
  at org.apache.hadoop.hbase.util.Bytes$ByteArrayComparator.compare(Bytes.java:104)
  at org.apache.hadoop.hbase.util.Bytes$ByteArrayComparator.compare(Bytes.java:96)
  at java.util.TreeMap.cmp(TreeMap.java:1911)
  at java.util.TreeMap.get(TreeMap.java:1835)
  at org.apache.hadoop.hbase.util.SoftValueSortedMap.get(SoftValueSortedMap.java:91)
  at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getCachedLocation(HConnectionManager.java:788)
  at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:651)
  at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:634)
  at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:601)
  at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:128)
  at org.apache.hadoop.hbase.thrift.ThriftServer$HBaseHandler.getTable(ThriftServer.java:262)
  at org.apache.hadoop.hbase.thrift.ThriftServer$HBaseHandler.mutateRowTs(ThriftServer.java:585)
  at org.apache.hadoop.hbase.thrift.ThriftServer$HBaseHandler.mutateRow(ThriftServer.java:578)
  at org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$mutateRow.process(Hbase.java:2345)
  at org.apache.hadoop.hbase.thrift.generated.Hbase$Processor.process(Hbase.java:1988)
  at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:259)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:619)
{noformat}

In this specific case, the null cannot be the passed key because it's coming from HTable, which uses HConstants.EMPTY_START_ROW. It cannot be a null key that was inserted previously because we would have gotten the NPE at insert time. This can only mean that some key *became* null.
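For context on why keys can "become" null and how to avoid it: the safe shape for a soft-valued map keeps a strong reference to the key inside the SoftReference wrapper and purges cleared entries through a ReferenceQueue before every access. A generic illustration of that pattern (not the actual HBase fix):

{code}
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.TreeMap;

public class SoftValueMapSketch<K, V> {
  static class SoftValue<K, V> extends SoftReference<V> {
    final K key; // strong reference: the key can never be collected out from under the map
    SoftValue(K key, V value, ReferenceQueue<V> queue) {
      super(value, queue);
      this.key = key;
    }
  }

  private final TreeMap<K, SoftValue<K, V>> map = new TreeMap<K, SoftValue<K, V>>();
  private final ReferenceQueue<V> queue = new ReferenceQueue<V>();

  // Drop entries whose values the GC has cleared, before any lookup can
  // hand the comparator a stale entry.
  @SuppressWarnings("unchecked")
  private void purge() {
    SoftValue<K, V> sv;
    while ((sv = (SoftValue<K, V>) queue.poll()) != null) {
      map.remove(sv.key);
    }
  }

  public synchronized V get(K key) {
    purge();
    SoftValue<K, V> sv = map.get(key);
    return sv == null ? null : sv.get();
  }

  public synchronized V put(K key, V value) {
    purge();
    SoftValue<K, V> old = map.put(key, new SoftValue<K, V>(key, value, queue));
    return old == null ? null : old.get();
  }
}
{code}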
[jira] Commented: (HBASE-2898) MultiPut makes proper error handling impossible and leads to corrupted data
[ https://issues.apache.org/jira/browse/HBASE-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898486#action_12898486 ]

ryan rawson commented on HBASE-2898:
------------------------------------

I am interested in a new multi-put for 0.90. There are also cases for multi-get and multi just about everything; see HBASE-1845. Improving serialization would be nice, but we'd have to wreck our KeyValue serialization mechanism and have something custom on the wire. There is a similar situation in the Result serialization as well; it's just KeyValues all the way down.

MultiPut makes proper error handling impossible and leads to corrupted data
----------------------------------------------------------------------------

Key: HBASE-2898
URL: https://issues.apache.org/jira/browse/HBASE-2898
Project: HBase
Issue Type: Bug
Components: client, regionserver
Affects Versions: 0.89.20100621
Reporter: Benoit Sigoure
Priority: Blocker
Fix For: 0.90.0

The {{MultiPut}} RPC needs to be completely rewritten. Let's see why, step by step.

# An HBase user calls any of the {{put}} methods on an {{HTable}} instance.
# Eventually, {{HTable#flushCommits}} is invoked to actually send the edits to the RegionServer(s).
# This takes us to {{HConnectionManager#processBatchOfPuts}}, where all edits are sorted into one or more {{MultiPut}}. Each {{MultiPut}} aggregates all the edits that are going to a particular RegionServer.
# A thread pool is used to send all the {{MultiPut}} in parallel to their respective RegionServers. Let's follow what happens for a single {{MultiPut}}.
# The {{MultiPut}} travels through the IPC code on the client, then through the network, then through the IPC code on the RegionServer.
# We're now in {{HRegionServer#multiPut}}, where a new {{MultiPutResponse}} is created.
# Still in {{HRegionServer#multiPut}}. Since a {{MultiPut}} is essentially a map from region name to a list of {{Put}} for that region, there's a {{for}} loop that executes each list of {{Put}} for each region sequentially. Let's follow what happens for a single list of {{Put}} for a particular region.
# We're now in {{HRegionServer#put(byte[], List<Put>)}}. Each {{Put}} is associated with the row lock that was specified by the client (if any). Then the pairs of {{(Put, lock id)}} are handed to the right {{HRegion}}.
# Now we're in {{HRegion#put(Pair<Put, Integer>[])}}, which immediately takes us to {{HRegion#doMiniBatchPut}}.
# At this point, let's assume that we're doing just 2 edits, so the {{BatchOperationInProgress}} that {{doMiniBatchPut}} uses contains just 2 {{Put}}.
# The {{while}} loop in {{doMiniBatchPut}} that's going to execute each {{Put}} starts.
# The first {{Put}} fails because an exception is thrown when appending the edit to the {{WAL}}. Its {{batchOp.retCodes}} is marked as {{OperationStatusCode.FAILURE}}.
# Because there was an exception, we're back in {{HRegion#put(Pair<Put, Integer>[])}}, where the {{while}} loop will test that {{batchOp.isDone}} is {{false}} and do another iteration.
# {{doMiniBatchPut}} is called again and handles the remaining {{Put}}.
# The second {{Put}} succeeds normally, so its {{batchOp.retCodes}} is marked as {{OperationStatusCode.SUCCESS}}.
# {{doMiniBatchPut}} is done and returns to {{HRegion#put(Pair<Put, Integer>[])}}, which returns to {{HRegionServer#put(byte[], List<Put>)}}.
# At this point, {{HRegionServer#put(byte[], List<Put>)}} does a {{for}} loop and extracts the index of the *first* {{Put}} that failed out of the {{OperationStatusCode[]}}. In our case, it'll return 0 since the first {{Put}} failed.
# This index in the list of {{Put}} of the first one that failed (0 in this case) is returned to {{HRegionServer#multiPut}}, which records it in the {{MultiPutResponse}} -- the client knows that the first {{Put}} failed but has no idea about the other one.

So the client has no reliable way of knowing which {{Put}} failed (if any) past the first failure. All it knows is that for a particular region, they succeeded up to a particular {{Put}}, at which point there was a failure, and then the remaining may or may not have succeeded. Its best bet is to retry all the {{Put}} past the index of the first failure for this region. But this has an unintended consequence: the {{Put}} that were successful during the first run will be *re-applied*, which will unexpectedly create extra versions. Now, I realize most people don't really care about versions, so they won't notice. But whoever relies on the versions for whatever reason will rightfully consider this to be data corruption. As it is now, {{MultiPut}} makes proper error handling impossible. Since this RPC cannot guarantee any atomicity other than at the individual {{Put}} level, it should
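(The message is cut off here in the digest.) The consequence described above can be stated in a few lines: given only the index of the first failure per region, the client's only possible retry policy re-sends Puts that already succeeded. A hypothetical helper, for illustration only:

{code}
import java.util.List;

public class MultiPutRetrySketch {
  // All the MultiPutResponse reports per region is firstFailedIndex, so
  // the client must retry everything from that index on -- including
  // Puts that actually succeeded, which are then re-applied and silently
  // gain an extra version.
  static <P> List<P> toRetry(List<P> putsForRegion, int firstFailedIndex) {
    return putsForRegion.subList(firstFailedIndex, putsForRegion.size());
  }
}
{code}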
[jira] Updated: (HBASE-2898) MultiPut makes proper error handling impossible and leads to corrupted data
[ https://issues.apache.org/jira/browse/HBASE-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HBASE-2898:
----------------------------------

Description:

tl;dr version: I think the {{MultiPut}} RPC needs to be completely rewritten. The current code makes it totally impossible for an HBase client to do proper error handling. When an edit fails, the client has no clue as to what the problem was (certain error cases can be retried, others cannot, e.g. when using a non-existent family), and the client doesn't even know which of the edits have been applied successfully. So the client often has to retry edits without knowing whether they've been applied or not, which leads to extra unwanted versions for the {{KeyValue}} that were successfully applied (for those who care about versions, this is essentially equivalent to data corruption). In addition, there's no way for a client to properly handle {{NotServingRegionException}}; the client has to unnecessarily invalidate cached locations of some regions and retry *all* edits.

h2. Life of a failed multi-put

Let's see, step by step, what happens when a single edit in a multi-put fails.

# An HBase user calls any of the {{put}} methods on an {{HTable}} instance.
# Eventually, {{HTable#flushCommits}} is invoked to actually send the edits to the RegionServer(s).
# This takes us to {{HConnectionManager#processBatchOfPuts}}, where all edits are sorted into one or more {{MultiPut}}. Each {{MultiPut}} aggregates all the edits that are going to a particular RegionServer.
# A thread pool is used to send all the {{MultiPut}} in parallel to their respective RegionServers. Let's follow what happens for a single {{MultiPut}}.
# The {{MultiPut}} travels through the IPC code on the client, then through the network, then through the IPC code on the RegionServer.
# We're now in {{HRegionServer#multiPut}}, where a new {{MultiPutResponse}} is created.
# Still in {{HRegionServer#multiPut}}. Since a {{MultiPut}} is essentially a map from region name to a list of {{Put}} for that region, there's a {{for}} loop that executes each list of {{Put}} for each region sequentially. Let's follow what happens for a single list of {{Put}} for a particular region.
# We're now in {{HRegionServer#put(byte[], List<Put>)}}. Each {{Put}} is associated with the row lock that was specified by the client (if any). Then the pairs of {{(Put, lock id)}} are handed to the right {{HRegion}}.
# Now we're in {{HRegion#put(Pair<Put, Integer>[])}}, which immediately takes us to {{HRegion#doMiniBatchPut}}.
# At this point, let's assume that we're doing just 2 edits, so the {{BatchOperationInProgress}} that {{doMiniBatchPut}} uses contains just 2 {{Put}}.
# The {{while}} loop in {{doMiniBatchPut}} that's going to execute each {{Put}} starts.
# The first {{Put}} fails because an exception is thrown when appending the edit to the {{WAL}}. Its {{batchOp.retCodes}} is marked as {{OperationStatusCode.FAILURE}}.
# Because there was an exception, we're back in {{HRegion#put(Pair<Put, Integer>[])}}, where the {{while}} loop will test that {{batchOp.isDone}} is {{false}} and do another iteration.
# {{doMiniBatchPut}} is called again and handles the remaining {{Put}}.
# The second {{Put}} succeeds normally, so its {{batchOp.retCodes}} is marked as {{OperationStatusCode.SUCCESS}}.
# {{doMiniBatchPut}} is done and returns to {{HRegion#put(Pair<Put, Integer>[])}}, which returns to {{HRegionServer#put(byte[], List<Put>)}}.
# At this point, {{HRegionServer#put(byte[], List<Put>)}} does a {{for}} loop and extracts the index of the *first* {{Put}} that failed out of the {{OperationStatusCode[]}}. In our case, it'll return 0 since the first {{Put}} failed.
# This index in the list of {{Put}} of the first one that failed (0 in this case) is returned to {{HRegionServer#multiPut}}, which records it in the {{MultiPutResponse}} -- the client knows that the first {{Put}} failed but has no idea about the other one.

So the client has no reliable way of knowing which {{Put}} failed (if any) past the first failure. All it knows is that for a particular region, they succeeded up to a particular {{Put}}, at which point there was a failure, and then the remaining may or may not have succeeded. Its best bet is to retry all the {{Put}} past the index of the first failure for this region. But this has an unintended consequence: the {{Put}} that were successful during the first run will be *re-applied*, which will unexpectedly create extra versions. Now, I realize most people don't really care about versions, so they won't notice. But whoever relies on the versions for whatever reason will rightfully consider this to be data corruption. As it is now, {{MultiPut}} makes proper error handling impossible. Since this RPC cannot guarantee any atomicity other than at the individual {{Put}} level, it should return to the client specific
[jira] Commented: (HBASE-1845) MultiGet, MultiDelete, and MultiPut - batched to the appropriate region servers
[ https://issues.apache.org/jira/browse/HBASE-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898493#action_12898493 ]

Benoit Sigoure commented on HBASE-1845:
---------------------------------------

Hello, I just became aware of this issue. I haven't read all the comments and haven't looked at the patches yet, but I'd like to draw your attention to HBASE-2898 so you can make sure that whatever you do, you don't reproduce that issue. It'd be nice if this issue solved HBASE-2898 as a side effect of rewriting multiPut as part of the multi-everything code. I'll take a look at the code proposed here when time permits.

MultiGet, MultiDelete, and MultiPut - batched to the appropriate region servers
--------------------------------------------------------------------------------

Key: HBASE-1845
URL: https://issues.apache.org/jira/browse/HBASE-1845
Project: HBase
Issue Type: New Feature
Reporter: Erik Holstad
Fix For: 0.90.0
Attachments: batch.patch, hbase-1845_0.20.3.patch, hbase-1845_0.20.5.patch, multi-v1.patch

I've started to create a general interface for doing these batch/multi calls and would like to get some input and thoughts about how we should handle this and what the protocol should look like. First naive patch coming soon.
[jira] Assigned: (HBASE-1849) HTable doesn't work well at the core of a multi-threaded server; e.g. webserver
[ https://issues.apache.org/jira/browse/HBASE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure reassigned HBASE-1849:
-------------------------------------

Assignee: Benoit Sigoure

I've been working on this for the past 2 weeks, although I'm guessing that my solution won't be really satisfactory for this issue. I wrote another HBase client from scratch, and it's been written from the ground up to work well in a multi-threaded environment. I'll open-source it in a few days, stay tuned.

HTable doesn't work well at the core of a multi-threaded server; e.g. webserver
--------------------------------------------------------------------------------

Key: HBASE-1849
URL: https://issues.apache.org/jira/browse/HBASE-1849
Project: HBase
Issue Type: Improvement
Reporter: stack
Assignee: Benoit Sigoure

HTable must do the following:
+ Sit in a shell or simple client -- e.g. a Map or Reduce task -- and feed and read from HBase single-threadedly. It does this job OK.
+ Sit at the core of a multithreaded server (100s of threads) -- a webserver or Thrift gateway -- and keep the throughput high. It's currently not good at this job.

In the way of our achieving the second item above are the following:
+ HTable must seek out and cache region locations. It keeps the cache down in HConnectionManager. One is shared by all HTable instances if the HTable instance was made with the same HBaseConfiguration instance. Lookup of regions is inside a synchronized block; if the region wanted is in the cache, the lock is held a short time. Otherwise, we must wait until the trip to the server completes (which may require retries). Meantime all other work is blocked, even if we're using HTablePool.
+ Regardless of the identity of the HBaseConfiguration, Hadoop RPC has ONE Connection open to a server at a time; requests and responses are multiplexed over this single connection.

Broken stuff:
+ Puts are synchronized to protect the write buffer, so only one thread at a time appends, but flushCommits is open for any thread to call. Once the write buffer is full, all Puts block until it's freed again. This looks like a hang if there are hundreds of threads and each write is to a random region in a big table and each write has to have its region looked up. (There may be some other brokenness in here, because this bottleneck seems to last longer than it should even with hundreds of threads.)

Ideas:
+ Query of the cache does not block all access to the cache. We only block access if the wanted region is being looked up, so other reads and writes to regions we know the location of can go ahead.
+ nio'd client and server
[jira] Commented: (HBASE-1849) HTable doesn't work well at the core of a multi-threaded server; e.g. webserver
[ https://issues.apache.org/jira/browse/HBASE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898496#action_12898496 ]

ryan rawson commented on HBASE-1849:
------------------------------------

Some of the original complaints have been fixed. HTablePool does some things. The advice has generally been: don't share HTable between threads. The granularity of the locks in HCM was improved, and while it's not all better, there have been substantial improvements since this issue was filed.

HTable doesn't work well at the core of a multi-threaded server; e.g. webserver
--------------------------------------------------------------------------------

Key: HBASE-1849
URL: https://issues.apache.org/jira/browse/HBASE-1849
Project: HBase
Issue Type: Improvement
Reporter: stack
Assignee: Benoit Sigoure

HTable must do the following:
+ Sit in a shell or simple client -- e.g. a Map or Reduce task -- and feed and read from HBase single-threadedly. It does this job OK.
+ Sit at the core of a multithreaded server (100s of threads) -- a webserver or Thrift gateway -- and keep the throughput high. It's currently not good at this job.

In the way of our achieving the second item above are the following:
+ HTable must seek out and cache region locations. It keeps the cache down in HConnectionManager. One is shared by all HTable instances if the HTable instance was made with the same HBaseConfiguration instance. Lookup of regions is inside a synchronized block; if the region wanted is in the cache, the lock is held a short time. Otherwise, we must wait until the trip to the server completes (which may require retries). Meantime all other work is blocked, even if we're using HTablePool.
+ Regardless of the identity of the HBaseConfiguration, Hadoop RPC has ONE Connection open to a server at a time; requests and responses are multiplexed over this single connection.

Broken stuff:
+ Puts are synchronized to protect the write buffer, so only one thread at a time appends, but flushCommits is open for any thread to call. Once the write buffer is full, all Puts block until it's freed again. This looks like a hang if there are hundreds of threads and each write is to a random region in a big table and each write has to have its region looked up. (There may be some other brokenness in here, because this bottleneck seems to last longer than it should even with hundreds of threads.)

Ideas:
+ Query of the cache does not block all access to the cache. We only block access if the wanted region is being looked up, so other reads and writes to regions we know the location of can go ahead.
+ nio'd client and server
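The checkout/return discipline Ryan alludes to, sketched against the 0.89-era HTablePool API (the table name and column values are made up for the example):

{code}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PoolExample {
  public static void main(String[] args) throws Exception {
    HTablePool pool = new HTablePool(HBaseConfiguration.create(), 10);
    HTableInterface table = pool.getTable("t1"); // check out per thread; never share
    try {
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      table.put(put);
    } finally {
      pool.putTable(table); // return to the pool instead of closing
    }
  }
}
{code}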
[jira] Commented: (HBASE-1849) HTable doesn't work well at the core of a multi-threaded server; e.g. webserver
[ https://issues.apache.org/jira/browse/HBASE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898508#action_12898508 ]

stack commented on HBASE-1849:
------------------------------

@Benoît: Bring it on!

HTable doesn't work well at the core of a multi-threaded server; e.g. webserver
--------------------------------------------------------------------------------

Key: HBASE-1849
URL: https://issues.apache.org/jira/browse/HBASE-1849
Project: HBase
Issue Type: Improvement
Reporter: stack
Assignee: Benoit Sigoure

HTable must do the following:
+ Sit in a shell or simple client -- e.g. a Map or Reduce task -- and feed and read from HBase single-threadedly. It does this job OK.
+ Sit at the core of a multithreaded server (100s of threads) -- a webserver or Thrift gateway -- and keep the throughput high. It's currently not good at this job.

In the way of our achieving the second item above are the following:
+ HTable must seek out and cache region locations. It keeps the cache down in HConnectionManager. One is shared by all HTable instances if the HTable instance was made with the same HBaseConfiguration instance. Lookup of regions is inside a synchronized block; if the region wanted is in the cache, the lock is held a short time. Otherwise, we must wait until the trip to the server completes (which may require retries). Meantime all other work is blocked, even if we're using HTablePool.
+ Regardless of the identity of the HBaseConfiguration, Hadoop RPC has ONE Connection open to a server at a time; requests and responses are multiplexed over this single connection.

Broken stuff:
+ Puts are synchronized to protect the write buffer, so only one thread at a time appends, but flushCommits is open for any thread to call. Once the write buffer is full, all Puts block until it's freed again. This looks like a hang if there are hundreds of threads and each write is to a random region in a big table and each write has to have its region looked up. (There may be some other brokenness in here, because this bottleneck seems to last longer than it should even with hundreds of threads.)

Ideas:
+ Query of the cache does not block all access to the cache. We only block access if the wanted region is being looked up, so other reads and writes to regions we know the location of can go ahead.
+ nio'd client and server