[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism
[ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258369#comment-13258369 ]

Nicholas Telford commented on HBASE-3967:

Anyone know what the status of this issue is? It looks to have been completed in the latest patch (or perhaps even in HBASE-5440, not sure what Lars meant there) but not reviewed/accepted?

Support deletes in HFileOutputFormat based bulk import mechanism

Key: HBASE-3967
URL: https://issues.apache.org/jira/browse/HBASE-3967
Project: HBase
Issue Type: Sub-task
Reporter: Kannan Muthukkaruppan
Priority: Critical
Fix For: 0.96.0
Attachments: diff.patch

During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4966) Put/Delete values cannot be tested with MRUnit
[ https://issues.apache.org/jira/browse/HBASE-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197938#comment-13197938 ]

Nicholas Telford commented on HBASE-4966:

I'm working on a sensible implementation and I have a question. Currently, KeyValue#equals(Object) returns true if both KeyValues have the same row, irrespective of all other fields (family, qualifier, value, ts etc.). This appears to be for the convenience of using List<KeyValue>#contains(KeyValue) to check for an existing KeyValue for a row.

The problem I have with this is that it violates the method contract of Object#hashCode(), which states:

bq. If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

Since the {{KeyValue#hashCode()}} implementation is derived from {{KeyValue#getBuffer()}}, two KVs with the same key but different values would be considered equal but yield different hashCodes.

I can probably work around this, and I imagine it's out of the scope of this ticket to change it, but wouldn't it be a better idea to derive equality from all the KV fields and encapsulate the common use case for {{List<KeyValue>#contains(KeyValue)}} somewhere else?
Perhaps a sub-class of List that simply provides this useful facility:

{code:java}
import java.util.ArrayList;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

class KVList extends ArrayList<KeyValue> {
  public boolean containsRow(byte[] row) {
    for (KeyValue kv : this) {
      if (Bytes.equals(kv.getRow(), row)) {
        return true;
      }
    }
    // No KeyValue in the list matched the given row.
    return false;
  }
}
{code}

Put/Delete values cannot be tested with MRUnit

Key: HBASE-4966
URL: https://issues.apache.org/jira/browse/HBASE-4966
Project: HBase
Issue Type: Bug
Components: client, mapreduce
Affects Versions: 0.90.4
Reporter: Nicholas Telford
Priority: Minor

When using the IdentityTableReducer, which expects input values of either a Put or Delete object, testing the Mapper with MRUnit is not possible because neither Put nor Delete implements equals(). We should implement equals() on both such that equality means:
* Both objects are of the same class (in this case, Put or Delete)
* Both objects are for the same key.
* Both objects contain an equal set of KeyValues (applicable only to Put)

KeyValue.equals() appears to already be implemented, but only checks for equality of row key, column family and column qualifier - two KeyValues can be considered equal even if they contain different values. This won't work for testing. Instead, the Put.equals() and Delete.equals() implementations should do a deep equality check on their KeyValues, like this:

{code:java}
myKv.equals(theirKv) && Bytes.equals(myKv.getValue(), theirKv.getValue());
{code}

NOTE: This would impact any code that relies on the existing identity implementation of Put.equals() and Delete.equals(), and therefore cannot be guaranteed to be backwards-compatible.
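The equals()/hashCode() mismatch raised in the comment above can be reproduced without HBase. The following is only a minimal sketch: `Cell` is a hypothetical stand-in (not the real KeyValue class), and it assumes equality on the row alone with a hash computed over both row and value, as the comment describes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a row/value pair; this is NOT the HBase KeyValue class.
final class Cell {
    private final byte[] row;
    private final byte[] value;

    Cell(String row, String value) {
        this.row = row.getBytes(StandardCharsets.UTF_8);
        this.value = value.getBytes(StandardCharsets.UTF_8);
    }

    // Assumption: equality looks at the row only, as the comment describes.
    @Override
    public boolean equals(Object other) {
        return other instanceof Cell && Arrays.equals(row, ((Cell) other).row);
    }

    // Assumption: the hash covers row and value, like a hash over the whole buffer.
    @Override
    public int hashCode() {
        return 31 * Arrays.hashCode(row) + Arrays.hashCode(value);
    }
}

final class EqualsHashCodeDemo {
    // True when two cells compare equal yet hash differently (contract broken).
    static boolean equalButDifferentHash() {
        Cell a = new Cell("row1", "x");
        Cell b = new Cell("row1", "y");
        return a.equals(b) && a.hashCode() != b.hashCode();
    }

    // True when a hash-based set fails to find a cell that equals() a stored one.
    static boolean missingFromHashSet() {
        Set<Cell> cells = new HashSet<>();
        cells.add(new Cell("row1", "x"));
        return !cells.contains(new Cell("row1", "y"));
    }

    public static void main(String[] args) {
        System.out.println("equal but different hash: " + equalButDifferentHash());
        System.out.println("missing from HashSet: " + missingFromHashSet());
    }
}
```

The second method shows the practical consequence: a List-based contains() would find the second cell, while a HashSet-based one would not, which is why deriving both methods from the same fields is the safer design.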
[jira] [Commented] (HBASE-4966) Put/Delete values cannot be tested with MRUnit
[ https://issues.apache.org/jira/browse/HBASE-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198018#comment-13198018 ]

Nicholas Telford commented on HBASE-4966:

The changes I'm proposing are to KeyValue#equals(); I think hashCode makes sense as it stands. I did look at changing it and it did break several tests. I began working on a patch to fix the affected tests but quickly realized I had no way to be sure other parts of the codebase don't rely on this behavior.

Put/Delete values cannot be tested with MRUnit

Key: HBASE-4966
URL: https://issues.apache.org/jira/browse/HBASE-4966
Project: HBase
Issue Type: Bug
Components: client, mapreduce
Affects Versions: 0.90.4
Reporter: Nicholas Telford
Assignee: Nicholas Telford
Priority: Minor
[jira] [Commented] (HBASE-5208) Allow setting Scan start/stop row individually in TableInputFormat
[ https://issues.apache.org/jira/browse/HBASE-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188406#comment-13188406 ]

Nicholas Telford commented on HBASE-5208:

Regarding test length, without my additions (i.e. clean trunk) the tests take a very long time. Most of the time is spent doing the original tests as they spin up 11 MapReduce jobs. I imagine running the tests in parallel might improve things, but I haven't tested that. My additional test adds another MapReduce job, so it will increase the test length, but not substantially.

Allow setting Scan start/stop row individually in TableInputFormat

Key: HBASE-5208
URL: https://issues.apache.org/jira/browse/HBASE-5208
Project: HBase
Issue Type: Improvement
Components: mapreduce
Reporter: Nicholas Telford
Priority: Minor
Attachments: HBASE-5208-001.txt, HBASE-5208-002.txt, HBASE-5208-003.txt, HBASE-5208-004.txt

Currently, TableInputFormat initializes a serialized Scan from hbase.mapreduce.scan. Alternatively, it will instantiate a new Scan using properties defined in hbase.mapreduce.scan.*. However, of these properties the start row and stop row (arguably the most pertinent) are missing.

TableInputFormat should permit the specification of a start/stop row as with the other fields using a new pair of properties: hbase.mapreduce.scan.row.start and hbase.mapreduce.scan.row.end.

The primary use-case for this is to permit Oozie and other job management tools that can't call TableMapReduceUtil.initTableMapperJob() to operate on a contiguous subset of rows.
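To illustrate the idea in the description above, here is a rough, HBase-free sketch of resolving a row range from the two proposed property names. Everything other than the property names is an assumption made for illustration: a plain `Map` stands in for the job configuration, `RowRange` and `fromProperties` are hypothetical names, and an empty byte array is used to mean "unbounded". It is not the code from the attached patches.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class ScanRangeSketch {
    // Property names as proposed in the issue description.
    static final String SCAN_ROW_START = "hbase.mapreduce.scan.row.start";
    static final String SCAN_ROW_END = "hbase.mapreduce.scan.row.end";

    // Hypothetical holder for the resolved range; an empty array means "unbounded".
    static final class RowRange {
        final byte[] start;
        final byte[] stop;

        RowRange(byte[] start, byte[] stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    // Reads each bound individually, so either one may be omitted.
    static RowRange fromProperties(Map<String, String> conf) {
        return new RowRange(toBytes(conf.get(SCAN_ROW_START)), toBytes(conf.get(SCAN_ROW_END)));
    }

    private static byte[] toBytes(String value) {
        return value == null ? new byte[0] : value.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SCAN_ROW_START, "row-0100"); // only the start row is set
        RowRange range = fromProperties(conf);
        System.out.println("start bytes: " + range.start.length + ", stop bytes: " + range.stop.length);
    }
}
```

Because the bounds are plain string properties, a tool such as Oozie could set them in job configuration without calling any Java API.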
[jira] [Commented] (HBASE-5208) Allow setting Scan start/stop row individually in TableInputFormat
[ https://issues.apache.org/jira/browse/HBASE-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187683#comment-13187683 ]

Nicholas Telford commented on HBASE-5208:

Not entirely sure why there are (unrelated) tests failing. Looking at the error, they all appear to be caused by the following. Can someone verify whether or not this is caused by something in my patch?

java.lang.NumberFormatException: For input string: 18446743988250694508
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
  at java.lang.Long.parseLong(Long.java:422)
  at java.lang.Long.parseLong(Long.java:468)
  at org.apache.hadoop.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:413)
  at org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessTree(ProcfsBasedProcessTree.java:148)
  at org.apache.hadoop.util.LinuxResourceCalculatorPlugin.getProcResourceValues(LinuxResourceCalculatorPlugin.java:401)
  at org.apache.hadoop.mapred.Task.initialize(Task.java:536)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
  at org.apache.hadoop.mapred.Child.main(Child.java:249)

As for the findbugs and JavaDoc issues: JavaDoc is reporting a negative number of problems, so I'm disregarding it. Findbugs doesn't seem to be finding anything in my new code, although it's difficult to be sure given the volume of warnings.
Allow setting Scan start/stop row individually in TableInputFormat

Key: HBASE-5208
URL: https://issues.apache.org/jira/browse/HBASE-5208
Project: HBase
Issue Type: Improvement
Components: mapreduce
Reporter: Nicholas Telford
Priority: Minor
Attachments: HBASE-5208-001.txt, HBASE-5208-002.txt, HBASE-5208-003.txt, HBASE-5208-004.txt
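One observation on the NumberFormatException quoted in the comment above: the input string is larger than Long.MAX_VALUE (9223372036854775807) but still fits in 64 unsigned bits, so a signed parse is bound to reject it. The sketch below only demonstrates that arithmetic with standard JDK calls (Java 8 or later for `Long.parseUnsignedLong`); it does not diagnose why Hadoop received such a value, and the class and method names are made up for illustration.

```java
import java.math.BigInteger;

final class UnsignedParseDemo {
    // The input string taken from the stack trace in the comment.
    static final String VALUE = "18446743988250694508";

    // True when Long.parseLong rejects the string, as in the stack trace.
    static boolean signedParseFails(String s) {
        try {
            Long.parseLong(s);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    // True when the value lies above the signed 64-bit maximum.
    static boolean exceedsLongMax(String s) {
        return new BigInteger(s).compareTo(BigInteger.valueOf(Long.MAX_VALUE)) > 0;
    }

    // True when the value can be parsed as unsigned 64-bit and printed back unchanged.
    static boolean roundTripsAsUnsigned(String s) {
        long bits = Long.parseUnsignedLong(s);
        return Long.toUnsignedString(bits).equals(s);
    }

    public static void main(String[] args) {
        System.out.println("signed parse fails: " + signedParseFails(VALUE));
        System.out.println("exceeds Long.MAX_VALUE: " + exceedsLongMax(VALUE));
        System.out.println("round-trips as unsigned: " + roundTripsAsUnsigned(VALUE));
    }
}
```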
[jira] [Commented] (HBASE-5208) Allow setting Scan start/stop row individually in TableInputFormat
[ https://issues.apache.org/jira/browse/HBASE-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186962#comment-13186962 ]

Nicholas Telford commented on HBASE-5208:

Tests were excluded from the patch because, for now, I'm unable to get the large tests to run in my environment, even from a clean trunk. I do have a patch with tests, but I'm not happy submitting them until I can get them working.

Allow setting Scan start/stop row individually in TableInputFormat

Key: HBASE-5208
URL: https://issues.apache.org/jira/browse/HBASE-5208
Project: HBase
Issue Type: Improvement
Components: mapreduce
Reporter: Nicholas Telford
Priority: Minor
Attachments: HBASE-5208-001.txt, HBASE-5208-002.txt
[jira] [Commented] (HBASE-5208) Allow setting Scan start/stop row individually in TableInputFormat
[ https://issues.apache.org/jira/browse/HBASE-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187071#comment-13187071 ]

Nicholas Telford commented on HBASE-5208:

That was my intention. I can extract that out to an intermediary method if that's preferable; however, that doesn't really solve the problem that doubling the number of MR jobs spun up causes the test to time out. Any ideas on that one?

Allow setting Scan start/stop row individually in TableInputFormat

Key: HBASE-5208
URL: https://issues.apache.org/jira/browse/HBASE-5208
Project: HBase
Issue Type: Improvement
Components: mapreduce
Reporter: Nicholas Telford
Priority: Minor
Attachments: HBASE-5208-001.txt, HBASE-5208-002.txt, HBASE-5208-003.txt