[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

2012-04-20 Thread Nicholas Telford (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258369#comment-13258369
 ] 

Nicholas Telford commented on HBASE-3967:
-

Does anyone know the status of this issue? It looks to have been completed 
in the latest patch (or perhaps even in HBASE-5440; I'm not sure what Lars meant 
there), but not yet reviewed or accepted.

 Support deletes in HFileOutputFormat based bulk import mechanism
 

 Key: HBASE-3967
 URL: https://issues.apache.org/jira/browse/HBASE-3967
 Project: HBase
  Issue Type: Sub-task
Reporter: Kannan Muthukkaruppan
Priority: Critical
 Fix For: 0.96.0

 Attachments: diff.patch


 During bulk imports, it'll be useful to be able to do delete mutations 
 (either to delete data that already exists in HBase or was inserted earlier 
 during this run of the import). 
 For example, we have a use case, where we are processing a log of data which 
 may have both inserts and deletes in the mix and we want to upload that into 
 HBase using the bulk import mechanism.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4966) Put/Delete values cannot be tested with MRUnit

2012-02-01 Thread Nicholas Telford (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197938#comment-13197938
 ] 

Nicholas Telford commented on HBASE-4966:
-

I'm working on a sensible implementation and I have a question.

Currently, KeyValue#equals(Object) returns true if both KeyValues have the same 
row, irrespective of all other fields (family, qualifier, value, ts etc.).

This appears to be for the convenience case of using 
List<KeyValue>#contains(KeyValue) to check for an existing KeyValue for a row.

The problem I have with this is that it violates the method contract of 
Object#hashCode() which states: 

bq. If two objects are equal according to the equals(Object) method, then 
calling the hashCode method on each of the two objects must produce the same 
integer result. 

Since the {{KeyValue#hashCode()}} implementation is derived from 
{{KeyValue#getBuffer()}}, two KVs with the same key but different values would 
be considered equal but yield different hashCodes.

I can probably work around this, and I imagine it's out of the scope of this 
ticket to change it, but wouldn't it be a better idea to derive equality from 
all the KV fields and encapsulate the common use case for 
{{List<KeyValue>#contains(KeyValue)}} somewhere else? Perhaps a sub-class of 
List that simply provides this useful facility:

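{code:java}
import java.util.ArrayList;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

class KVList extends ArrayList<KeyValue> {
  // Returns true if any KeyValue in this list belongs to the given row.
  public boolean containsRow(byte[] row) {
    for (KeyValue kv : this) {
      if (Bytes.equals(kv.getRow(), row)) {
        return true;
      }
    }
    return false;
  }
}
{code}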
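
To illustrate the contract problem described above, here is a minimal, self-contained sketch. It does not use the HBase classes: RowOnlyCell and EqualsHashCodeSketch are hypothetical stand-ins, assuming an equals() that compares the row only and a hashCode() derived from both row and value. It shows why list lookups and hash-based lookups can disagree in that situation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for KeyValue (not the HBase class): equals() looks at
// the row only, while hashCode() is derived from the row and the value.
final class RowOnlyCell {
  private final byte[] row;
  private final byte[] value;

  RowOnlyCell(String row, String value) {
    this.row = row.getBytes(StandardCharsets.UTF_8);
    this.value = value.getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof RowOnlyCell
        && Arrays.equals(row, ((RowOnlyCell) other).row);
  }

  @Override
  public int hashCode() {
    return 31 * Arrays.hashCode(row) + Arrays.hashCode(value);
  }
}

final class EqualsHashCodeSketch {
  // Same row, different values.
  static final RowOnlyCell A = new RowOnlyCell("row1", "v1");
  static final RowOnlyCell B = new RowOnlyCell("row1", "v2");

  static boolean listContains() {
    List<RowOnlyCell> list = Collections.singletonList(A);
    return list.contains(B); // relies on equals() only
  }

  static boolean hashSetContains() {
    Set<RowOnlyCell> set = new HashSet<>(Collections.singletonList(A));
    return set.contains(B); // compares hash codes before calling equals()
  }

  public static void main(String[] args) {
    System.out.println("equals: " + A.equals(B));
    System.out.println("same hashCode: " + (A.hashCode() == B.hashCode()));
    System.out.println("List.contains: " + listContains());
    System.out.println("HashSet.contains: " + hashSetContains());
  }
}
```

The list lookup finds the element because it only calls equals(), whereas the hash-based lookup misses it because the hash codes differ. Any code that puts such objects into a HashSet or uses them as HashMap keys would be affected in the same way.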

 Put/Delete values cannot be tested with MRUnit
 --

 Key: HBASE-4966
 URL: https://issues.apache.org/jira/browse/HBASE-4966
 Project: HBase
  Issue Type: Bug
  Components: client, mapreduce
Affects Versions: 0.90.4
Reporter: Nicholas Telford
Priority: Minor

 When using the IdentityTableReducer, which expects input values of either a 
 Put or Delete object, testing the Mapper with MRUnit is not 
 possible because neither Put nor Delete implements equals().
 We should implement equals() on both such that equality means:
 * Both objects are of the same class (in this case, Put or Delete)
 * Both objects are for the same key.
 * Both objects contain an equal set of KeyValues (applicable only to Put)
 KeyValue.equals() appears to already be implemented, but only checks for 
 equality of row key, column family and column qualifier - two KeyValues can 
 be considered equal if they contain different values. This won't work for 
 testing.
 Instead, the Put.equals() and Delete.equals() implementations should do a 
 deep equality check on their KeyValues, like this:
 {code:java}
 myKv.equals(theirKv) && Bytes.equals(myKv.getValue(), theirKv.getValue());
 {code}
 NOTE: This would impact any code that relies on the existing identity 
 implementation of Put.equals() and Delete.equals(), therefore cannot be 
 guaranteed to be backwards-compatible.
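
 The three equality criteria listed above could be sketched roughly as follows. This is an illustration only, not the HBase implementation: SketchCell and SketchPut are hypothetical, simplified stand-ins for KeyValue and Put, and the cell here carries just a qualifier and a value.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for KeyValue: equality covers the value as well.
final class SketchCell {
  private final byte[] qualifier;
  private final byte[] value;

  SketchCell(String qualifier, String value) {
    this.qualifier = qualifier.getBytes(StandardCharsets.UTF_8);
    this.value = value.getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof SketchCell)) return false;
    SketchCell c = (SketchCell) o;
    return Arrays.equals(qualifier, c.qualifier) && Arrays.equals(value, c.value);
  }

  @Override
  public int hashCode() {
    // Derived from the same fields as equals(), so the two stay consistent.
    return 31 * Arrays.hashCode(qualifier) + Arrays.hashCode(value);
  }
}

// Hypothetical stand-in for Put.
final class SketchPut {
  private final byte[] row;
  private final Set<SketchCell> cells = new HashSet<>();

  SketchPut(String row) {
    this.row = row.getBytes(StandardCharsets.UTF_8);
  }

  SketchPut add(String qualifier, String value) {
    cells.add(new SketchCell(qualifier, value));
    return this;
  }

  @Override
  public boolean equals(Object o) {
    if (o == null || getClass() != o.getClass()) return false; // same class
    SketchPut p = (SketchPut) o;
    return Arrays.equals(row, p.row)   // same key
        && cells.equals(p.cells);      // equal set of cells, values included
  }

  @Override
  public int hashCode() {
    return 31 * Arrays.hashCode(row) + cells.hashCode();
  }
}
```

 With this shape, two puts for the same row compare equal only when their cells match including values, which is the comparison a test framework needs when checking expected against actual output.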





[jira] [Commented] (HBASE-4966) Put/Delete values cannot be tested with MRUnit

2012-02-01 Thread Nicholas Telford (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198018#comment-13198018
 ] 

Nicholas Telford commented on HBASE-4966:
-

The changes I'm proposing are to KeyValue#equals(); I think hashCode makes 
sense as it stands.

I did look at changing it and it did break several tests. I began working on a 
patch to fix the affected tests but quickly realized I had no way to be sure 
other parts of the codebase don't rely on this behavior.

 Put/Delete values cannot be tested with MRUnit
 --

 Key: HBASE-4966
 URL: https://issues.apache.org/jira/browse/HBASE-4966
 Project: HBase
  Issue Type: Bug
  Components: client, mapreduce
Affects Versions: 0.90.4
Reporter: Nicholas Telford
Assignee: Nicholas Telford
Priority: Minor





[jira] [Commented] (HBASE-5208) Allow setting Scan start/stop row individually in TableInputFormat

2012-01-18 Thread Nicholas Telford (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188406#comment-13188406
 ] 

Nicholas Telford commented on HBASE-5208:
-

Regarding test length, without my additions (i.e. clean trunk) the tests take a 
very long time. Most of the time is spent doing the original tests as they spin 
up 11 MapReduce jobs. I imagine running the tests in parallel might improve 
things, but I haven't tested that.

My additional test adds another MapReduce job, so it will increase the test 
length, but not substantially.

 Allow setting Scan start/stop row individually in TableInputFormat
 --

 Key: HBASE-5208
 URL: https://issues.apache.org/jira/browse/HBASE-5208
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Nicholas Telford
Priority: Minor
 Attachments: HBASE-5208-001.txt, HBASE-5208-002.txt, 
 HBASE-5208-003.txt, HBASE-5208-004.txt


 Currently, TableInputFormat initializes a serialized Scan from 
 hbase.mapreduce.scan. Alternatively, it will instantiate a new Scan using 
 properties defined in hbase.mapreduce.scan.*. However, of these properties 
 the start row and stop row (arguably the most pertinent) are missing.
 TableInputFormat should permit the specification of a start/stop row as with 
 the other fields using a new pair of properties: 
 hbase.mapreduce.scan.row.start and hbase.mapreduce.scan.row.end
 The primary use-case for this is to permit Oozie and other job management 
 tools that can't call TableMapReduceUtil.initTableMapperJob() to operate on a 
 contiguous subset of rows.
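
 As a rough illustration of the proposal above, the sketch below reads the two proposed properties from a configuration object. It is an assumption-laden stand-in rather than the patch itself: java.util.Properties takes the place of the Hadoop job configuration, ScanRangeSketch and readRange are hypothetical names, and an unset property is assumed to mean "no bound on that side".

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class ScanRangeSketch {
  // Property names as proposed in this issue.
  static final String SCAN_ROW_START = "hbase.mapreduce.scan.row.start";
  static final String SCAN_ROW_END = "hbase.mapreduce.scan.row.end";

  // Hypothetical helper: returns {startRow, stopRow}. An unset property is
  // mapped to an empty array, assumed here to mean "no bound on that side".
  static byte[][] readRange(Properties conf) {
    return new byte[][] {
      toBytes(conf.getProperty(SCAN_ROW_START)),
      toBytes(conf.getProperty(SCAN_ROW_END))
    };
  }

  private static byte[] toBytes(String s) {
    return s == null ? new byte[0] : s.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(SCAN_ROW_START, "row-100");
    conf.setProperty(SCAN_ROW_END, "row-200");
    byte[][] range = readRange(conf);
    System.out.println(new String(range[0], StandardCharsets.UTF_8)
        + " .. " + new String(range[1], StandardCharsets.UTF_8));
  }
}
```

 A tool such as Oozie would only need to set the two string properties in the job configuration; no call into TableMapReduceUtil is involved in this sketch.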





[jira] [Commented] (HBASE-5208) Allow setting Scan start/stop row individually in TableInputFormat

2012-01-17 Thread Nicholas Telford (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187683#comment-13187683
 ] 

Nicholas Telford commented on HBASE-5208:
-

Not entirely sure why there are (unrelated) tests failing. Looking at the 
error, they all appear to be caused by the following. Can someone verify 
whether or not this is caused by something in my patch?

java.lang.NumberFormatException: For input string: "18446743988250694508"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:422)
    at java.lang.Long.parseLong(Long.java:468)
    at org.apache.hadoop.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:413)
    at org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessTree(ProcfsBasedProcessTree.java:148)
    at org.apache.hadoop.util.LinuxResourceCalculatorPlugin.getProcResourceValues(LinuxResourceCalculatorPlugin.java:401)
    at org.apache.hadoop.mapred.Task.initialize(Task.java:536)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
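
For what it's worth, the exception itself is easy to reproduce outside Hadoop. The sketch below only shows why Long.parseLong rejects that particular string (the value is larger than Long.MAX_VALUE); it says nothing about where the value comes from or whether the patch is involved. UnsignedParseSketch is a hypothetical name used for illustration.

```java
import java.math.BigInteger;

final class UnsignedParseSketch {
  // The input string taken from the stack trace above.
  static final String INPUT = "18446743988250694508";

  // True if the string cannot be parsed as a signed 64-bit long.
  static boolean failsAsSignedLong(String s) {
    try {
      Long.parseLong(s);
      return false;
    } catch (NumberFormatException e) {
      return true;
    }
  }

  // True if the decimal value is larger than Long.MAX_VALUE.
  static boolean exceedsLongMax(String s) {
    return new BigInteger(s).compareTo(BigInteger.valueOf(Long.MAX_VALUE)) > 0;
  }

  public static void main(String[] args) {
    System.out.println("fails as signed long: " + failsAsSignedLong(INPUT));
    System.out.println("exceeds Long.MAX_VALUE: " + exceedsLongMax(INPUT));
  }
}
```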

As for the findbugs and JavaDoc issues: JavaDoc is reporting a negative number 
of problems, so I'm disregarding it. Findbugs doesn't seem to be finding 
anything in my new code, although it's difficult to be sure given the volume of 
warnings.

 Allow setting Scan start/stop row individually in TableInputFormat
 --

 Key: HBASE-5208
 URL: https://issues.apache.org/jira/browse/HBASE-5208
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Nicholas Telford
Priority: Minor
 Attachments: HBASE-5208-001.txt, HBASE-5208-002.txt, 
 HBASE-5208-003.txt, HBASE-5208-004.txt






[jira] [Commented] (HBASE-5208) Allow setting Scan start/stop row individually in TableInputFormat

2012-01-16 Thread Nicholas Telford (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186962#comment-13186962
 ] 

Nicholas Telford commented on HBASE-5208:
-

Tests were excluded from the patch because, for now, I'm unable to get the large 
tests to run in my environment, even from a clean trunk. I do have a patch with 
tests, but I'm not happy submitting them until I can get them working.

 Allow setting Scan start/stop row individually in TableInputFormat
 --

 Key: HBASE-5208
 URL: https://issues.apache.org/jira/browse/HBASE-5208
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Nicholas Telford
Priority: Minor
 Attachments: HBASE-5208-001.txt, HBASE-5208-002.txt






[jira] [Commented] (HBASE-5208) Allow setting Scan start/stop row individually in TableInputFormat

2012-01-16 Thread Nicholas Telford (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187071#comment-13187071
 ] 

Nicholas Telford commented on HBASE-5208:
-

That was my intention. I can extract that out to an intermediary method if 
that's preferable. However, that doesn't really solve the problem that doubling 
the number of MR jobs spun up causes the test to time out. Any ideas on that one?

 Allow setting Scan start/stop row individually in TableInputFormat
 --

 Key: HBASE-5208
 URL: https://issues.apache.org/jira/browse/HBASE-5208
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Nicholas Telford
Priority: Minor
 Attachments: HBASE-5208-001.txt, HBASE-5208-002.txt, 
 HBASE-5208-003.txt

