[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2014-06-10 Thread David Koch (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026369#comment-14026369
 ] 

David Koch commented on HBASE-5140:
---

{quote}
Stale issue. Reopen if still relevant.
{quote}

Why is this deemed irrelevant? Is there new functionality in recent HBase 
versions which supersedes this class? By the way, in method 
{{getMaxByteArrayValue}} the array value assignment should read:

{code}
bytes[i] = (byte) 0xff;
{code}
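For context, here is a minimal sketch of how such a helper would look with the cast in place. The method name follows the comment above; the body is an assumption on my part, since the patch itself is not quoted here. The cast matters because the int literal {{0xff}} (255) is outside the range of {{byte}} and will not compile without it, while {{(byte) 0xff}} yields -1, i.e. all bits set:

```java
public class MaxKey {

    // Hypothetical reconstruction of getMaxByteArrayValue: returns a row key
    // of the given length with every byte set to 0xff (the maximum unsigned
    // byte value, which Java represents as the signed value -1).
    static byte[] getMaxByteArrayValue(int length) {
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = (byte) 0xff; // without the cast this does not compile
        }
        return bytes;
    }

    public static void main(String[] args) {
        byte[] max = getMaxByteArrayValue(4);
        System.out.println(max.length);
        System.out.println(max[0]); // byte is promoted to int, so -1 is printed
    }
}
```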

 TableInputFormat subclass to allow N number of splits per region during MR 
 jobs
 ---

 Key: HBASE-5140
 URL: https://issues.apache.org/jira/browse/HBASE-5140
 Project: HBase
  Issue Type: New Feature
  Components: mapreduce
Affects Versions: 0.90.4
Reporter: Josh Wymer
Priority: Trivial
  Labels: mapreduce, split
 Attachments: 
 Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch,
  
 Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch.1,
  Added_functionality_to_split_n_times_per_region_on_mapreduce_jobs.patch

   Original Estimate: 72h
  Remaining Estimate: 72h

 In regards to [HBASE-5138|https://issues.apache.org/jira/browse/HBASE-5138] I 
 am working on a patch for the TableInputFormat class that overrides getSplits 
 in order to generate N splits per region and/or N splits per job. The idea is 
 to convert the startKey and endKey of each region from byte[] to BigDecimal, 
 take the difference, divide by N, convert back to byte[] and generate splits 
 at the resulting values. Assuming your keys are uniformly distributed, this 
 should produce splits with nearly the same number of rows each. Any 
 suggestions on this issue are welcome.
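The key arithmetic described above can be sketched as follows, interpreting row keys as unsigned big-endian integers. The method name and signature are illustrative only; this is not the attached patch:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class RegionSplitSketch {

    // Sketch of the proposed getSplits logic: treat the region's start and end
    // keys as unsigned integers, cut the range into n equal sub-ranges, and
    // return the n-1 intermediate boundary keys.
    static List<byte[]> splitKeys(byte[] startKey, byte[] endKey, int n) {
        BigInteger start = new BigInteger(1, startKey); // signum 1 = unsigned
        BigInteger end = new BigInteger(1, endKey);
        BigInteger step = end.subtract(start).divide(BigInteger.valueOf(n));
        List<byte[]> boundaries = new ArrayList<>();
        for (int i = 1; i < n; i++) {
            byte[] b = start.add(step.multiply(BigInteger.valueOf(i))).toByteArray();
            // BigInteger may prepend a zero byte for the sign bit; strip it so
            // the boundary compares correctly as a raw HBase row key.
            if (b.length > 1 && b[0] == 0) {
                byte[] trimmed = new byte[b.length - 1];
                System.arraycopy(b, 1, trimmed, 0, trimmed.length);
                b = trimmed;
            }
            boundaries.add(b);
        }
        return boundaries;
    }
}
```

Note that with very short or identical keys the step can be zero, so a real implementation would need to pad keys to a common length and skip degenerate boundaries.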



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-8202) MultiTableOutputFormat should support writing to another HBase cluster

2013-03-31 Thread David Koch (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13618389#comment-13618389
 ] 

David Koch commented on HBASE-8202:
---

Hello,

I asked the original question on the mailing list. Here is a minimal example 
to illustrate the behavior. Run with $quorum != $output_quorum for maximum 
effect ;-).

HBase version was 0.92.1-cdh4.1.1. 

{code:title=Example.java}
package org.hbase.example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Test to show how the hbase.mapred.output.quorum setting is ignored with
 * {@link MultiTableOutputFormat}.
 *
 * @author davidkoch
 *
 * See: https://issues.apache.org/jira/browse/HBASE-8202
 *
 * Hadoop/HBase configurations are read from the command line. Replace the
 * environment variables below.
 *
 * 1. Test with {@link TableOutputFormat} (Ok):
 *
 *  hadoop jar $jar_name org.hbase.example.Example \
 *  -D hbase.zookeeper.quorum=$quorum \
 *  -D hbase.zookeeper.property.clientPort=2181 \
 *  -D hbase.mapreduce.inputtable=$input_table \
 *  -D hbase.mapreduce.scan.column.family=$colfam \
 *  -D hbase.mapred.outputtable=$output_table \
 *  -D mapreduce.outputformat.class=org.apache.hadoop.hbase.mapreduce.TableOutputFormat \
 *  -D hbase.mapred.output.quorum=$output_quorum:2181:/hbase
 *
 * 2. Test with {@link MultiTableOutputFormat} (Fails):
 *
 *  hadoop jar $jar_name org.hbase.example.Example \
 *  -D hbase.zookeeper.quorum=$quorum \
 *  -D hbase.zookeeper.property.clientPort=2181 \
 *  -D hbase.mapreduce.inputtable=$input_table \
 *  -D hbase.mapreduce.scan.column.family=$colfam \
 *  -D hbase.mapred.outputtable=$output_table \
 *  -D mapreduce.outputformat.class=org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat \
 *  -D hbase.mapred.output.quorum=$output_quorum:2181:/hbase
 *
 * In the second example, the job itself will not fail if $output_table exists
 * on $quorum, but $output_quorum will be ignored.
 */
public class Example extends Configured implements Tool {

    public static class ExampleMapper extends TableMapper<ImmutableBytesWritable, Put> {
        ImmutableBytesWritable tableName;

        @Override
        public void setup(Context context) {
            tableName = new ImmutableBytesWritable(
                    context.getConfiguration().get("hbase.mapred.outputtable").getBytes());
        }

        public void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            Put put = new Put(row.get());
            for (KeyValue kv : value.raw()) {
                put.add(kv);
            }
            context.write(tableName, put);
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        Scan scan = new Scan();
        scan.addFamily(conf.get("hbase.mapreduce.scan.column.family").getBytes());
        String inTable = conf.get("hbase.mapreduce.inputtable");

        Job job = new Job(conf);
        job.setJobName("Example-HBASE-8202");
        TableMapReduceUtil.initTableMapperJob(inTable, scan,
                ExampleMapper.class, null, null, job);
        job.setJarByClass(Example.class);
        job.setNumReduceTasks(0);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Example(), args);
        System.exit(res);
    }
}
{code}

 MultiTableOutputFormat should support writing to another HBase cluster
 --

 Key: HBASE-8202
 URL: https://issues.apache.org/jira/browse/HBASE-8202
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Ted Yu

 This was brought up by David Koch in thread 'hbase.mapred.output.quorum 
 ignored in Mapper job with HDFS source and HBase sink' where he wanted to 
 import a file on HDFS from one cluster A (source) into HBase
 tables on a different cluster B (destination) using a Mapper job with an
 HBase sink.
 Here is my analysis:
 MultiTableOutputFormat doesn't extend TableOutputFormat:
 {code}
 public class MultiTableOutputFormat extends 
 OutputFormat<ImmutableBytesWritable, Mutation> {