Hello,

  You can ignore this if you're already rock solid on writing M/R jobs, but 
just in case you're as new to this as I am: 

Be careful that you have all your dependencies lined up in the jar you're 
building your M/R job into.  If you're using Eclipse, this means selecting 
"Extract required libraries into generated jar" when you export.
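
If you'd rather not bundle everything into one jar, another option is to ship 
the extra jars with Hadoop's -libjars generic option. A minimal sketch, 
assuming your driver implements Hadoop's Tool interface so the generic options 
get parsed; the class and jar names here are hypothetical:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver. Implementing Tool makes Hadoop parse the generic
// options, so you can run e.g.:
//   hadoop jar myjob.jar JobDriver -libjars hbase.jar,zookeeper.jar
public class JobDriver extends Configured implements Tool
{
    public int run( String[] args ) throws Exception
    {
        // ...build and submit the Job here, as in the example further down...
        return 0;
    }

    public static void main( String[] args ) throws Exception
    {
        System.exit( ToolRunner.run( new JobDriver(), args ) );
    }
}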

Without this you get strange "map class not found" errors, similar to what you 
see when you forget to make your map class static or forget to call 
setJarByClass() on your job.
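
The static part matters because Hadoop instantiates your mapper by reflection, 
and a non-static inner class has no usable no-argument constructor. A quick 
illustration (class names are hypothetical):

public class Outer
{
    // Broken: a non-static inner class needs an enclosing instance, so the
    // framework's reflective instantiation fails at runtime.
    public class BrokenMapper extends TableMapper<Text, IntWritable> { }

    // Works: nested mapper/reducer classes must be declared static.
    public static class WorkingMapper extends TableMapper<Text, IntWritable> { }
}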

All the examples I saw that used the *new API* were a little more complicated 
than they needed to be. A stripped-down example with the new API:

public static class Mapper extends TableMapper<Text, IntWritable>
{
    @Override
    public void map( ImmutableBytesWritable key, Result value, Context context )
            throws IOException, InterruptedException
    {
        // Don't forget to load this as UTF-8
        String sha256 = new String( key.get(), "UTF-8" );
        // Just calling value.value() will NOT give you what you want
        byte[] valueBuffer = value.getValue( Bytes.toBytes(/*family*/),
                Bytes.toBytes(/*qualifier*/) );
        /* Do stuff */
        context.write( /*some Text*/, /*some IntWritable*/ );
    }
}

public static class Reduce extends TableReducer<Text, IntWritable, Text>
{
    @Override
    public void reduce( Text key, Iterable<IntWritable> values, Context context )
            throws IOException, InterruptedException
    {
        int count = 0;
        for ( IntWritable value : values )
            count += value.get();

        // The output of a reduce task needs to be a [something], Put object pair
        Put outputRow = new Put( Bytes.toBytes("row key") );
        outputRow.add( Bytes.toBytes(/*output family*/),
                Bytes.toBytes(/*output qualifier*/), Bytes.toBytes(count) );
        context.write( /*some Text*/, outputRow );
    }
}

public static void main( String[] argv ) throws Exception
{
    HBaseConfiguration configuration = new HBaseConfiguration();
    Job validateJob = new Job( configuration, /*job name*/ );
    // Don't forget this!
    validateJob.setJarByClass( /*main class*/.class );

    // Don't add any columns and it will scan everything (according to the docs)
    Scan scan = new Scan();
    scan.addColumn( Bytes.toBytes(/*input family*/),
            Bytes.toBytes(/*input qualifier*/) );

    TableMapReduceUtil.initTableMapperJob( /*input table name*/, scan,
            Mapper.class, Text.class, IntWritable.class, validateJob );
    TableMapReduceUtil.initTableReducerJob( /*output table name*/,
            Reduce.class, validateJob );

    System.exit( validateJob.waitForCompletion(true) ? 0 : 1 );
}

But look at the examples! I just thought some simple highlights might help. 
Don't forget that you can issue Put()s from your map() tasks if you already 
have the data you need assembled (just open a connection in the map 
constructor):

        super();
        this.hbaseConfiguration = new HBaseConfiguration();
        this.hbaseConfiguration.set( "hbase.master", "ubuntu-namenode:60000" );
        this.fileMetadataTable = new HTable( hbaseConfiguration, /*tableName*/ );

and issue the Put() in your map() method. This can take the load off your 
reduce() tasks, which may speed things up a bit.
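
With the new API you can do the same thing without a custom constructor: open 
the table in setup() and flush it in cleanup(). A minimal sketch, using the 
same classes as the example above plus org.apache.hadoop.hbase.client.HTable; 
the table and column names are hypothetical:

public static class PutWritingMapper extends TableMapper<Text, IntWritable>
{
    private HTable outputTable;

    @Override
    protected void setup( Context context ) throws IOException
    {
        // Open the connection once per task, not once per record
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.set( "hbase.master", "ubuntu-namenode:60000" );
        this.outputTable = new HTable( conf, "output_table" );
    }

    @Override
    public void map( ImmutableBytesWritable key, Result value, Context context )
            throws IOException, InterruptedException
    {
        Put put = new Put( key.get() );
        put.add( Bytes.toBytes("family"), Bytes.toBytes("qualifier"),
                value.getValue( Bytes.toBytes("family"), Bytes.toBytes("qualifier") ) );
        // Write straight to the table; no context.write() needed when the
        // map output goes directly into HBase
        outputTable.put( put );
    }

    @Override
    protected void cleanup( Context context ) throws IOException
    {
        outputTable.flushCommits(); // push any client-side buffered writes
    }
}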

Caveat emptor:
I just started on all this stuff. ;)

Hope it helps.

Take care,
  -stu



--- On Mon, 7/19/10, Hegner, Travis <[email protected]> wrote:

> From: Hegner, Travis <[email protected]>
> Subject: RE: Run MR job when my data stays in hbase?
> To: "[email protected]" <[email protected]>
> Date: Monday, July 19, 2010, 11:55 AM
> Also make sure that the
> $HBASE_HOME/hbase-<version>.jar,
> $HBASE_HOME/lib/zookeeper-<version>.jar, and the
> $HBASE_HOME/conf/ are all on the classpath in your
> $HADOOP_HOME/conf/hadoop-env.sh file. That configuration
> must be cluster wide.
> 
> With that, your map and reduce tasks can access zookeeper
> and hbase objects. You can then use the TableInputFormat
> with TableOutputFormat, or you can use TableInputFormat, and
> your reduce tasks can write data directly back into HBase.
> Your problem, and your dataset, will dictate which of
> those methods is more efficient.
> 
> Travis Hegner
> http://www.travishegner.com/

> 
> -----Original Message-----
> From: Andrey Stepachev [mailto:[email protected]]
> Sent: Monday, July 19, 2010 9:28 AM
> To: [email protected]
> Subject: Re: Run MR job when my data stays in hbase?
> 
> 2010/7/19 elton sky <[email protected]>:
> 
> > My question is if I wanna run the backgroup process as
> a MR job, can I get
> > data from hbase, rather than hdfs, with hadoop? How do
> I do that?
> > I appreciate if anyone can provide some simple example
> code.
> 
> Look at org.apache.hadoop.hbase.mapreduce package in hbase
> sources
> and as real example:
> org.apache.hadoop.hbase.mapreduce.RowCounter
> 


