It's supposed to happen automatically.  The JIRA issue below mentions one case 
where it wasn't, and explains how I detected it and worked around it.  To make 
sure you're getting locality, look at the task tracker and check that, for each 
of your map tasks, the host executing the task matches the input split 
location.
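
If you want to check from the HBase side as well, here is a minimal sketch that 
prints which host serves each region, so you can compare against the hosts the 
task tracker assigns your map tasks to.  It assumes a recent HBase client API 
(Connection/RegionLocator); older clients spell this differently, and the class 
name is just illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class RegionLocality {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         RegionLocator locator =
             conn.getRegionLocator(TableName.valueOf("hbase_table2"))) {
      // One line per region: region name -> region server host.
      for (HRegionLocation loc : locator.getAllRegionLocations()) {
        System.out.println(loc.getRegionInfo().getRegionNameAsString()
            + " -> " + loc.getHostname());
      }
    }
  }
}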

JVS

On Dec 10, 2010, at 10:10 AM, vlisovsky wrote:

> Thanks for the info. Also, how can we make sure that our region servers are 
> running on the same nodes as the datanodes (locality)? Is there a way to 
> verify this? 
> 
> On Thu, Dec 9, 2010 at 11:09 PM, John Sichi <jsi...@fb.com> wrote:
> Try
> 
> set hbase.client.scanner.caching=5000;
> 
> Also, check to make sure that you are getting the expected locality so that 
> mappers are running on the same nodes as the region servers they are scanning 
> (assuming that you are running HBase and mapreduce on the same cluster).  
> When I was testing this, I encountered this problem (but it may have been 
> specific to our cluster configurations):
> 
> https://issues.apache.org/jira/browse/HBASE-2535
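> 
> For reference, that Hive property corresponds to the per-scan caching knob in 
> the HBase Java client; a minimal sketch (the class name is illustrative):
> 
> import org.apache.hadoop.hbase.client.Scan;
> 
> public class ScanSettings {
>   public static Scan exportScan() {
>     Scan scan = new Scan();
>     scan.setCaching(5000);      // rows fetched per RPC instead of the small default
>     scan.setCacheBlocks(false); // don't churn the region server block cache on a full scan
>     return scan;
>   }
> }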
> 
> JVS
> 
> On Dec 9, 2010, at 10:46 PM, vlisovsky wrote:
> 
> >
> > Hi Guys,
> > Wondering if anybody could shed some light on how to reduce the load on an 
> > HBase cluster when running a full scan.
> > The need is to dump everything I have in HBase into a Hive table. The HBase 
> > data size is around 500 GB.
> > The job creates 9000 mappers; after about 1000 maps, things go south every 
> > time.
> > If I run the insert below, it runs for about 30 minutes and then starts 
> > bringing down the HBase cluster, after which the region servers need to be 
> > restarted.
> > Is there a way to throttle it somehow, or is there some other method of 
> > getting structured data out? (I've sketched the only alternative I can 
> > think of after the script below.)
> > Any help is appreciated,
> > Thanks,
> > -Vitaly
> >
> > -- external Hive table over the existing HBase table;
> > -- ":key,info:" maps mykey to the row key and info to the whole info: family
> > create external table hbase_linked_table (
> >   mykey string,
> >   info  map<string, string>
> > )
> > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:")
> > TBLPROPERTIES ("hbase.table.name" = "hbase_table2");
> >
> > -- compress the dumped output
> > set hive.exec.compress.output=true;
> > set io.seqfile.compression.type=BLOCK;
> > set mapred.output.compression.type=BLOCK;
> > set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> >
> > -- the map task count actually follows the number of HBase regions,
> > -- so mapred.map.tasks is at best a hint here
> > set mapred.reduce.tasks=40;
> > set mapred.map.tasks=25;
> >
> > INSERT overwrite table tmp_hive_destination
> > select * from hbase_linked_table;
> >
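> > As mentioned above, the only alternative I can think of is a plain 
> > MapReduce dump along these lines. This is just a sketch, assuming a 
> > Hadoop 2 / HBase 1.x style client (the class names and output path are 
> > made up), with the scanner caching tuned as suggested:
> >
> > import java.io.IOException;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> > import org.apache.hadoop.hbase.client.Result;
> > import org.apache.hadoop.hbase.client.Scan;
> > import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> > import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> > import org.apache.hadoop.hbase.mapreduce.TableMapper;
> > import org.apache.hadoop.hbase.util.Bytes;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapreduce.Job;
> > import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> > import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
> >
> > public class HBaseDump {
> >   // Emits just the row key per row; the info: family would be serialized here too.
> >   static class DumpMapper extends TableMapper<Text, Text> {
> >     @Override
> >     protected void map(ImmutableBytesWritable row, Result value, Context context)
> >         throws IOException, InterruptedException {
> >       context.write(new Text(Bytes.toString(row.get())), new Text(""));
> >     }
> >   }
> >
> >   public static void main(String[] args) throws Exception {
> >     Configuration conf = HBaseConfiguration.create();
> >     Job job = Job.getInstance(conf, "hbase-dump");
> >     job.setJarByClass(HBaseDump.class);
> >     Scan scan = new Scan();
> >     scan.setCaching(5000);      // fewer RPC round trips
> >     scan.setCacheBlocks(false); // don't evict the region servers' block cache
> >     TableMapReduceUtil.initTableMapperJob(
> >         "hbase_table2", scan, DumpMapper.class, Text.class, Text.class, job);
> >     job.setNumReduceTasks(0);   // map-only dump
> >     job.setOutputFormatClass(TextOutputFormat.class);
> >     FileOutputFormat.setOutputPath(job, new Path("/tmp/hbase_dump"));
> >     System.exit(job.waitForCompletion(true) ? 0 : 1);
> >   }
> > }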
> 
> 
