It's going to be fairly difficult imho. What you need to look at is regions. Tables are split into regions, and each region is assigned to a region server (i.e. an HBase node). Reads and writes are directed to the region server owning the region. Regions can move from one region server to another; that's the job of the load balancer. Regions can also be split at any moment. In the HBase client API you don't really see these regions: they're managed internally by HBase (my guess is that the locations are available anyway, but I'm not sure). If you want locality, you need to run the user code on the region server owning the region you're reading or writing, but that could be a premature and costly optimization.
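
On the metadata question: here is a minimal sketch of what I mean by the locations being available, assuming a 1.0-style client API. "my_table" is just a placeholder table name.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrintRegionLocations {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "my_table" is a placeholder; substitute your own table name.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             RegionLocator locator =
                 conn.getRegionLocator(TableName.valueOf("my_table"))) {
          for (HRegionLocation loc : locator.getAllRegionLocations()) {
            // Each location carries the region's key range and current host,
            // which tells you on which node a given row key resides.
            System.out.println(loc.getRegionInfo().getRegionNameAsString()
                + " [" + Bytes.toStringBinary(loc.getRegionInfo().getStartKey())
                + ", " + Bytes.toStringBinary(loc.getRegionInfo().getEndKey())
                + ") -> " + loc.getHostname());
          }
        }
      }
    }

Keep in mind the result is only a snapshot: the balancer can move a region, or a split can happen, between the moment you read the locations and the moment you act on them.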
Nicolas

On Wed, Mar 4, 2015 at 6:46 AM, Gokul Balakrishnan <[email protected]> wrote:
> Hello,
>
> I'm fairly new to HBase so would be grateful for any assistance.
>
> My project is as follows: use HBase as an underlying data store for an
> analytics cluster (powered by Apache Spark).
>
> In doing this, I'm wondering how I may set about leveraging the locality of
> the HBase data during processing (in other words, if the Spark instance is
> running on a node that also houses HBase data, how to make use of the local
> data first).
>
> Is there some form of metadata offered by the Java API which I could then
> use to organise the data into (virtual) groups based on the locality to be
> passed forward to Spark? It could be something that *identifies on which
> node a particular row resides*. I found [1] but I'm not sure if this is
> what I'm looking for. Could someone please point me in the right direction?
>
> [1] https://issues.apache.org/jira/browse/HBASE-12361
>
> Thanks so much!
> Gokul Balakrishnan.
>
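
PS, on passing the groups forward to Spark: if I remember correctly, TableInputFormat already attaches the hosting region server to each input split, so Spark's newAPIHadoopRDD picks up the locality hints without extra work on your side. A rough sketch, again with "my_table" as a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class HBaseSparkLocality {
      public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("hbase-locality");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        Configuration conf = HBaseConfiguration.create();
        // "my_table" is a placeholder; substitute your own table name.
        conf.set(TableInputFormat.INPUT_TABLE, "my_table");

        // One partition per region; each split names the region server
        // hosting it, so Spark tries to schedule the task on that node.
        JavaPairRDD<ImmutableBytesWritable, Result> rows =
            sc.newAPIHadoopRDD(conf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);

        System.out.println("row count: " + rows.count());
        sc.stop();
      }
    }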
