It's going to be fairly difficult imho. What you need to look at is regions. Tables are split into regions, and each region is assigned to a region server (i.e. an HBase node). Reads and writes are directed to the region server owning the region. Regions can move from one region server to another; that's the job of the load balancer. Regions can also be split at any moment. In the HBase client API you don't really see these regions: they're managed internally by HBase (my guess is that the locations are available anyway, but I'm not sure). If you want locality, you need to run the user code on the region server owning the region you're reading or writing, but that could be a premature and costly optimization.
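
On the metadata question: here is a minimal sketch of what I mean by the locations being available, assuming a 1.0-style client API. "my_table" is just a placeholder table name.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrintRegionLocations {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "my_table" is a placeholder; substitute your own table name.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             RegionLocator locator =
                 conn.getRegionLocator(TableName.valueOf("my_table"))) {
          for (HRegionLocation loc : locator.getAllRegionLocations()) {
            // Each location carries the region's key range and current host,
            // which tells you on which node a given row key resides.
            System.out.println(loc.getRegionInfo().getRegionNameAsString()
                + " [" + Bytes.toStringBinary(loc.getRegionInfo().getStartKey())
                + ", " + Bytes.toStringBinary(loc.getRegionInfo().getEndKey())
                + ") -> " + loc.getHostname());
          }
        }
      }
    }

Keep in mind the result is only a snapshot: the balancer can move a region, or a split can happen, between the moment you read the locations and the moment you act on them.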
Nicolas

On Wed, Mar 4, 2015 at 6:46 AM, Gokul Balakrishnan <[email protected]> wrote:
> Hello,
>
> I'm fairly new to HBase so would be grateful for any assistance.
>
> My project is as follows: use HBase as an underlying data store for an
> analytics cluster (powered by Apache Spark).
>
> In doing this, I'm wondering how I may set about leveraging the locality of
> the HBase data during processing (in other words, if the Spark instance is
> running on a node that also houses HBase data, how to make use of the local
> data first).
>
> Is there some form of metadata offered by the Java API which I could then
> use to organise the data into (virtual) groups based on the locality to be
> passed forward to Spark? It could be something that *identifies on which
> node a particular row resides*. I found [1] but I'm not sure if this is
> what I'm looking for. Could someone please point me in the right direction?
>
> [1] https://issues.apache.org/jira/browse/HBASE-12361
>
> Thanks so much!
> Gokul Balakrishnan.
>
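
PS, on passing the groups forward to Spark: if I remember correctly, TableInputFormat already attaches the hosting region server to each input split, so Spark's newAPIHadoopRDD picks up the locality hints without extra work on your side. A rough sketch, again with "my_table" as a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class HBaseSparkLocality {
      public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("hbase-locality");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        Configuration conf = HBaseConfiguration.create();
        // "my_table" is a placeholder; substitute your own table name.
        conf.set(TableInputFormat.INPUT_TABLE, "my_table");

        // One partition per region; each split names the region server
        // hosting it, so Spark tries to schedule the task on that node.
        JavaPairRDD<ImmutableBytesWritable, Result> rows =
            sc.newAPIHadoopRDD(conf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);

        System.out.println("row count: " + rows.count());
        sc.stop();
      }
    }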
