Hello, I'm fairly new to HBase and would appreciate your comments.
[1] One way to compute across an HBase dataset would be to run as many maps as there are regions and, inside each map method, run a scan bounded by that region's start and end row keys. This approach does not use TableInputFormat. In the reduce (if needed), write directly back to the table using Put.

[2] In the second approach I would use TableInputFormat and TableOutputFormat. (A rough driver sketch of what I mean by [2] is pasted below my sign-off.)

My hypotheses:

H1: As far as TableOutputFormat goes, I think the two approaches are equivalent performance-wise. Correct me if I'm wrong.

H2: As for TableInputFormat vs. approach [1]: a quick glance through the TableSplit source reveals that it carries location information. At first blush I can imagine that in approach [1] I might scan from row_start to row_end while all of that data resides on a machine other than the compute node running the map. Since TableInputFormat (approach [2]) exposes region locations, my guess (not sure at all) is that Hadoop MapReduce will schedule each map task on the node hosting its region, so the scan will read local data, achieving data locality. So it seems to make sense to take advantage of (at the very least) the TableSplit information.

Are my hypotheses correct?

Thanks
Joy
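
P.S. To make approach [2] concrete, here is roughly the job setup I have in mind: a minimal, untested sketch against a recent HBase/Hadoop API. The table names "source_table" and "dest_table" and the CopyMapper class are placeholders, not anything from a real cluster.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class CopyTableSketch {

  // Mapper receives one row (row key + Result) per map() call. TableInputFormat
  // creates one split per region, so each map task covers exactly one region,
  // and the TableSplit's location hints let the scheduler place it on that node.
  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(rowKey.get());
      for (Cell cell : columns.rawCells()) {
        put.add(cell);
      }
      context.write(rowKey, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "copy-table-sketch");
    job.setJarByClass(CopyTableSketch.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner caching helps scan throughput in MR jobs
    scan.setCacheBlocks(false);  // avoid polluting the block cache with a full-table scan

    // Sets TableInputFormat, the mapper, and the scan to use.
    TableMapReduceUtil.initTableMapperJob(
        "source_table", scan, CopyMapper.class,
        ImmutableBytesWritable.class, Put.class, job);

    // Sets TableOutputFormat; with a null reducer and zero reduce tasks the Puts
    // emitted by the mapper are written straight to the destination table.
    TableMapReduceUtil.initTableReducerJob("dest_table", null, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}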
