Hello, I'm fairly new to HBase and would appreciate your comments.
[1] One way to compute across an HBase dataset would be to run as many maps as there are regions and, inside each map method, run a scan bounded by that region's start and end row keys. This approach does not use TableInputFormat. In the reduce (if needed), write directly back to the table using Put.

[2] In the second approach I would use TableInputFormat and TableOutputFormat. (A rough driver sketch of what I mean by [2] is pasted below my sign-off.)

My hypotheses:

H1: As far as TableOutputFormat goes, I think the two approaches are equivalent performance-wise. Correct me if I'm wrong.

H2: As for TableInputFormat vs. approach [1]: a quick glance through the TableSplit source reveals that it carries location information. At first blush I can imagine that in approach [1] I might scan from row_start to row_end while all of that data resides on a machine other than the compute node running the map. Since TableInputFormat (approach [2]) exposes region locations, my guess (not sure at all) is that Hadoop MapReduce will schedule each map task on the node hosting its region, so the scan will read local data, achieving data locality. So it seems to make sense to take advantage of (at the very least) the TableSplit information.

Are my hypotheses correct?

Thanks
Joy
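
P.S. To make approach [2] concrete, here is roughly the job setup I have in mind: a minimal, untested sketch against a recent HBase/Hadoop API. The table names "source_table" and "dest_table" and the CopyMapper class are placeholders, not anything from a real cluster.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class CopyTableSketch {

  // Mapper receives one row (row key + Result) per map() call. TableInputFormat
  // creates one split per region, so each map task covers exactly one region,
  // and the TableSplit's location hints let the scheduler place it on that node.
  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(rowKey.get());
      for (Cell cell : columns.rawCells()) {
        put.add(cell);
      }
      context.write(rowKey, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "copy-table-sketch");
    job.setJarByClass(CopyTableSketch.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner caching helps scan throughput in MR jobs
    scan.setCacheBlocks(false);  // avoid polluting the block cache with a full-table scan

    // Sets TableInputFormat, the mapper, and the scan to use.
    TableMapReduceUtil.initTableMapperJob(
        "source_table", scan, CopyMapper.class,
        ImmutableBytesWritable.class, Put.class, job);

    // Sets TableOutputFormat; with a null reducer and zero reduce tasks the Puts
    // emitted by the mapper are written straight to the destination table.
    TableMapReduceUtil.initTableReducerJob("dest_table", null, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}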
