Hi Joy,

[1] is what [2] does. The [2] classes are just a thin wrapper around the raw API.
And as Alex pointed out and you noticed too, [2] adds the benefit of locality support. If you were to add that to [1], you would have [2].

Lars

On Thu, Nov 18, 2010 at 5:30 AM, Saptarshi Guha <[email protected]> wrote:
> Hello,
>
> I'm fairly new to HBase and would appreciate your comments.
>
> [1] One way to compute across an HBase dataset would be to run as many
> maps as there are regions; in each map, run a scan across that region's
> row limits (within the map method). This approach does not use
> TableInputFormat. In the reduce (if needed), write directly (using put)
> to the table.
>
> [2] In the *second* approach I could use the TableInputFormat and
> TableOutputFormat.
>
> My hypotheses:
>
> H1: As for TableOutputFormat, I think both approaches are equivalent
> performance-wise. Correct me if I'm wrong.
>
> H2: As for TableInputFormat vs. approach [1]: a quick glance through the
> TableSplit source reveals location information. At first blush, I can
> imagine that in approach [1] I scan from row_start to row_end while all
> of that data resides on a computer different from the compute node on
> which the split is being run. Since TableInputFormat (approach [2]) uses
> region information, my guess (not sure at all) is that Hadoop MapReduce
> will assign the computation to the node where the region lies, so when
> the scan is issued the queries will run against local data, achieving
> data locality. So it makes sense to take advantage of (at the least) the
> TableSplit information.
>
> Are my hypotheses correct?
>
> Thanks
> Joy
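For anyone following along, the locality mechanism in H2 can be sketched in plain Java. The real HBase TableSplit implements InputSplit.getLocations() to report the region server hosting each region, and the Hadoop scheduler prefers running a map task on one of those hosts. The class SimpleSplit and the method pickNode below are hypothetical stand-ins, not HBase APIs; this is just a minimal model of the scheduling decision, assuming one preferred host per split.

```java
import java.util.Arrays;
import java.util.List;

public class LocalityDemo {
    // Hypothetical stand-in for a TableSplit: it knows its row range and
    // which server hosts the underlying region.
    static class SimpleSplit {
        final byte[] startRow, endRow;
        final String regionLocation; // host serving this region

        SimpleSplit(byte[] startRow, byte[] endRow, String regionLocation) {
            this.startRow = startRow;
            this.endRow = endRow;
            this.regionLocation = regionLocation;
        }

        // Mirrors InputSplit.getLocations(): hosts where the data is local.
        String[] getLocations() {
            return new String[] { regionLocation };
        }
    }

    // Toy scheduler: run the map task on the split's preferred host when it
    // has a free slot; otherwise fall back to any node (a remote scan, as in
    // approach [1], which carries no location hint at all).
    static String pickNode(SimpleSplit split, List<String> nodesWithFreeSlots) {
        for (String preferred : split.getLocations()) {
            if (nodesWithFreeSlots.contains(preferred)) {
                return preferred; // data-local assignment
            }
        }
        return nodesWithFreeSlots.get(0); // non-local assignment
    }

    public static void main(String[] args) {
        SimpleSplit split =
            new SimpleSplit("aaa".getBytes(), "mmm".getBytes(), "rs-node-2");
        // Locality hint honored when the hosting node has capacity:
        System.out.println(pickNode(split, Arrays.asList("rs-node-1", "rs-node-2")));
        // If the hosting node is busy, the scan runs remotely:
        System.out.println(pickNode(split, Arrays.asList("rs-node-3")));
    }
}
```

The point of the sketch: approach [2] gets locality for free because each split carries its region's host, whereas in approach [1] the splits carry no such hint, so the framework has no basis for a data-local assignment.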
