Hi Lars,

Perfect, thanks for confirming. I have some existing code to which I want to add HBase support with minimal modifications to the original code base. I think I need to provide an InputFormat that produces TableSplits.
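For what it's worth, the usual shortcut for wiring a table-backed map phase into an existing job is TableMapReduceUtil, which installs TableInputFormat (and hence locality-aware TableSplits) for you. A rough sketch of that wiring, assuming the standard org.apache.hadoop.hbase.mapreduce API ("mytable" and MyMapper are placeholders for the existing code base, not real names):

```java
// Sketch only: requires the HBase client/mapreduce jars on the classpath.
Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "existing-analysis");
job.setJarByClass(MyMapper.class);

Scan scan = new Scan();
scan.setCaching(500);        // fetch scanner rows in larger batches for MR
scan.setCacheBlocks(false);  // don't pollute the block cache with a full scan

// Installs TableInputFormat on the job; the splits it produces carry
// region location information, which is what gives you locality.
TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
```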
On a side note, I feel the Key and Value types in the map, reduce, and record reader methods should be interfaces rather than classes (I guess there is a reason for the change). Keys/values should conform to a contract, but do they need to sit in a class hierarchy?

Cheers
Joy

On Wed, Nov 17, 2010 at 11:55 PM, Lars George <[email protected]> wrote:
> Hi Joy,
>
> [1] is what [2] does. They are just a thin wrapper around the raw API.
>
> And as Alex pointed out and you noticed too, [2] adds the benefit of
> locality support. If you were to add this to [1] then you would have [2].
>
> Lars
>
> On Thu, Nov 18, 2010 at 5:30 AM, Saptarshi Guha
> <[email protected]> wrote:
>> Hello,
>>
>> I'm fairly new to HBase and would appreciate your comments.
>>
>> [1] One way to compute across an HBase dataset would be to run as many
>> maps as there are regions and, for each map, run a scan across the
>> region's row limits (within the map method). This approach does not use
>> TableInputFormat. In the reduce (if needed), write directly to the table
>> using Put.
>>
>> [2] In the *second* approach I could use TableInputFormat and
>> TableOutputFormat.
>>
>> My hypotheses:
>>
>> H1: As for TableOutputFormat, I think the two approaches are equivalent
>> performance-wise. Correct me if I'm wrong.
>>
>> H2: As for TableInputFormat vs. approach [1]: a quick glance through the
>> TableSplit source reveals location information. At first blush I can
>> imagine that in approach [1] I scan from row_start to row_end while all
>> of that data resides on a machine different from the compute node on
>> which the split is being run. Since TableInputFormat (approach [2]) uses
>> region information, my guess (not sure at all) is that Hadoop MapReduce
>> will assign the computation to the node where the region lies, so when
>> the scan is issued the queries will run against local data, achieving
>> data locality. So it makes sense to take advantage of (at the least) the
>> TableSplit information.
>>
>> Are my hypotheses correct?
>>
>> Thanks
>> Joy
>>
>
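Hypothesis H2 is essentially a claim about locality-aware task placement: the scheduler tries to run each map on the node named in the split's location list, falling back to a remote node only when the local one is busy. A toy illustration of that idea in plain Java (no HBase; Split, schedule, and the node names are all made up for the sketch, loosely mirroring what TableSplit.getLocations() reports):

```java
import java.util.*;

// Toy model of H2: prefer running each map task on the node that hosts
// the split's region, so the scan reads local data.
class LocalityDemo {
    // A split knows which host serves its region.
    record Split(String table, String startRow, String host) {}

    // Greedy placement: use the split's own host while it has a free
    // slot; otherwise fall back to any node with capacity (remote read).
    static Map<Split, String> schedule(List<Split> splits,
                                       Map<String, Integer> slots) {
        Map<Split, String> placement = new LinkedHashMap<>();
        Map<String, Integer> free = new HashMap<>(slots);
        for (Split s : splits) {
            String node = free.getOrDefault(s.host(), 0) > 0
                    ? s.host()
                    : free.entrySet().stream()
                          .filter(e -> e.getValue() > 0)
                          .map(Map.Entry::getKey)
                          .findFirst().orElseThrow();
            free.merge(node, -1, Integer::sum);
            placement.put(s, node);
        }
        return placement;
    }

    // How many maps ended up data-local?
    static long localCount(Map<Split, String> placement) {
        return placement.entrySet().stream()
                .filter(e -> e.getKey().host().equals(e.getValue()))
                .count();
    }

    public static void main(String[] args) {
        List<Split> splits = List.of(
                new Split("t", "a", "node1"),
                new Split("t", "m", "node2"),
                new Split("t", "s", "node2"));
        Map<Split, String> p =
                schedule(splits, Map.of("node1", 2, "node2", 2));
        System.out.println("local maps: " + localCount(p)
                + "/" + splits.size());  // prints "local maps: 3/3"
    }
}
```

Approach [1] corresponds to splits with no usable host field: the framework has nothing to match against, so scans land wherever the task happens to run.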
