Hi Joy,

[1] is essentially what [2] does: TableInputFormat and TableOutputFormat are
just thin wrappers around the raw API.

And as Alex pointed out and you noticed too, [2] adds the benefit of
locality support. If you were to add this to [1], you would end up with
[2].
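
To make the locality point concrete, here is a toy sketch in plain Java. It
is not HBase or Hadoop code: RegionSplit, pickNode and the node names are
hypothetical, and the scheduler is a deliberate simplification. It only
assumes what is discussed in this thread, namely that a split in approach
[2] carries the location of its region, while a hand-rolled scan in
approach [1] gives the framework no such hint.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class LocalitySketch {

    /** Hypothetical stand-in for a table split: a row range plus the host serving it. */
    public static final class RegionSplit {
        public final String startRow;
        public final String endRow;
        public final String host; // location hint; approach [1] has no equivalent

        public RegionSplit(String startRow, String endRow, String host) {
            this.startRow = startRow;
            this.endRow = endRow;
            this.host = host;
        }
    }

    /**
     * Toy scheduler: prefer the host named by the split if it has a free slot,
     * otherwise fall back to any free node (which means a remote scan).
     * Assumes freeNodes is not empty.
     */
    public static String pickNode(RegionSplit split, Set<String> freeNodes) {
        if (freeNodes.contains(split.host)) {
            return split.host;
        }
        return freeNodes.iterator().next();
    }

    public static void main(String[] args) {
        RegionSplit[] splits = {
            new RegionSplit("row-a", "row-m", "node-a"),
            new RegionSplit("row-m", "row-z", "node-b"),
        };
        Set<String> freeNodes = new TreeSet<>(Arrays.asList("node-a", "node-c"));

        for (RegionSplit split : splits) {
            String node = pickNode(split, freeNodes);
            System.out.println("[" + split.startRow + ", " + split.endRow + ") runs on "
                    + node + ", local=" + node.equals(split.host));
        }
    }
}
```

Without the host field the scheduler has nothing to prefer, so any node is
as good as another and the scan is usually remote. That missing hint is the
only real difference between [1] and [2].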

Lars

On Thu, Nov 18, 2010 at 5:30 AM, Saptarshi Guha
<[email protected]> wrote:
> Hello,
>
> I'm fairly new to HBase and would appreciate your comments.
>
> [1] One way to compute across an HBase dataset would be to run as many
> maps as regions and, in each map, run a scan across the region's row limits
> (within the map method). This approach does not use TableInputFormat. In the
> reduce (if needed), write directly to the table using put.
>
>
> [2] In the *second* approach I could use the TableInputFormat and
> TableOutputFormat.
>
> My hypotheses:
>
> H1: As for TableOutputFormat, I think both approaches are equivalent
> performance-wise. Correct me if I'm wrong.
>
> H2: As for TableInputFormat vs. approach [1]: a quick glance through the
> TableSplit source reveals location information. At first blush I can imagine
> that in approach [1] I scan from row_start to row_end, all the data of which
> resides on a computer different from the compute node on which the split is
> being run. Since TableInputFormat (approach [2]) uses region information, my
> guess (not sure at all) is that Hadoop MapReduce will assign the computation
> to the node where the region lies, so when the scan is issued the queries
> will be issued against local data, achieving data locality. So it makes sense
> to take advantage of (at the least) the TableSplit information.
>
> Are my hypotheses correct?
>
> Thanks
> Joy
>
