Right, that's to be expected with bulk batch jobs. The alternative is
keeping duplicate files in HDFS, and not being able to easily create or
manage them. Snapshots in the HBase format will be fine.
On May 6, 2011 5:19 PM, "Bill Graham" <[email protected]> wrote:
> One big reason is that there will be updates in the memory store that
> aren't yet written to HFiles. You'll miss these.
>
> On Fri, May 6, 2011 at 12:27 PM, Jason Rutherglen <[email protected]> wrote:
>
>> Is there an issue open or any particular reason that an MR job needs to
>> access the HBase data directly from the region server? It seems possible to
>> also provide functionality such that MR can execute over the HFile(s) stored
>> in HDFS, thereby giving performance characteristics comparable to typical
>> MR jobs that execute against files in HDFS.
>>
>> Jason
>>