Hi all,

Can anyone elaborate on the pitfalls or implications of running
MapReduce directly over HFiles, using an HFileInputFormat that
extends FileInputFormat?

I'm sure scanning goes through the RegionServer for good reasons
(guessing: handling splits, locking, RS monitoring, etc.), but can it
ever be "safe" to run MR over HFiles directly? E.g. for a scenario
like a region split, would the MR job just get stale data, or would
_bad_things_happen_?

For our use cases we could tolerate stale data and the occasional MR
failure when a node drops out, and if we can detect a region split we
can suspend MR jobs on the affected HFiles until the split finishes.
We don't anticipate huge daily growth, but we do expect a lot of
scanning and random access.

I knocked up a quick example porting the Scala version of HFIF [1] to
Java [2], and full data scans appear to be an order of magnitude
faster (30 min -> 3 min), but I suspect this is *fraught* with danger.
If not, I'd like to try and take this further, possibly with Hive.
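For anyone curious what I mean, here is a minimal sketch of the idea
(not my actual port at [2]): an InputFormat that opens each HFile with
the HFile.Reader API and emits its KeyValues. This assumes the
0.90-era HFile.Reader constructor; method names differ in later HBase
versions, and progress reporting is omitted for brevity.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class HFileInputFormat
    extends FileInputFormat<ImmutableBytesWritable, KeyValue> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // HFiles are block-indexed, not line-oriented: read each file whole.
    return false;
  }

  @Override
  public RecordReader<ImmutableBytesWritable, KeyValue> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new HFileRecordReader();
  }

  private static class HFileRecordReader
      extends RecordReader<ImmutableBytesWritable, KeyValue> {

    private HFile.Reader reader;
    private HFileScanner scanner;
    private boolean seeked = false;
    private KeyValue current;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException {
      Path path = ((FileSplit) split).getPath();
      FileSystem fs = path.getFileSystem(context.getConfiguration());
      // No block cache, not in-memory: we are bypassing the RS entirely.
      reader = new HFile.Reader(fs, path, null, false);
      reader.loadFileInfo();
      scanner = reader.getScanner(false, false);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      // First call positions at the start of the file; later calls advance.
      boolean hasNext = seeked ? scanner.next() : scanner.seekTo();
      seeked = true;
      if (hasNext) {
        current = scanner.getKeyValue();
      }
      return hasNext;
    }

    @Override
    public ImmutableBytesWritable getCurrentKey() {
      return new ImmutableBytesWritable(current.getRow());
    }

    @Override
    public KeyValue getCurrentValue() {
      return current;
    }

    @Override
    public float getProgress() {
      return 0; // progress reporting omitted in this sketch
    }

    @Override
    public void close() throws IOException {
      if (reader != null) {
        reader.close();
      }
    }
  }
}
```

The big caveat, and the reason for my question, is that this reads
whatever files happen to be in the region directories at job-planning
time, with no coordination with compactions or splits.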

Thanks,
Tim

[1] https://gist.github.com/1120311
[2] http://pastebin.com/e5qeKgAd
