On 30.01.2012 09:53, Rohit Kelkar wrote:
Hi Stack,
My problem is that I have a large number of smaller objects and a few
larger objects. My strategy is to store the smaller objects (size < 5 MB)
in HBase and the larger objects (size > 5 MB) on HDFS, and I also want to
run MapReduce tasks on those objects. Loan suggested that I put
all objects in a MapFile/SequenceFile on HDFS and insert into HBase
a reference to the object stored in the file. Now if I run a
MapReduce task, my mappers would run local to the object
references and not to the actual DFS blocks where the objects reside.
- Rohit Kelkar
Hi Rohit,
First, my name is Ioan (with an i). Second, it's a tricky question. If you
run MapReduce with input from HBase you will have data locality for
the HBase data but not for the data in your SequenceFiles. You could get
data locality for those if you perform a pre-setup job that scans HBase
and builds a list of files to process, and then run another MR job on
Hadoop targeting the SequenceFiles. I think you can find ways to make
the pre-processing step fast.
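To make the two-step idea concrete, here is a rough, untested Java sketch
(not from the original thread). It assumes the references live in an
"objects" table as "path#key" strings in a d:ref column; the table, column
and path names are made up for the example. The first part scans HBase and
collects the distinct SequenceFile paths, the second runs a plain MR job
over those files with SequenceFileInputFormat, so the mappers are scheduled
next to the DFS blocks that hold the objects:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class LargeObjectJob {

    public static class LargeObjectMapper
            extends Mapper<Text, BytesWritable, Text, Text> {
        @Override
        protected void map(Text key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            // process one large object here
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Pre-setup step: scan the reference column in HBase and collect
        // the distinct SequenceFile paths that hold the large objects.
        Set<Path> inputs = new HashSet<Path>();
        HTable table = new HTable(conf, "objects"); // hypothetical table name
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("ref"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            byte[] ref = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("ref"));
            if (ref != null) {
                // reference is "path#key"; only the path part matters here
                inputs.add(new Path(Bytes.toString(ref).split("#")[0]));
            }
        }
        scanner.close();
        table.close();

        // Second step: a plain MR job over the SequenceFiles themselves,
        // so map tasks run next to the DFS blocks of the objects.
        Job job = new Job(conf, "process-large-objects");
        job.setJarByClass(LargeObjectJob.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        for (Path p : inputs) {
            FileInputFormat.addInputPath(job, p);
        }
        job.setMapperClass(LargeObjectMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The locality comes from the second job: SequenceFileInputFormat reports the
block locations of each split, so the scheduler can place map tasks on the
datanodes holding the data, which is exactly what you don't get when the
input is the HBase reference table.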
The set-up I described is better suited for situations where you need to
stream data that is larger than HBase is recommended to handle, like
mailboxes with large attachments. I'm planning to implement it soon in
Apache James's HBase mailbox implementation to deal with large inboxes.
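For the write side, a minimal sketch of the split Rohit described could look
like the following (again with made-up table, column and path names, and
with the obvious simplification that a real implementation would batch many
large objects into each SequenceFile instead of opening a writer per object):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ObjectWriter {
    private static final int THRESHOLD = 5 * 1024 * 1024; // 5 MB cut-off

    public static void store(String id, byte[] data) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "objects"); // hypothetical table name
        Put put = new Put(Bytes.toBytes(id));
        if (data.length < THRESHOLD) {
            // small object: keep the bytes inline in HBase
            put.add(Bytes.toBytes("d"), Bytes.toBytes("data"), data);
        } else {
            // large object: append it to a SequenceFile on HDFS and keep
            // only a "path#key" reference in HBase
            Path seqPath = new Path("/objects/blobs.seq"); // hypothetical path
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, seqPath, Text.class, BytesWritable.class);
            try {
                writer.append(new Text(id), new BytesWritable(data));
            } finally {
                writer.close();
            }
            put.add(Bytes.toBytes("d"), Bytes.toBytes("ref"),
                    Bytes.toBytes(seqPath.toString() + "#" + id));
        }
        table.put(put);
        table.close();
    }
}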
Cheers,
--
Ioan Eugen Stan
http://ieugen.blogspot.com