On 30.01.2012 09:53, Rohit Kelkar wrote:
Hi Stack,
My problem is that I have a large number of smaller objects and a few
larger objects. My strategy is to store the smaller objects (size < 5 MB)
in HBase and the larger objects (size > 5 MB) on HDFS, and I also want to
run MapReduce tasks on those objects. Loan suggested that I put
all objects in a MapFile/SequenceFile on HDFS and insert into HBase
a reference to the object stored in the file. Now if I run a
MapReduce task, my mappers would run local to the object
references and not to the actual DFS blocks where the objects reside.
- Rohit Kelkar
Hi Rohit,
First, my name is Ioan (with an i). Second, it's a tricky question. If you
run MapReduce with input from HBase you will have data locality for
the HBase data but not for the data in your SequenceFiles. You could get
data locality for those if you perform a pre-setup job that scans HBase
and builds a list of files to process, and then run another MR job on
Hadoop targeting the SequenceFiles. I think you can find ways to make
the pre-processing step fast.
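To make the two-step idea concrete, here is a rough, untested Java sketch
(not from the original thread). It assumes the references live in an
"objects" table as "path#key" strings in a d:ref column; the table, column
and path names are made up for the example. The first part scans HBase and
collects the distinct SequenceFile paths, the second runs a plain MR job
over those files with SequenceFileInputFormat, so the mappers are scheduled
next to the DFS blocks that hold the objects:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class LargeObjectJob {

    public static class LargeObjectMapper
            extends Mapper<Text, BytesWritable, Text, Text> {
        @Override
        protected void map(Text key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            // process one large object here
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Pre-setup step: scan the reference column in HBase and collect
        // the distinct SequenceFile paths that hold the large objects.
        Set<Path> inputs = new HashSet<Path>();
        HTable table = new HTable(conf, "objects"); // hypothetical table name
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("ref"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            byte[] ref = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("ref"));
            if (ref != null) {
                // reference is "path#key"; only the path part matters here
                inputs.add(new Path(Bytes.toString(ref).split("#")[0]));
            }
        }
        scanner.close();
        table.close();

        // Second step: a plain MR job over the SequenceFiles themselves,
        // so map tasks run next to the DFS blocks of the objects.
        Job job = new Job(conf, "process-large-objects");
        job.setJarByClass(LargeObjectJob.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        for (Path p : inputs) {
            FileInputFormat.addInputPath(job, p);
        }
        job.setMapperClass(LargeObjectMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The locality comes from the second job: SequenceFileInputFormat reports the
block locations of each split, so the scheduler can place map tasks on the
datanodes holding the data, which is exactly what you don't get when the
input is the HBase reference table.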
The set-up I described is better suited for situations where you need to
stream data that is larger than HBase is recommended to handle, like
mailboxes with large attachments. I'm planning to implement it soon in
Apache James's HBase mailbox implementation to deal with large inboxes.
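For the write side, a minimal sketch of the split Rohit described could look
like the following (again with made-up table, column and path names, and
with the obvious simplification that a real implementation would batch many
large objects into each SequenceFile instead of opening a writer per object):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ObjectWriter {
    private static final int THRESHOLD = 5 * 1024 * 1024; // 5 MB cut-off

    public static void store(String id, byte[] data) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "objects"); // hypothetical table name
        Put put = new Put(Bytes.toBytes(id));
        if (data.length < THRESHOLD) {
            // small object: keep the bytes inline in HBase
            put.add(Bytes.toBytes("d"), Bytes.toBytes("data"), data);
        } else {
            // large object: append it to a SequenceFile on HDFS and keep
            // only a "path#key" reference in HBase
            Path seqPath = new Path("/objects/blobs.seq"); // hypothetical path
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, seqPath, Text.class, BytesWritable.class);
            try {
                writer.append(new Text(id), new BytesWritable(data));
            } finally {
                writer.close();
            }
            put.add(Bytes.toBytes("d"), Bytes.toBytes("ref"),
                    Bytes.toBytes(seqPath.toString() + "#" + id));
        }
        table.put(put);
        table.close();
    }
}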
Cheers,
--
Ioan Eugen Stan
http://ieugen.blogspot.com