Ioan, sorry for messing up your name. Your strategy sounds interesting. I will try it out and post the results/problems if and when ...
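For my own notes, here is roughly what I am planning to try. The table name "objects", the column family "cf", the qualifiers and the file path below are just placeholders I made up, using the plain HTable/SequenceFile client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ObjectStore {

    private static final long SMALL_OBJECT_LIMIT = 5 * 1024 * 1024; // 5 MB threshold

    /**
     * Stores one object: small objects go inline into HBase, large objects
     * are appended to a SequenceFile on HDFS and only the file reference
     * is written to HBase.
     */
    public static void store(Configuration conf, String objectId, byte[] data,
                             SequenceFile.Writer largeObjectWriter,
                             Path largeObjectFile) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(conf), "objects");
        try {
            Put put = new Put(Bytes.toBytes(objectId));
            if (data.length < SMALL_OBJECT_LIMIT) {
                // small object: keep the bytes directly in HBase
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("data"), data);
            } else {
                // large object: append to the SequenceFile, store only a reference
                largeObjectWriter.append(new Text(objectId), new BytesWritable(data));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("ref"),
                        Bytes.toBytes(largeObjectFile.toString()));
            }
            table.put(put);
        } finally {
            table.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path seqPath = new Path("/data/large-objects/part-00000");
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, seqPath, Text.class, BytesWritable.class);
        try {
            store(conf, "object-0001", new byte[6 * 1024 * 1024], writer, seqPath);
        } finally {
            writer.close();
        }
    }
}

The idea being that small objects keep their bytes inline in HBase, while large objects only leave behind a reference that a later job can follow to the SequenceFile.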
- Rohit Kelkar

On Mon, Jan 30, 2012 at 1:41 PM, Ioan Eugen Stan <[email protected]> wrote:
> On 30.01.2012 09:53, Rohit Kelkar wrote:
>
>> Hi Stack,
>> My problem is that I have a large number of smaller objects and a few
>> larger objects. My strategy is to store the smaller objects (size < 5MB)
>> in hbase and the larger objects (size > 5MB) on hdfs. I also want to
>> run MapReduce tasks on those objects. Loan suggested that I should put
>> all objects in a MapFile/SequenceFile on hdfs and insert into hbase
>> the reference to the object stored in the file. Now if I run a
>> mapreduce task, my mappers would run local to the object
>> references and not to the actual dfs block where the object resides.
>>
>> - Rohit Kelkar
>
>
> Hi Rohit,
>
> First, my name is Ioan (with an i). Second, it's a tricky question. If you run
> MapReduce with input from HBase you will have data locality for the HBase data
> but not for the data in your SequenceFiles. You could get data locality
> for those if you run a pre-setup job that scans HBase and builds a list
> of files to process, and then run another MR job on Hadoop targeting the
> SequenceFiles. I think you can find ways to make the pre-processing step
> fast.
>
> The set-up that I described is more suitable for situations where you need to
> stream data that is larger than what HBase is recommended to handle, such as
> mailboxes with large attachments. I'm planning to implement it soon in
> Apache James's HBase mailbox implementation to deal with large inboxes.
>
> Cheers,
>
>
> --
> Ioan Eugen Stan
> http://ieugen.blogspot.com
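P.S. For the pre-setup job you described, I am thinking of something along these lines (same placeholder table and column names as in the sketch above):

import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class LargeObjectFileCollector {

    /**
     * Pre-setup step: scan the 'ref' column in HBase to collect the distinct
     * SequenceFiles that hold large objects, so a second MapReduce job can
     * read them directly from HDFS with normal block locality.
     */
    public static Set<String> collectSequenceFiles() throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "objects");
        Set<String> files = new HashSet<String>();
        try {
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("ref"));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result result : scanner) {
                    byte[] ref = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("ref"));
                    if (ref != null) {
                        files.add(Bytes.toString(ref));
                    }
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
        return files;
    }
}

Each path returned by collectSequenceFiles() could then be added to the second job with FileInputFormat.addInputPath(), so the mappers run against the SequenceFile blocks directly and get the usual HDFS locality.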
