Ioan, Sorry for messing up your name. Your strategy sounds
interesting. I will try that out and post the results/problems if and
when ...

- Rohit Kelkar

On Mon, Jan 30, 2012 at 1:41 PM, Ioan Eugen Stan <[email protected]> wrote:
> On 30.01.2012 09:53, Rohit Kelkar wrote:
>
>> Hi Stack,
>> My problem is that I have a large number of smaller objects and a few
>> larger objects. My strategy is to store smaller objects (size < 5MB)
>> in HBase and larger objects (size > 5MB) on HDFS. I also want to
>> run MapReduce tasks on those objects. Loan suggested that I should put
>> all objects in a MapFile/SequenceFile on HDFS and insert into HBase
>> a reference to the object stored in the file. Now if I run a
>> MapReduce task, my mapper would run local to the object
>> references and not to the actual DFS block where the object resides.
>>
>> - Rohit Kelkar
>
>
> Hi Rohit,
>
> First, my name is Ioan (with an i). Second, it's a tricky question. If you run
> MapReduce with input from HBase you will have data locality for the HBase data
> and not for the data in your SequenceFiles. You could get data locality
> from those if you perform a pre-setup job that scans HBase and builds a list
> of files to process and then runs another MR job on Hadoop targeting the
> SequenceFile. I think you can find ways to optimize the pre-process step to
> be fast.
>
> The set-up that I described is more suitable for situations when you need to
> stream data that is larger than what HBase is recommended to handle, like
> mailboxes with large attachments. I'm planning to implement it soon in
> Apache James's HBase mailbox implementation to deal with large inboxes.
>
> Cheers,
>
>
> --
> Ioan Eugen Stan
> http://ieugen.blogspot.com
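
The two-step approach Ioan describes (a pre-setup pass over the HBase
references, then a second job that reads the SequenceFiles directly) can be
sketched roughly as below. This is only an illustration in plain Python with
no Hadoop or HBase calls: the dict `hbase_rows`, the file paths, and the
function names are all hypothetical stand-ins. In a real cluster, step 1
would presumably be a table scan and step 2 a MapReduce job whose input
paths are the files collected here.

```python
from collections import defaultdict

# Hypothetical stand-in for the HBase table: row key -> reference to the
# object on HDFS, as (container file path, record key inside that file).
hbase_rows = {
    "obj-001": ("/data/objects/part-00000.seq", "obj-001"),
    "obj-002": ("/data/objects/part-00000.seq", "obj-002"),
    "obj-003": ("/data/objects/part-00001.seq", "obj-003"),
}


def build_input_manifest(rows):
    """Step 1 (pre-setup): scan the references and group record keys by file."""
    manifest = defaultdict(list)
    for row_key in sorted(rows):
        path, record_key = rows[row_key]
        manifest[path].append(record_key)
    return dict(manifest)


def plan_second_job(manifest):
    """Step 2: the container files the second job should take as input."""
    return sorted(manifest)


manifest = build_input_manifest(hbase_rows)
input_paths = plan_second_job(manifest)
```

Because the second job takes the container files themselves as input, its
map tasks can be scheduled against the blocks that hold the objects instead
of against the HBase regions that hold only the references.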
