Build and store the tree in some sort of globally accessible space? Like HBase, or HDFS?
On Oct 13, 2012, at 9:46 AM, Kyle Moses <[email protected]> wrote:

> Chris,
> Thanks for the suggestion on serializing the radix tree and your thoughts on
> the memory issue. I'm planning to test a few different solutions and will
> post another reply if the results prove interesting.
>
> Kyle
>
> On 10/11/2012 1:52 PM, Chris Nauroth wrote:
>> Hello Kyle,
>>
>> Regarding the setup time of the radix tree, is it possible to precompute the
>> radix tree before job submission time, then create a serialized
>> representation (perhaps just Java object serialization), and send the
>> serialized form through the distributed cache? Then, each reducer would just
>> need to deserialize during setup() instead of recomputing the full radix
>> tree for every reducer task. That might save time.
>>
>> Regarding the memory consumption, when I've run into a situation like this,
>> I've generally solved it by caching the data in a separate process and using
>> some kind of IPC from the reducers to access it. memcached is one example,
>> though that's probably not an ideal fit for this data structure. I'm aware
>> of no equivalent solution directly in Hadoop and would be curious to hear
>> from others on the topic.
>>
>> Thanks,
>> --Chris
>>
>> On Thu, Oct 11, 2012 at 10:12 AM, Kyle Moses <[email protected]> wrote:
>> Problem Background:
>> I have a Hadoop MapReduce program that uses an IPv6 radix tree to provide
>> auxiliary input during the reduce phase of the second job in its workflow,
>> but doesn't need the data at any other point.
>> It seems pretty straightforward to use the distributed cache to build this
>> data structure inside each reducer in the setup() method.
>> This solution is functional, but it ends up using a large amount of memory if I
>> have 3 or more reducers running on the same node, and the setup time of the
>> radix tree is non-trivial.
>> Additionally, the IPv6 version of the structure is quite a bit larger in
>> memory.
>>
>> Question:
>> Is there a "good" way to share this data structure across all reducers on
>> the same node within the Hadoop framework?
>>
>> Initial Thoughts:
>> It seems like this might be possible by altering the Task JVM Reuse
>> parameters, but from what I have read this would also affect map tasks and
>> I'm concerned about drawbacks/side-effects.
>>
>> Thanks for your help!
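
A minimal sketch of the serialize-and-distribute approach Chris describes, assuming the MRv1-era org.apache.hadoop.filecache.DistributedCache API and plain Java object serialization; RadixTree, the HDFS path, and the class names here are placeholders, and the real tree class would need to implement Serializable:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RadixTreeCacheSketch {

    // Stand-in for the real structure; it must implement Serializable.
    public static class RadixTree implements java.io.Serializable {}

    // Driver side: build the tree once before job submission, write it to
    // HDFS with Java serialization, and register the file with the
    // distributed cache. The path is an assumption.
    public static void cacheTree(Configuration conf, RadixTree tree) throws IOException {
        Path cached = new Path("/tmp/radix-tree.ser");
        FileSystem fs = FileSystem.get(conf);
        ObjectOutputStream out = new ObjectOutputStream(fs.create(cached));
        try {
            out.writeObject(tree);
        } finally {
            out.close();
        }
        DistributedCache.addCacheFile(cached.toUri(), conf);
    }

    // Reducer side: deserialize the localized copy once per task in setup()
    // instead of rebuilding the tree from its raw inputs.
    public static class LookupReducer extends Reducer<Text, Text, Text, Text> {
        private RadixTree tree;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Path[] local = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(local[0].toString()));
            try {
                tree = (RadixTree) in.readObject();
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            } finally {
                in.close();
            }
        }
    }
}

Note that the deserialization in setup() still runs once per reduce task, so this saves the rebuild time but not the per-task memory footprint.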

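On the JVM-reuse idea from the original question: a sketch of how a static field could keep the tree alive across reduce tasks that land in the same reused child JVM, assuming MRv1's mapred.job.reuse.jvm.num.tasks property and the same hypothetical RadixTree class:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReusableTreeReducer extends Reducer<Text, Text, Text, Text> {

    // Stand-in for the real structure.
    public static class RadixTree implements java.io.Serializable {}

    // Held in a static field so that later reduce tasks scheduled into the
    // same reused child JVM find it already built and skip the expensive setup.
    private static volatile RadixTree sharedTree;

    // Driver side (MRv1): -1 allows unlimited task reuse of each child JVM.
    // The setting applies to map tasks as well as reduce tasks.
    public static void enableJvmReuse(Configuration conf) {
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        if (sharedTree == null) {
            synchronized (ReusableTreeReducer.class) {
                if (sharedTree == null) {
                    // Hypothetical builder: e.g. deserialize the tree from the
                    // distributed cache, as in the earlier sketch.
                    sharedTree = new RadixTree();
                }
            }
        }
    }
}

Two caveats: concurrently running reducers on a node still sit in separate child JVMs, so this trims repeated setup time for sequential tasks but does not lower peak memory when 3 or more reducers run at once; and, as noted in the original question, the reuse setting also applies to map tasks. For a true single shared copy per node, an out-of-process store such as memcached, or the HBase/HDFS-backed lookup suggested above, seems closer to the mark.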