Re: Distributed Cache For 100MB+ Data Structure

Kyle Moses Sat, 13 Oct 2012 07:47:38 -0700

Chris,

Thanks for the suggestion on serializing the radix tree and your thoughts on the memory issue. I'm planning to test a few different solutions and will post another reply if the results prove interesting.


Kyle

On 10/11/2012 1:52 PM, Chris Nauroth wrote:

Hello Kyle,
Regarding the setup time of the radix tree, is it possible to precompute the radix tree before job submission time, then create a serialized representation (perhaps just Java object serialization), and send the serialized form through distributed cache? Then, each reducer would just need to deserialize during setup() instead of recomputing the full radix tree for every reducer task. That might save time.
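A minimal sketch of the serialize-once, deserialize-per-task idea, using plain Java object serialization. PrefixTrie is a hypothetical stand-in for the real radix tree (a small binary trie keyed by bit strings), and the class and method names are assumptions made for illustration, not code from the actual job. The file is written once before job submission; each task would then call readFrom() on its local copy of the file.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for the radix tree: a binary trie keyed by bit strings.
class PrefixTrie implements Serializable {
    private static final long serialVersionUID = 1L;

    private static final class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        Node zero;
        Node one;
        String value; // null when no prefix ends at this node
    }

    private final Node root = new Node();

    // Store a value under a prefix given as a string of '0' and '1' characters.
    void put(String bits, String value) {
        Node n = root;
        for (char c : bits.toCharArray()) {
            if (c == '0') {
                if (n.zero == null) n.zero = new Node();
                n = n.zero;
            } else {
                if (n.one == null) n.one = new Node();
                n = n.one;
            }
        }
        n.value = value;
    }

    // Return the value of the longest stored prefix of the given bits, or null.
    String longestMatch(String bits) {
        Node n = root;
        String best = n.value;
        for (char c : bits.toCharArray()) {
            n = (c == '0') ? n.zero : n.one;
            if (n == null) break;
            if (n.value != null) best = n.value;
        }
        return best;
    }
}

class TrieSerializationSketch {
    // Run once, before job submission: write the prebuilt trie to a file.
    static void writeTo(PrefixTrie trie, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(trie);
        }
    }

    // Run in each task's setup: read the trie back instead of rebuilding it.
    static PrefixTrie readFrom(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (PrefixTrie) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        PrefixTrie trie = new PrefixTrie();
        trie.put("10", "net-A");
        trie.put("1011", "net-B");

        File file = File.createTempFile("trie", ".ser");
        file.deleteOnExit();
        writeTo(trie, file);

        PrefixTrie copy = readFrom(file);
        System.out.println(copy.longestMatch("101101")); // prints "net-B"
    }
}
```

Whether this actually beats rebuilding the tree depends on the data: default Java serialization of a node-per-object structure can be slow and bulky, so it is worth timing against the current setup() before committing to it.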
Regarding the memory consumption, when I've run into a situation like this, I've generally solved it by caching the data in a separate process and using some kind of IPC from the reducers to access it. memcache is one example, though that's probably not an ideal fit for this data structure. I'm aware of no equivalent solution directly in Hadoop and would be curious to hear from others on the topic.
Thanks,
--Chris
On Thu, Oct 11, 2012 at 10:12 AM, Kyle Moses wrote:
    Problem Background:
    I have a Hadoop MapReduce program that uses an IPv6 radix tree to
    provide auxiliary input during the reduce phase of the second job
    in its workflow, but doesn't need the data at any other point.
    It seems pretty straightforward to use the distributed cache to
    build this data structure inside each reducer in the setup() method.
    This solution is functional, but ends up using a large amount of
    memory if I have 3 or more reducers running on the same node, and
    the setup time of the radix tree is non-trivial.
    Additionally, the IPv6 version of the structure is quite a bit
    larger in memory.

    Question:
    Is there a "good" way to share this data structure across all
    reducers on the same node within the Hadoop framework?

    Initial Thoughts:
    It seems like this might be possible by altering the Task JVM
    Reuse parameters, but from what I have read this would also affect
    map tasks and I'm concerned about drawbacks/side-effects.

    Thanks for your help!
