Build and store the tree in some sort of globally accessible space? Like HBase, or HDFS?
On Oct 13, 2012, at 9:46 AM, Kyle Moses <[email protected]> wrote:

> Chris,
> Thanks for the suggestion on serializing the radix tree and your thoughts on
> the memory issue. I'm planning to test a few different solutions and will
> post another reply if the results prove interesting.
>
> Kyle
>
> On 10/11/2012 1:52 PM, Chris Nauroth wrote:
>> Hello Kyle,
>>
>> Regarding the setup time of the radix tree, is it possible to precompute the
>> radix tree before job submission time, then create a serialized
>> representation (perhaps just Java object serialization), and send the
>> serialized form through the distributed cache? Then, each reducer would just
>> need to deserialize during setup() instead of recomputing the full radix
>> tree for every reducer task. That might save time.
>>
>> Regarding the memory consumption, when I've run into a situation like this,
>> I've generally solved it by caching the data in a separate process and using
>> some kind of IPC from the reducers to access it. memcached is one example,
>> though that's probably not an ideal fit for this data structure. I'm aware
>> of no equivalent solution directly in Hadoop and would be curious to hear
>> from others on the topic.
>>
>> Thanks,
>> --Chris
>>
>> On Thu, Oct 11, 2012 at 10:12 AM, Kyle Moses <[email protected]> wrote:
>> Problem Background:
>> I have a Hadoop MapReduce program that uses an IPv6 radix tree to provide
>> auxiliary input during the reduce phase of the second job in its workflow,
>> but doesn't need the data at any other point.
>> It seems pretty straightforward to use the distributed cache to build this
>> data structure inside each reducer in the setup() method.
>> This solution is functional, but it ends up using a large amount of memory if I
>> have 3 or more reducers running on the same node, and the setup time of the
>> radix tree is non-trivial.
>> Additionally, the IPv6 version of the structure is quite a bit larger in
>> memory.
>>
>> Question:
>> Is there a "good" way to share this data structure across all reducers on
>> the same node within the Hadoop framework?
>>
>> Initial Thoughts:
>> It seems like this might be possible by altering the Task JVM Reuse
>> parameters, but from what I have read this would also affect map tasks and
>> I'm concerned about drawbacks/side-effects.
>>
>> Thanks for your help!
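
A minimal sketch of the serialize-and-distribute approach Chris describes, assuming the MRv1-era org.apache.hadoop.filecache.DistributedCache API and plain Java object serialization; RadixTree, the HDFS path, and the class names here are placeholders, and the real tree class would need to implement Serializable:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RadixTreeCacheSketch {

    // Stand-in for the real structure; it must implement Serializable.
    public static class RadixTree implements java.io.Serializable {}

    // Driver side: build the tree once before job submission, write it to
    // HDFS with Java serialization, and register the file with the
    // distributed cache. The path is an assumption.
    public static void cacheTree(Configuration conf, RadixTree tree) throws IOException {
        Path cached = new Path("/tmp/radix-tree.ser");
        FileSystem fs = FileSystem.get(conf);
        ObjectOutputStream out = new ObjectOutputStream(fs.create(cached));
        try {
            out.writeObject(tree);
        } finally {
            out.close();
        }
        DistributedCache.addCacheFile(cached.toUri(), conf);
    }

    // Reducer side: deserialize the localized copy once per task in setup()
    // instead of rebuilding the tree from its raw inputs.
    public static class LookupReducer extends Reducer<Text, Text, Text, Text> {
        private RadixTree tree;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Path[] local = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(local[0].toString()));
            try {
                tree = (RadixTree) in.readObject();
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            } finally {
                in.close();
            }
        }
    }
}

Note that the deserialization in setup() still runs once per reduce task, so this saves the rebuild time but not the per-task memory footprint.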

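On the JVM-reuse idea from the original question: a sketch of how a static field could keep the tree alive across reduce tasks that land in the same reused child JVM, assuming MRv1's mapred.job.reuse.jvm.num.tasks property and the same hypothetical RadixTree class:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReusableTreeReducer extends Reducer<Text, Text, Text, Text> {

    // Stand-in for the real structure.
    public static class RadixTree implements java.io.Serializable {}

    // Held in a static field so that later reduce tasks scheduled into the
    // same reused child JVM find it already built and skip the expensive setup.
    private static volatile RadixTree sharedTree;

    // Driver side (MRv1): -1 allows unlimited task reuse of each child JVM.
    // The setting applies to map tasks as well as reduce tasks.
    public static void enableJvmReuse(Configuration conf) {
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        if (sharedTree == null) {
            synchronized (ReusableTreeReducer.class) {
                if (sharedTree == null) {
                    // Hypothetical builder: e.g. deserialize the tree from the
                    // distributed cache, as in the earlier sketch.
                    sharedTree = new RadixTree();
                }
            }
        }
    }
}

Two caveats: concurrently running reducers on a node still sit in separate child JVMs, so this trims repeated setup time for sequential tasks but does not lower peak memory when 3 or more reducers run at once; and, as noted in the original question, the reuse setting also applies to map tasks. For a true single shared copy per node, an out-of-process store such as memcached, or the HBase/HDFS-backed lookup suggested above, seems closer to the mark.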