I am researching a Hadoop solution for an existing application that requires a
directory structure full of data for processing.
To make the Hadoop solution work, I need to deploy the data directory to each
DataNode (DN) when the job is executed. I know this isn't new and is commonly
done with the Distributed Cache.
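For context, here is roughly what I have in mind for job setup, using the old-API
DistributedCache (Hadoop 1.x). The HDFS path and archive name are placeholders I
made up; packaging the directory tree as one archive seems preferable to adding
files one by one, since the cache unpacks archives on each node:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheJobSetup {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CacheJobSetup.class);
            // Package the whole directory tree as one archive in HDFS;
            // the cache unpacks it on every node that runs a task.
            // "/apps/myapp/data.tgz" is a placeholder path.
            DistributedCache.addCacheArchive(
                new URI("/apps/myapp/data.tgz#data"), conf);
            // With symlinks enabled, tasks see the unpacked tree as ./data
            DistributedCache.createSymlink(conf);
            // ... set mapper, input/output paths, then JobClient.runJob(conf)
        }
    }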
Based on experience, what are the common file sizes deployed in a Distributed
Cache? I know smaller is better, but how big is too big? I have read that the
larger the cache deployed, the longer the job startup latency, and I assume
other factors play into this as well.
What I know so far:
- Default local.cache.size = 10 GB
- Desirable size range for Distributed Cache payloads = 10 KB - 1 GB??
- The Distributed Cache is normally not used if the payload is larger than ____?
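If it matters, my understanding is that local.cache.size is a TaskTracker-side
setting that caps the total size of the cache directory on each node (not a
per-file limit), so raising it would mean editing mapred-site.xml on each TT:

    <!-- mapred-site.xml on each TaskTracker; value is in bytes -->
    <property>
      <name>local.cache.size</name>
      <value>10737418240</value> <!-- 10 GB, the default -->
    </property>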
Another option: put the data directories on each DN ahead of time and provide
the location to the TaskTracker, roughly as sketched below?
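In that case I imagine each task would read the pre-deployed directory straight
off the local filesystem, with the location passed through the job
configuration. The property name "myapp.data.dir" and the path are made-up
examples, assuming an admin has synced the same directory to every node:

    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LocalDataMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private File dataDir;

        @Override
        public void configure(JobConf job) {
            // "myapp.data.dir" is a hypothetical property; the same
            // directory is assumed to exist at this path on every node.
            dataDir = new File(job.get("myapp.data.dir", "/opt/myapp/data"));
            if (!dataDir.isDirectory()) {
                throw new RuntimeException(
                    "Data dir missing on this node: " + dataDir);
            }
        }

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // ... look up whatever each record needs under dataDir ...
        }
    }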
thanks,
John