Hmm, maybe I'm missing something, but (@Bjorn) why would you use HDFS as a 
replacement for the distributed cache?

After all, the distributed cache is just a file replicated across the whole 
cluster, outside of HDFS. Can't you just make the cache size big and store 
the file there?

What advantage does HDFS give for distributing the file to all the nodes?
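
For the record, here is roughly the distributed-cache path I mean: a minimal 
sketch against the old org.apache.hadoop.filecache.DistributedCache API, where 
the path /cache/lookup.dat and the mapper are just placeholders.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistCacheSketch {

  // Driver side: register an HDFS file with the distributed cache. The
  // framework copies it onto each node's local disk before any task of
  // the job runs, so every mapper reads a node-local copy.
  public static void registerCacheFile(Configuration conf) throws Exception {
    DistributedCache.addCacheFile(new URI("/cache/lookup.dat"), conf);
  }

  public static class LookupMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException {
      // Task side: the cached file is already on the local filesystem of
      // whichever node this task landed on.
      Path[] local =
          DistributedCache.getLocalCacheFiles(context.getConfiguration());
      // local[0] is the node-local copy of lookup.dat
    }
  }
}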

On Apr 9, 2013, at 6:49 AM, Bjorn Jonsson <bjorn...@gmail.com> wrote:

> Put it once on HDFS with a replication factor equal to the number of DNs. 
> There is no startup latency on job submission and no max size, and you can 
> access it from anywhere with fs since it sticks around until you replace it. 
> Just a thought.
> 
> On Apr 8, 2013 9:59 PM, "John Meza" <j_meza...@hotmail.com> wrote:
>> I am researching a Hadoop solution for an existing application that requires 
>> a directory structure full of data for processing.
>> 
>> To make the Hadoop solution work I need to deploy the data directory to 
>> each DN when the job is executed.
>> I know this isn't new and is commonly done with the Distributed Cache.
>> 
>> Based on experience, what are the common file sizes deployed in a 
>> Distributed Cache?
>> I know smaller is better, but how big is too big? I have read that the 
>> larger the cache deployed, the more startup latency there will be. I also 
>> assume there are other factors that play into this.
>> 
>> I know that the default local.cache.size is 10 GB.
>> 
>> - Desirable size range for a Distributed Cache: 10 KB to 1 GB?
>> - The Distributed Cache is normally not used if larger than ____?
>> 
>> Another option: put the data directories on each DN and provide the 
>> location to the TaskTracker?
>> 
>> thanks
>> John
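
If I follow Bjorn's suggestion correctly, it would look something like the 
sketch below: load the file once with a high replication factor, then let 
every task open it straight from HDFS. The paths and the replication count 
of 20 are only placeholders for the real number of DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSideCache {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Load the data once and raise its replication factor so nearly every
    // DN holds a local replica. 20 stands in for the real DN count.
    Path data = new Path("/cache/lookup.dat");
    fs.copyFromLocalFile(new Path("/tmp/lookup.dat"), data);
    fs.setReplication(data, (short) 20);

    // Any task on any node can then read it directly from HDFS: no per-job
    // copying, no local.cache.size limit, and the file stays put until it
    // is explicitly replaced.
    FSDataInputStream in = fs.open(data);
    try {
      // ... read the data ...
    } finally {
      in.close();
    }
  }
}

But I still don't quite see what that buys over just bumping up the cache 
size, hence my question above.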
