Subject: Re: Distributed cache: how big is too big?
From: bjorn...@gmail.com
To: user@hadoop.apache.org
I think the correct question is why would you use distributed cache for a
large file that is read during map/reduce instead of plain hdfs? It does
not sound wise to shuffle GB of data onto all nodes on each job submission
and then just remove it when the job is done. I would think about picking
ano
Hmmm... maybe I'm missing something, but (@bjorn) why would you use hdfs as a
replacement for the distributed cache?
After all, the distributed cache is just a file replicated over the whole
cluster, which isn't in hdfs. Can't you just make the cache size big and
store the file there?
"a replication factor equal to the number of DN"Hmmm... I'm not sure I
understand: there are 8 DN in mytest cluster.
Date: Tue, 9 Apr 2013 04:49:17 -0700
Subject: Re: Distributed cache: how big is too big?
From: bjorn...@gmail.com
To: user@hadoop.apache.org
Put it once on hdfs with a replication factor equal to the number of DN. No
startup latency on job submission, no max size, and you can access it from
anywhere with fs since it sticks around until you replace it. Just a thought.
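The savings being suggested here can be sketched with some back-of-the-envelope
arithmetic (all sizes and job counts below are hypothetical, just to illustrate
the trade-off):

```python
# Back-of-the-envelope comparison (hypothetical numbers): re-shipping a
# side-data file via the distributed cache on every job submission vs.
# storing it once in HDFS with replication == number of DataNodes.

GB = 1024 ** 3
file_size = 2 * GB      # hypothetical side-data file
num_datanodes = 8       # as in the test cluster mentioned in this thread
num_jobs = 50           # hypothetical number of job submissions

# Distributed cache: the file is localized to every node for each job,
# then cleaned up when the job finishes.
cache_bytes_moved = file_size * num_datanodes * num_jobs

# HDFS with replication factor == number of DNs: the file is written
# (replicated) once, and every task can then read a local replica until
# the file is replaced.
hdfs_bytes_moved = file_size * num_datanodes

print(f"distributed cache: {cache_bytes_moved / GB:.0f} GB moved")
print(f"hdfs, put once:    {hdfs_bytes_moved / GB:.0f} GB moved")
```

With these illustrative numbers the distributed cache moves num_jobs times as
much data over the wire as the put-it-once approach; the gap grows with every
additional job submission that reuses the same file.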
On Apr 8, 2013 9:59 PM, "John Meza" wrote:
I am researching a Hadoop solution for an existing application that requires a
directory structure full of data for processing.
To make the Hadoop solution work I need to deploy the data directory to each DN
when the job is executed. I know this isn't new and is commonly done with a
Distributed Cache.