RE: Distributed cache: how big is too big?

2013-04-09 Thread John Meza
Subject: Re: Distributed cache: how big is too big? From: bjorn...@gmail.com To: user@hadoop.apache.org I think the correct question is why would you use distributed cache for a large file that is read during map/reduce instead of plain hdfs? It does not sound wise to shuffle GB of data onto all nodes on

Re: Distributed cache: how big is too big?

2013-04-09 Thread Bjorn Jonsson
I think the correct question is: why would you use the distributed cache for a large file that is read during map/reduce instead of plain HDFS? It does not sound wise to shuffle GB of data onto all nodes on each job submission and then just remove it when the job is done. I would think about picking ano

Re: Distributed cache: how big is too big?

2013-04-09 Thread Jay Vyas
Hmmm... maybe I'm missing something, but (@bjorn) why would you use HDFS as a replacement for the distributed cache? After all, the distributed cache is just a file with replication over the whole cluster, which isn't in HDFS. Can't you just make the cache size big and store the file there? Wh

RE: Distributed cache: how big is too big?

2013-04-09 Thread John Meza
"a replication factor equal to the number of DN"Hmmm... I'm not sure I understand: there are 8 DN in mytest cluster. Date: Tue, 9 Apr 2013 04:49:17 -0700 Subject: Re: Distributed cache: how big is too big? From: bjorn...@gmail.com To: user@hadoop.apache.org Put it once

Re: Distributed cache: how big is too big?

2013-04-09 Thread Bjorn Jonsson
Put it once on HDFS with a replication factor equal to the number of DN. No startup latency on job submission or max size, and access it from anywhere with fs, since it sticks around until you replace it? Just a thought. On Apr 8, 2013 9:59 PM, "John Meza" wrote: > I am researching a Hadoop soluti
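
To put rough numbers on the trade-off discussed above, here is a minimal back-of-the-envelope sketch. It takes Bjorn's description at face value, i.e. it assumes the distributed cache copies the file to every node on each job submission, and compares that with a one-time HDFS write at a replication factor equal to the number of DN. The 8 DN figure comes from the thread; the file size and job count are hypothetical placeholders, and the function names are made up for illustration.

```python
def cache_traffic_gb(file_gb, nodes, jobs):
    """Total GB copied if every job submission re-copies the file to every node.

    Assumption: this follows the per-job copy behaviour described in the thread.
    """
    return file_gb * nodes * jobs


def hdfs_replica_traffic_gb(file_gb, replication):
    """One-time GB written when the file is stored once with the given replication."""
    return file_gb * replication


DATANODES = 8   # from the thread: "there are 8 DN in my test cluster"
FILE_GB = 2.0   # hypothetical size of the data directory
JOBS = 10       # hypothetical number of job submissions

per_job_total = cache_traffic_gb(FILE_GB, DATANODES, JOBS)
one_time_total = hdfs_replica_traffic_gb(FILE_GB, DATANODES)

print(f"distributed cache, {JOBS} jobs: {per_job_total} GB copied")
print(f"HDFS, replication {DATANODES}: {one_time_total} GB copied once")
```

Under these assumptions the one-time HDFS cost equals the cost of a single job's cache distribution, so the gap grows linearly with the number of jobs.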

Distributed cache: how big is too big?

2013-04-08 Thread John Meza
I am researching a Hadoop solution for an existing application that requires a directory structure full of data for processing. To make the Hadoop solution work I need to deploy the data directory to each DN when the job is executed. I know this isn't new and commonly done with a Distributed Cach