subject:"ephemeral\-hdfs vs persistent\-hdfs \- performance"

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-04 Thread Kelvin Chu

Joe, I also use S3 and gzip. So far the I/O is not a problem. In my case, the operation is SQLContext.JsonFile() and I can see from Ganglia that the whole cluster is CPU bound (99% saturated). I have 160 cores and I can see the network can sustain about 150MBit/s. Kelvin On Wed, Feb 4, 2015 at 10

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-04 Thread Aaron Davidson

The latter would be faster. With S3, you want to maximize number of concurrent readers until you hit your network throughput limits. On Wed, Feb 4, 2015 at 6:20 AM, Peter Rudenko wrote: > Hi if i have a 10GB file on s3 and set 10 partitions, would it be > download whole file on master first and

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-04 Thread Peter Rudenko

Hi if i have a 10GB file on s3 and set 10 partitions, would it be download whole file on master first and broadcast it or each worker would just read it's range from the file? Thanks, Peter On 2015-02-03 23:30, Sven Krasser wrote: Hey Joe, With the ephemeral HDFS, you get the instance store

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Sven Krasser

Hey Joe, With the ephemeral HDFS, you get the instance store of your worker nodes. For m3.xlarge that will be two 40 GB SSDs local to each instance, which are very fast. For the persistent HDFS, you get whatever EBS volumes the launch script configured. EBS volumes are always network drives, so t

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch

Not all of our input files are zipped. The ones that are obviously are not parallelized - they're just processed by a single task. Not a big issue for us, though, as the those zipped files aren't too big. DR On 02/03/2015 01:08 PM, Joe Wass wrote: Thanks very much, that's good to know, I'll

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Ted Yu

Using s3a protocol (introduced in hadoop 2.6.0) would be faster compared to s3. The upcoming hadoop 2.7.0 contains some bug fixes for s3a. FYI On Tue, Feb 3, 2015 at 9:48 AM, David Rosenstrauch wrote: > We use S3 as a main storage for all our input data and our generated > (output) data. (10'

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass

Thanks very much, that's good to know, I'll certainly give it a look. Can you give me a hint about you unzip your input files on the fly? I thought that it wasn't possible to parallelize zipped inputs unless they were unzipped before passing to Spark? Joe On 3 February 2015 at 17:48, David Rosen

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch

We use S3 as a main storage for all our input data and our generated (output) data. (10's of terabytes of data daily.) We read gzipped data directly from S3 in our Hadoop/Spark jobs - it's not crazily slow, as long as you parallelize the work well by distributing the processing across enough

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass

The data is coming from S3 in the first place, and the results will be uploaded back there. But even in the same availability zone, fetching 170 GB (that's gzipped) is slow. From what I understand of the pipelines, multiple transforms on the same RDD might involve re-reading the input, which very q

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch

You could also just push the data to Amazon S3, which would un-link the size of the cluster needed to process the data from the size of the data. DR On 02/03/2015 11:43 AM, Joe Wass wrote: I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need to store the input in HDFS so

ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass

I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need to store the input in HDFS somehow. I currently have a cluster of 5 x m3.xlarge, each of which has 80GB disk. Each HDFS node reports 73 GB, and the total capacity is ~370 GB. If I want to process 800 GB of data (assuming

Re: ephemeral-hdfs vs persistent-hdfs - performance

Re: ephemeral-hdfs vs persistent-hdfs - performance

Re: ephemeral-hdfs vs persistent-hdfs - performance

Re: ephemeral-hdfs vs persistent-hdfs - performance

Re: ephemeral-hdfs vs persistent-hdfs - performance

Re: ephemeral-hdfs vs persistent-hdfs - performance

Re: ephemeral-hdfs vs persistent-hdfs - performance

Re: ephemeral-hdfs vs persistent-hdfs - performance

Re: ephemeral-hdfs vs persistent-hdfs - performance

Re: ephemeral-hdfs vs persistent-hdfs - performance

ephemeral-hdfs vs persistent-hdfs - performance

11 matches

Site Navigation

Mail list logo

Footer information