Joe, I also use S3 and gzip. So far, I/O has not been a problem. In my case
the operation is SQLContext.jsonFile(), and I can see from Ganglia that the
whole cluster is CPU bound (99% saturated). I have 160 cores, and the network
sustains about 150 Mbit/s.
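
For reference, the relevant part of my job is roughly the sketch below (the
bucket and path are placeholders). jsonFile infers the schema by scanning the
input, so every record gets decompressed and parsed on the executors, which
is where the CPU time goes:

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  // Gzipped JSON from S3 is decompressed and parsed on the executors.
  val events = sqlContext.jsonFile("s3n://my-bucket/events/*.json.gz")
  events.registerTempTable("events")
  sqlContext.sql("SELECT COUNT(*) FROM events").collect()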

Kelvin

On Wed, Feb 4, 2015 at 10:20 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> The latter would be faster. With S3, you want to maximize the number of
> concurrent readers until you hit your network throughput limits.
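>
> A rough sketch of what that looks like (the path and partition count are
> placeholders, and this assumes the s3n Hadoop connector):
>
>   // More input partitions means more concurrent S3 readers,
>   // up to the point where the network is saturated.
>   val lines = sc.textFile("s3n://my-bucket/input/", minPartitions = 400)
>   // If the source still yields too few splits, repartition after reading:
>   val spread = lines.repartition(400)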
>
> On Wed, Feb 4, 2015 at 6:20 AM, Peter Rudenko <petro.rude...@gmail.com>
> wrote:
>
>> Hi, if I have a 10 GB file on S3 and set 10 partitions, would the whole
>> file be downloaded on the master first and then broadcast, or would each
>> worker just read its own range of the file?
>>
>> Thanks,
>> Peter
>>
>> On 2015-02-03 23:30, Sven Krasser wrote:
>>
>>  Hey Joe,
>>
>> With the ephemeral HDFS, you get the instance store of your worker nodes.
>> For m3.xlarge that will be two 40 GB SSDs local to each instance, which are
>> very fast.
>>
>>  For the persistent HDFS, you get whatever EBS volumes the launch script
>> configured. EBS volumes are always network drives, so the usual limitations
>> apply. To optimize throughput, you can use EBS volumes with provisioned
>> IOPS and you can use EBS optimized instances. I don't have hard numbers at
>> hand, but I'd expect this to be noticeably slower than using local SSDs.
>>
>> As far as only using S3 goes, it depends on your use case (i.e. what you
>> plan on doing with the data while it is there). If you store it there in
>> between running different applications, you can likely work around
>> consistency issues.
>>
>> Also, if you use Amazon's EMRFS to access data in S3, you can use their
>> new consistency feature (
>> https://aws.amazon.com/blogs/aws/emr-consistent-file-system/).
>>
>> Hope this helps!
>> -Sven
>>
>>
>> On Tue, Feb 3, 2015 at 9:32 AM, Joe Wass <jw...@crossref.org> wrote:
>>
>>> The data is coming from S3 in the first place, and the results will be
>>> uploaded back there. But even within the same availability zone, fetching
>>> 170 GB (gzipped) is slow. From what I understand of the pipelines,
>>> multiple transforms on the same RDD might involve re-reading the input,
>>> which very quickly adds up compared to having the data locally. I could
>>> persist the data (which I am in fact doing), but that would mean storing
>>> approximately the same amount of data in HDFS, which wouldn't fit.
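>>>
>>> Roughly the pattern I mean (the path below is a placeholder):
>>>
>>>   import org.apache.spark.storage.StorageLevel
>>>   val input = sc.textFile("s3n://my-bucket/input/*.gz")
>>>   // Without persist(), each action below re-reads (and re-decompresses)
>>>   // the S3 input.
>>>   input.persist(StorageLevel.MEMORY_AND_DISK_SER)
>>>   val total = input.count()
>>>   val sample = input.take(10)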
>>>
>>> Also, I understood that S3 was unsuitable for practical use? See "Why you
>>> cannot use S3 as a replacement for HDFS" [0]. I'd love to be proved wrong,
>>> though; that would make things a lot easier.
>>>
>>>  [0] http://wiki.apache.org/hadoop/AmazonS3
>>>
>>>
>>>
>>> On 3 February 2015 at 16:45, David Rosenstrauch <dar...@darose.net>
>>> wrote:
>>>
>>>> You could also just push the data to Amazon S3, which would decouple the
>>>> size of the cluster needed to process the data from the size of the data.
>>>>
>>>> DR
>>>>
>>>>
>>>> On 02/03/2015 11:43 AM, Joe Wass wrote:
>>>>
>>>>> I want to process about 800 GB of data on an Amazon EC2 cluster, so I
>>>>> need to store the input in HDFS somehow.
>>>>>
>>>>> I currently have a cluster of 5 x m3.xlarge, each of which has 80 GB of
>>>>> disk. Each HDFS node reports 73 GB, and the total capacity is ~370 GB.
>>>>>
>>>>> If I want to process 800 GB of data (assuming I can't split the jobs
>>>>> up), I'm guessing I need to get persistent-hdfs involved.
>>>>>
>>>>> 1 - Does persistent-hdfs have noticeably different performance than
>>>>> ephemeral-hdfs?
>>>>> 2 - If so, is there a recommended configuration (like storing input and
>>>>> output on persistent HDFS, but persisted RDDs on ephemeral)?
>>>>>
>>>>> This seems like a common use case, so sorry if this has already been
>>>>> covered.
>>>>>
>>>>> Joe
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> http://sites.google.com/site/krasser/?utm_source=sig
>>
>>
>>
>
