RE: Spark performance over S3

Boris Litvak Tue, 06 Apr 2021 23:18:27 -0700

Hi Tzahi,

I don’t know the reason for that, though I’d check that the fs.s3a 
implementation is using multipart uploads, which I assume it does.
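
A rough sketch of such a check, in Python: it filters a list of Spark 
configuration pairs down to the S3A multipart properties (such as 
fs.s3a.multipart.size and fs.s3a.multipart.threshold). The helper name and 
the sample values are hypothetical, and properties left at their Hadoop 
defaults would not show up here, since only explicitly set keys are listed.

```python
# Prefix shared by the S3A multipart properties when they are passed
# through Spark's spark.hadoop.* mechanism.
MULTIPART_PREFIX = "spark.hadoop.fs.s3a.multipart."


def multipart_settings(conf_pairs):
    """Return only the explicitly set S3A multipart properties."""
    return {k: v for k, v in conf_pairs if k.startswith(MULTIPART_PREFIX)}


# Hypothetical sample input. On a live PySpark job the pairs could come
# from spark.sparkContext.getConf().getAll().
sample_pairs = [
    ("spark.hadoop.fs.s3a.connection.maximum", "200"),
    ("spark.hadoop.fs.s3a.multipart.size", "104857600"),
]

print(multipart_settings(sample_pairs))
```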


I would say that none of the comments in the link are relevant to you, as the 
VPC endpoint is more of a security feature than a performance feature.

I got an answer from AWS support recently saying that they tested this against 
S3 access via the public internet and the differences were negligible.
It is always possible that it was not tested in your region, but that’s 
unlikely. Anyway, you can provision and test this with the AWS CLI.

Another option is to compare this with EMRFS performance, though I know that 
requires you to put in some work.

Boris

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Tuesday, 6 April 2021 22:24
To: Tzahi File <tzahi.f...@ironsrc.com>
Cc: user <user@spark.apache.org>
Subject: Re: Spark performance over S3

Hi Tzahi,

That is a huge cost. So that I can understand the question before answering it:
1. What Spark version are you using?
2. What SQL code are you using to read and write?

There are several other questions that are pertinent, but the above will be a 
great starting point.

Regards,
Gourav Sengupta

On Tue, Apr 6, 2021 at 7:46 PM Tzahi File <tzahi.f...@ironsrc.com> wrote:

Hi All,

We have a Spark cluster on AWS EC2 with 60 x i3.4xlarge instances.

The Spark job running on that cluster reads from an S3 bucket and writes to 
the same bucket.

The bucket and the EC2 instances are in the same region.

As part of our efforts to reduce the runtime of our Spark jobs, we found that 
there is serious latency when reading from S3.

When the job:
- reads the parquet files from S3 and also writes to S3, it takes 22 min
- reads the parquet files from S3 and writes to its local HDFS, it takes the 
  same amount of time (±22 min)
- reads the parquet files from its local HDFS (they were copied there from S3 
  beforehand) and writes to its local HDFS, it takes 7 min
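
To make the comparison explicit, here is a small arithmetic sketch in Python 
using the three timings above. It treats the "±22 min" figure as exactly 22, 
which is an assumption, and the variable names are only illustrative.

```python
# Wall-clock times reported above, in minutes.
s3_read_s3_write = 22
s3_read_hdfs_write = 22    # reported as "±22 min"; taken as 22 here
hdfs_read_hdfs_write = 7

# Changing only the write target (S3 -> HDFS) saves nothing.
write_side_cost = s3_read_s3_write - s3_read_hdfs_write

# Changing only the read source (S3 -> HDFS) accounts for the whole gap.
read_side_cost = s3_read_hdfs_write - hdfs_read_hdfs_write

# How much slower the job is when it reads from S3 instead of HDFS.
slowdown = s3_read_hdfs_write / hdfs_read_hdfs_write

print(write_side_cost, read_side_cost, round(slowdown, 2))  # prints: 0 15 3.14
```

So essentially all of the extra 15 minutes follows the read path, not the 
write path.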

The Spark job has the following S3-related configuration:
- spark.hadoop.fs.s3a.connection.establish.timeout=5000
- spark.hadoop.fs.s3a.connection.maximum=200
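
For reference, a minimal Python sketch of how these two settings map onto 
spark-submit `--conf key=value` arguments. The helper name is hypothetical, 
and this is only one of several ways to pass the settings.

```python
# The two S3A settings listed above, as Spark configuration pairs.
s3a_conf = {
    "spark.hadoop.fs.s3a.connection.establish.timeout": "5000",
    "spark.hadoop.fs.s3a.connection.maximum": "200",
}


def to_submit_args(conf):
    """Turn a config dict into a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(s3a_conf)))
```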

When reading from S3, we tried increasing the 
spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 and to 
900, but it didn't reduce the S3 latency.

Do you have any idea what causes the read latency from S3?

I saw this post on improving the transfer speed:
https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/
Is anything there relevant?


Thanks,
Tzahi
