RE: Spark performance over S3

2021-04-07 Thread Boris Litvak
Oh, Tzahi, I misread the metrics in the first reply. It’s about reads indeed, not writes.

From: Tzahi File
Sent: Wednesday, 7 April 2021 16:02
To: Hariharan
Cc: user
Subject: Re: Spark performance over S3

Hi Hariharan, Thanks for your reply. In both cases we are writing the data to S3. The

Re: Spark performance over S3

2021-04-07 Thread Tzahi File
Hi Hariharan, Thanks for your reply. In both cases we are writing the data to S3. The difference is that in the first case we read the data from S3 and in the second we read from HDFS. We are using the ListObjectsV2 API in S3A. The S3 bucket and t
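For readers following the thread: in the Hadoop S3A connector the listing API is selected with the fs.s3a.list.version option (2 means ListObjectsV2), and Spark forwards Hadoop options that carry the spark.hadoop. prefix. The snippet below is only a minimal sketch of building the matching spark-submit flags. The helper name s3a_conf_args is hypothetical, and nothing here is taken from the poster's actual job.

```python
def s3a_conf_args(hadoop_options):
    """Turn Hadoop options into spark-submit --conf arguments.

    Hypothetical helper for illustration; Spark passes any
    spark.hadoop.* setting through to the Hadoop configuration.
    """
    args = []
    for key, value in sorted(hadoop_options.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# fs.s3a.list.version selects the S3 list API (2 = ListObjectsV2).
options = {"fs.s3a.list.version": "2"}
print(" ".join(s3a_conf_args(options)))
```

The printed flags can be appended to a spark-submit command line. Whether changing this option affects read time for a given job is something to measure, not assume.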

Re: Spark performance over S3

2021-04-07 Thread Vladimir Prus
A VPC endpoint can also make a major difference in costs. Without it, access to S3 incurs data transfer costs and NAT costs, and these can be large.

On Wed, 7 Apr 2021 at 14:13, Hariharan wrote:
> Hi Tzahi,
>
> Comparing the first two cases:
>
> - reads the parquet files from S3 and also writ

Re: Spark performance over S3

2021-04-07 Thread Hariharan
Hi Tzahi,

Comparing the first two cases:

- reads the parquet files from S3 and also writes to S3, it takes 22 min
- reads the parquet files from S3 and writes to its local HDFS, it takes the same amount of time (±22 min)

It looks like most of the time is being spent in reading, and the time s

RE: Spark performance over S3

2021-04-06 Thread Boris Litvak
n to compare this with EMRFS performance … I know it requires you to put in some work.

Boris

From: Gourav Sengupta
Sent: Tuesday, 6 April 2021 22:24
To: Tzahi File
Cc: user
Subject: Re: Spark performance over S3

Hi Tzahi, that is a huge cost. So that I can understand the question before answe

Re: Spark performance over S3

2021-04-06 Thread Gourav Sengupta
Hi Tzahi, that is a huge cost. So that I can understand the question before answering it:

1. What is the Spark version that you are using?
2. What is the SQL code that you are using to read and write?

There are several other questions that are pertinent, but the above will be a great starting poi