for and launch a job per directory.
What am I missing?
Boris
From: Eric Beabes
Sent: Wednesday, 26 May 2021 0:34
To: Sean Owen
Cc: Silvio Fiorito; spark-user
Subject: Re: Reading parquet files in parallel on the cluster
Thanks for your time & advice. We will experiment & see which works best
for us: EMR or ECS.
On Tue, May 25, 2021 at 2:39 PM Sean Owen wrote:
> No, the work is happening on the cluster; you just have (say) 100 parallel
> jobs running at the same time. You apply spark.read.parquet to each dir
No, the work is happening on the cluster; you just have (say) 100 parallel
jobs running at the same time. You apply spark.read.parquet to each dir --
from the driver, yes, but spark.read is distributed. At extremes, yes, that
would challenge the driver to manage 1000s of jobs concurrently. You may
a
Right... but the problem is still the same, no? Those N Jobs (aka Futures
or Threads) will all be running on the Driver. Each with its own
SparkSession. Isn't that going to put a lot of burden on one machine? Is
that really distributing the load across the cluster? Am I missing
something?
What you could do is launch N Spark jobs in parallel from the driver. Each
one would process a directory you supply with spark.read.parquet, for
example. You would just have 10s or 100s of those jobs running at the same
time. You have to write a bit of async code to do it, but it's pretty easy
wit
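The pattern Sean describes can be sketched with plain Scala Futures. In this sketch, compact is a hypothetical stand-in for the real per-directory Spark job (something like spark.read.parquet(dir).coalesce(1).write.parquet(...), which runs on the cluster; the driver only coordinates), and the bounded thread pool keeps the driver from juggling thousands of in-flight jobs at once. Names like CompactDirs and compactAll are illustrative, not from the thread:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object CompactDirs {
  // Hypothetical stand-in for the real per-directory Spark job, e.g.
  //   spark.read.parquet(dir).coalesce(1).write.parquet(dir + "-compacted")
  // The actual read/write work would be distributed across the cluster;
  // this thread on the driver only submits and waits.
  def compact(dir: String): String = s"compacted:$dir"

  // Launch one job per directory, with at most `parallelism` in flight,
  // so the driver is never managing 1000s of concurrent jobs.
  def compactAll(dirs: Seq[String], parallelism: Int = 16): Seq[String] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(pool)
    try Await.result(
      Future.sequence(dirs.map(d => Future(compact(d)))),
      10.minutes
    )
    finally pool.shutdown()
  }
}
```

Future.sequence preserves input order, so compactAll returns one result per directory in the order supplied; a single shared SparkSession would serve all of these threads.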
Here's the use case:
We've a bunch of directories (over 1000), each containing tons of small
files. Each directory is for a different customer, so they are independent
in that respect. We need to merge all the small files in each directory
into one (or a few) compacted file(s) by using a 'coalesce'.
Why not just read from Spark as normal? Do these files have different or
incompatible schemas?
val df = spark.read.option("mergeSchema", "true").load(listOfPaths)
From: Eric Beabes
Date: Tuesday, May 25, 2021 at 1:24 PM
To: spark-user
Subject: Reading parquet files in parallel on the cluster
Right, you can't use Spark within Spark.
Do you actually need to read Parquet like this vs spark.read.parquet?
That's also parallel, of course.
You'd otherwise be reading the files directly in your function with the
Parquet APIs.
On Tue, May 25, 2021 at 12:24 PM Eric Beabes
wrote:
> I've a use case