RE: Reading parquet files in parallel on the cluster

2021-05-30 Thread Boris Litvak
for and launch a job per directory. What am I missing?

Boris

From: Eric Beabes
Sent: Wednesday, 26 May 2021 0:34
To: Sean Owen
Cc: Silvio Fiorito; spark-user
Subject: Re: Reading parquet files in parallel on the cluster

Right... but the problem is still the same, no? Those N Jobs (aka…

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
Thanks for your time & advice. We will experiment & see which works best for us: EMR or ECS.

On Tue, May 25, 2021 at 2:39 PM Sean Owen wrote:
> No, the work is happening on the cluster; you just have (say) 100 parallel
> jobs running at the same time. You apply spark.read.parquet to each dir…

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
No, the work is happening on the cluster; you just have (say) 100 parallel jobs running at the same time. You apply spark.read.parquet to each dir -- from the driver, yes, but spark.read is distributed. At extremes, yes, that would challenge the driver, to manage 1000s of jobs concurrently. You may a…

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
Right... but the problem is still the same, no? Those N jobs (aka Futures or Threads) will all be running on the Driver, each with its own SparkSession. Isn't that going to put a lot of burden on one machine? Is that really distributing the load across the cluster? Am I missing something? Would it…

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
What you could do is launch N Spark jobs in parallel from the driver. Each one would process a directory you supply with spark.read.parquet, for example. You would just have 10s or 100s of those jobs running at the same time. You have to write a bit of async code to do it, but it's pretty easy with…
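The async pattern Sean describes can be sketched with plain Scala Futures on a fixed-size thread pool, which also caps how many jobs the driver manages at once. This is a minimal sketch, not the author's exact code: `runPerDirectory` and `outputFor` are hypothetical names, and the Spark call shown in the comment assumes an existing SparkSession named `spark`.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Launch one task per directory, with at most `parallelism` in flight at a time.
// `work` stands in for the per-directory Spark job; on a real cluster it would
// be something like:
//   dir => spark.read.parquet(dir).coalesce(1).write.parquet(outputFor(dir))
def runPerDirectory[A](dirs: Seq[String], parallelism: Int)(work: String => A): Seq[A] = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = dirs.map(dir => Future(work(dir)))   // schedule all jobs
    Await.result(Future.sequence(futures), Duration.Inf) // results in input order
  } finally pool.shutdown()
}
```

Because each Future only submits a Spark job and waits, the driver threads stay cheap; the heavy lifting still happens on the executors.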

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
Here's the use case: we have a bunch of directories (over 1000), each containing tons of small files. Each directory is for a different customer, so they are independent in that respect. We need to merge all the small files in each directory into one (or a few) compacted file(s) by using a 'coalesce'…
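Per directory, the compaction described here is roughly a read–coalesce–write pipeline. A minimal sketch, assuming an existing SparkSession named `spark`; the S3 paths are hypothetical examples, not from the thread:

```scala
// Hypothetical example paths for one customer directory.
val inDir  = "s3://bucket/raw/customer-0001"
val outDir = "s3://bucket/compacted/customer-0001"

spark.read.parquet(inDir)   // read all the small files in the directory
  .coalesce(1)              // shrink to a single output partition
  .write
  .mode("overwrite")
  .parquet(outDir)          // one compacted file per directory
```

coalesce(1) avoids a full shuffle but funnels the directory through one task; for larger directories a small number greater than 1 (or repartition) may be the better trade-off.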

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Silvio Fiorito
Why not just read from Spark as normal? Do these files have different or incompatible schemas?

val df = spark.read.option("mergeSchema", "true").load(listOfPaths)

From: Eric Beabes
Date: Tuesday, May 25, 2021 at 1:24 PM
To: spark-user
Subject: Reading parquet files in parallel on the cluster

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
Right, you can't use Spark within Spark. Do you actually need to read Parquet like this vs spark.read.parquet? That's also parallel, of course. You'd otherwise be reading the files directly in your function with the Parquet APIs.

On Tue, May 25, 2021 at 12:24 PM Eric Beabes wrote:
> I've a use ca…