Hi,
We have 30 million small files (about 100 KB each) on S3 to process. I am thinking
of loading them in parallel with something like the following:

val data = sc.union(sc.wholeTextFiles("s3a://.../*.json").map(...)).toDF()
data.createOrReplaceTempView("data")

How should I estimate the driver memory this needs? Is there a better
practice, or should I merge the files in a preprocessing step? Thanks in advance.
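
For scale, here is a rough sketch of the arithmetic in plain Scala (no Spark
needed). It assumes "100k" means about 100 KB per file, and the 128 MB target
size for merged files is a hypothetical value picked for illustration, not
something we have settled on.

```scala
// Back-of-the-envelope sizing; every constant is an assumption, not a measurement.
object SmallFileEstimate {
  val fileCount: Long = 30000000L                     // 30 million files
  val bytesPerFile: Long = 100L * 1000L               // assumes roughly 100 KB per file
  val targetMergedBytes: Long = 128L * 1000L * 1000L  // hypothetical 128 MB per merged file

  // Total bytes across all input files.
  val totalBytes: Long = fileCount * bytesPerFile

  // Number of merged files needed at the target size, rounded up.
  val mergedFileCount: Long =
    (totalBytes + targetMergedBytes - 1) / targetMergedBytes

  def main(args: Array[String]): Unit = {
    println(s"total input: ${totalBytes / 1e12} TB")
    println(s"merged files at target size: $mergedFileCount")
  }
}
```

Under those assumptions the input is about 3 TB in total, and merging would
cut the object count from 30 million to a few tens of thousands.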

Regards,
Shawn
