read large number of files on s3

Xiaomeng Wan Tue, 08 Nov 2016 09:37:24 -0800

Hi,
We have 30 million small files (about 100 KB each) on S3 to process. I am
thinking of something like the following to load them in parallel:


// createOrReplaceTempView returns Unit, so there is nothing to assign to a val
sc.union(sc.wholeTextFiles("s3a://.../*.json").map(...))
  .toDF().createOrReplaceTempView("data")

How should I estimate how much driver memory to give it? Is there a better
practice, or should I merge the files in a preprocessing step? Thanks in advance.
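
For a rough sense of the scale involved, the figures above can be turned into
a back-of-envelope calculation. This is only a sketch in plain Scala (no Spark
required). It assumes "100k" means 100 KiB per file and that the driver's main
cost is holding the file listing. The per-file listing cost
(assumedListingBytesPerFile) is a hypothetical placeholder, not a measured
Spark or S3A figure, and should be replaced with a number measured on a small
sample.

```scala
// Back-of-envelope sizing for the workload described above.
object S3SizingSketch {
  val fileCount: Long = 30000000L        // from the post: 30 million files
  val bytesPerFile: Long = 100L * 1024L  // "100k each", read here as 100 KiB

  // ASSUMPTION: hypothetical driver-side cost of tracking one file
  // (path string plus file status). Not a measured value.
  val assumedListingBytesPerFile: Long = 200L

  // Total payload the executors would have to read.
  def totalDataBytes: Long = fileCount * bytesPerFile

  // Listing metadata the driver would hold, at the assumed per-file rate.
  def listingBytes: Long = fileCount * assumedListingBytesPerFile

  def main(args: Array[String]): Unit = {
    println(f"total data: ${totalDataBytes / 1e12}%.2f TB")
    println(f"listing metadata at the assumed rate: ${listingBytes / 1e9}%.1f GB")
  }
}
```

The listing estimate scales linearly with the assumed per-file cost, so that
one number matters more than anything else in the sketch.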

Regards,
Shawn
