Hi,

We have 30 million small files (about 100 KB each) on S3 to process. I am thinking of something like the following to load them in parallel:

    sc.union(sc.wholeTextFiles("s3a://.../*.json").map(...)).toDF().createOrReplaceTempView("data")

How should I estimate the driver memory this needs? Is there a better practice, or should I merge the files in a preprocessing step?

Thanks in advance.

Regards,
Shawn
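
P.S. To put rough numbers on the question, here is a back-of-envelope sketch in plain Scala (no Spark needed). Only the file count and the file size come from our setup. The per-file driver overhead and the 128 MiB partition target are placeholders I picked for illustration, not measured values, and the object and field names are made up for this sketch.

```scala
// Back-of-envelope sizing for the job described above.
// Only numFiles and avgFileBytes come from the post; every other constant
// is a hypothetical placeholder to be replaced with measured values.
object SizingSketch {
  val numFiles: Long = 30000000L         // 30 million files
  val avgFileBytes: Long = 100L * 1000L  // 100 KB each, read as decimal kilobytes (assumption)

  // ASSUMPTION: driver heap held per listed file (path string, file status object).
  val assumedDriverBytesPerFile: Long = 500L

  // ASSUMPTION: amount of input a single task should read.
  val targetPartitionBytes: Long = 128L * 1024L * 1024L

  // Total input volume across all files.
  val totalBytes: Long = numFiles * avgFileBytes

  // Driver memory needed just to hold the file listing, under the assumed overhead.
  val driverListingBytes: Long = numFiles * assumedDriverBytesPerFile

  // How many small files fit into one partition of the target size.
  val filesPerPartition: Long = targetPartitionBytes / avgFileBytes

  // Number of partitions needed to cover the input (ceiling division).
  val partitions: Long = (totalBytes + targetPartitionBytes - 1) / targetPartitionBytes

  def main(args: Array[String]): Unit = {
    println(s"total input bytes:            $totalBytes")
    println(s"driver listing bytes (guess): $driverListingBytes")
    println(s"files per partition:          $filesPerPartition")
    println(s"partitions:                   $partitions")
  }
}
```

The point of the sketch is that the driver cost scales with the number of files rather than with their total size, which is why I am wondering whether merging the files first would help.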