Hi,

In my Spark batch job, the current workflow is:

step 1: the driver assigns a partition of the JSON file path list to each executor.
step 2: each executor fetches its assigned JSON files from S3 and saves them into HDFS.
step 3: the driver reads these JSON files into a DataFrame and saves it as Parquet.

To improve performance by avoiding writing the JSONs to HDFS, I want to change the workflow to:

step 1: the driver assigns a partition of the JSON file path list to each executor.
step 2: each executor fetches its assigned JSON files from S3, merges the JSON content in memory, and writes it directly to Parquet. No JSONs are written to HDFS at all.

The problem is that I cannot create DataFrames inside executors. Is this improvement feasible? Appreciate any help!
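To make the intended per-executor step concrete, here is a minimal sketch in plain Python of what each executor would conceptually do inside something like `mapPartitions`: fetch its assigned files and merge the records in memory instead of staging them in HDFS. This is only an illustration of the idea, not Spark code; `fetch_and_merge` is a hypothetical helper, and local file paths stand in for S3 keys (a real job would fetch them with an S3 client such as boto3).

```python
import json
import os
import tempfile

def fetch_and_merge(paths):
    """Read each JSON file assigned to this partition and merge the
    records in memory. In the real job each path would be an S3 key
    fetched over the network; plain local reads stand in here."""
    records = []
    for p in paths:
        with open(p) as f:
            # Assumes each file holds a JSON array of records.
            records.extend(json.load(f))
    return records

# Tiny local demo: two "downloaded" JSON files for one partition.
tmp = tempfile.mkdtemp()
paths = []
for i, batch in enumerate([[{"id": 1}], [{"id": 2}, {"id": 3}]]):
    p = os.path.join(tmp, f"part{i}.json")
    with open(p, "w") as f:
        json.dump(batch, f)
    paths.append(p)

merged = fetch_and_merge(paths)
print(len(merged))  # all records merged in memory, never written to HDFS
```

The open question is the last part: turning `merged` into Parquet from inside an executor, which is where the "no DataFrames on executors" restriction bites.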