Hi,

In my Spark batch job, the current workflow is:

step 1: the driver assigns a partition of the JSON file path list to each executor.
step 2: each executor fetches its assigned JSON files from S3 and saves them into HDFS.
step 3: the driver reads these JSON files into a DataFrame and saves it as Parquet.

To improve performance by avoiding writing the JSONs to HDFS, I want to change the workflow to:

step 1: the driver assigns a partition of the JSON file path list to each executor.
step 2: each executor fetches its assigned JSON files from S3, merges the JSON content in memory, and writes it directly to Parquet. No JSONs are written to HDFS at all.

The problem is that I cannot create DataFrames inside executors. Is this improvement feasible? Appreciate any help!
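To make the intended per-executor step concrete, here is a minimal sketch in plain Python of what each executor would conceptually do inside something like `mapPartitions`: fetch its assigned files and merge the records in memory instead of staging them in HDFS. This is only an illustration of the idea, not Spark code; `fetch_and_merge` is a hypothetical helper, and local file paths stand in for S3 keys (a real job would fetch them with an S3 client such as boto3).

```python
import json
import os
import tempfile

def fetch_and_merge(paths):
    """Read each JSON file assigned to this partition and merge the
    records in memory. In the real job each path would be an S3 key
    fetched over the network; plain local reads stand in here."""
    records = []
    for p in paths:
        with open(p) as f:
            # Assumes each file holds a JSON array of records.
            records.extend(json.load(f))
    return records

# Tiny local demo: two "downloaded" JSON files for one partition.
tmp = tempfile.mkdtemp()
paths = []
for i, batch in enumerate([[{"id": 1}], [{"id": 2}, {"id": 3}]]):
    p = os.path.join(tmp, f"part{i}.json")
    with open(p, "w") as f:
        json.dump(batch, f)
    paths.append(p)

merged = fetch_and_merge(paths)
print(len(merged))  # all records merged in memory, never written to HDFS
```

The open question is the last part: turning `merged` into Parquet from inside an executor, which is where the "no DataFrames on executors" restriction bites.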