Hi,

I am reading JSON part files using Spark; the input has 1.4 million records and
its total size is close to 200 GB.

While reading and inferring the schema (spark.read.json), it throws an
out-of-memory exception.
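
For context, the read is just the default inference call, roughly like the
snippet below (the path is a placeholder, not our real input location):

    // Schema-inference read that fails with OOM; the path is a placeholder.
    val df = spark.read.json("s3://bucket/input/json-part-files/")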

The job runs on a cluster with 22g executor memory, 4 executor cores, and 25g
driver memory, and I have tried both client and cluster deploy modes.

It works fine for a smaller amount of data (2.0 GB).

I have also noticed that by increasing the driver memory I could get the job to
read 2.4 GB of input data.

Why does spark.read.json run in driver memory? Can we make it run on the
executors?

Any suggestions would be highly appreciated.

Our input records' schema is dynamic, so it is difficult to define the schema
manually using structs. There are more than 1500 columns in the JSON, they vary
from record to record, and the JSON is complex and deeply nested.

I have tried reading it with 1600 partitions and with 2000+ partitions, but
neither helps.

In the end I still get the out-of-memory exception.


If I define the schema for the JSON while reading, the job runs smoothly.
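
For example, a read with a hand-written schema along these lines goes through
fine (this is just a two-field illustration, not our real 1500+ column schema):

    import org.apache.spark.sql.types._

    // Simplified illustration only; the real schema has 1500+ nested fields.
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("payload", StructType(Seq(
        StructField("value", DoubleType)
      )))
    ))

    // No inference needed when the schema is supplied up front.
    val df = spark.read.schema(schema).json("s3://bucket/input/json-part-files/")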


The only reason to infer the schema is so that we can later use it to flatten
the JSON.
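
To show what we need the schema for, this is roughly the flattening we run
afterwards (a sketch in Scala; flattenSchema is our own helper, and it only
handles nested structs, not arrays):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{StructField, StructType}

    // Walk the inferred schema and collect dotted paths to all leaf fields.
    def flattenSchema(schema: StructType, prefix: String = ""): Seq[String] =
      schema.fields.toSeq.flatMap {
        case StructField(name, nested: StructType, _, _) => flattenSchema(nested, s"$prefix$name.")
        case StructField(name, _, _, _)                  => Seq(s"$prefix$name")
      }

    // Select each leaf as a top-level column, replacing dots with underscores.
    val flatCols = flattenSchema(df.schema).map(p => col(p).alias(p.replace(".", "_")))
    val flatDf   = df.select(flatCols: _*)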

Any help would be highly appreciated.


TIA.
