Hi, I am reading a json part files using Spark, it has 1.4 Million records and the input size is closer to 200 GB.
At the time of reading/infering schema, (spark.read.json) its throwing out of memory Exception. The job is running in the cluster, where i am providing 22g executor memory, 4 executor cores, 25g driver memory and i have ran in both the modes, client and cluster. It is working fine for the less amount of data (2.0 GB) I have further noticed by increasing the driver memory I could bring the job to read 2.4 GB of input data. Why this spark.read.json is running in the driver memory? Can we make it run on the executors? Any suggestions around it would highly appreciated. Our input records schema is dynamic. Its difficult to manually define the schema using structs. There are more than 1500 columns in the json which varies record to record and the json is a complex nested one. I have tried reading it with 1600 partitions and 2000+ partitions but nothing is working out. In the end i am getting the memory exception. If i am defining the schema for the json while reading, then the job is running smoothly. Only reason to fetch the schema is later we can use if for the flattening of json. Any help would be highly appreciated TIA.