Hello, I am using Spark to convert JSON/Snappy files to ORC/ZLIB format. In short, my Java service starts an embedded Spark cluster (master=local[*]) and uses Spark SQL to convert JSON to ORC. However, I keep getting OOM errors with large (~1GB) files.
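For context, the conversion boils down to something like the following (a simplified sketch; the app name and paths are placeholders, and I'm assuming the .snappy extension is enough for the JSON source to pick the right codec):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class JsonToOrcConverter {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("json-to-orc")
                    .master("local[*]")
                    .getOrCreate();

            // Snappy-compressed JSON; the codec is inferred from the file extension
            Dataset<Row> dataSet = spark.read().json("/data/in/events.json.snappy");

            // Write ORC with ZLIB compression
            dataSet.write()
                   .option("compression", "zlib")
                   .orc("/data/out/events.orc");

            spark.stop();
        }
    }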
I've tried different ways to reduce memory usage, e.g. partitioning the data with dataSet.write().partitionBy("customer").save(filePath), or capping memory usage by setting spark.executor.memory=1G, but to no avail.

I am wondering: is there a way to avoid the OOM besides splitting the source JSON file into multiple smaller ones and processing them individually? Does Spark SQL have to read the JSON/Snappy (row-based) file in its entirety before converting it to ORC (columnar)? If so, would it make sense to create a custom receiver that reads the Snappy file and use Spark Streaming for the ORC conversion?
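To make that last question concrete, here is roughly the receiver I have in mind (a sketch only; the class name is made up, and I'm assuming the file uses Snappy framing that snappy-java's SnappyFramedInputStream can decode, rather than Hadoop's block-codec variant):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.spark.storage.StorageLevel;
    import org.apache.spark.streaming.receiver.Receiver;
    import org.xerial.snappy.SnappyFramedInputStream;

    // Hypothetical receiver that streams a Snappy-framed JSON file line by line
    public class SnappyJsonReceiver extends Receiver<String> {
        private final String path;

        public SnappyJsonReceiver(String path) {
            super(StorageLevel.MEMORY_AND_DISK());
            this.path = path;
        }

        @Override
        public void onStart() {
            // Read on a separate thread so onStart() returns immediately
            new Thread(this::readFile).start();
        }

        @Override
        public void onStop() {
            // Nothing to clean up; readFile() checks isStopped()
        }

        private void readFile() {
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new SnappyFramedInputStream(Files.newInputStream(Paths.get(path))),
                    StandardCharsets.UTF_8))) {
                String line;
                while (!isStopped() && (line = reader.readLine()) != null) {
                    store(line);  // hand one JSON record at a time to Spark Streaming
                }
            } catch (Exception e) {
                restart("Error reading " + path, e);
            }
        }
    }

Thanks, Alec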