Hello, I am using Spark to convert JSON/Snappy files to ORC/ZLIB format. In short, my Java service starts an embedded Spark cluster (master=local[*]) and uses Spark SQL to convert JSON to ORC. However, I keep getting OOM errors with large (~1GB) files.
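For context, the conversion boils down to something like the following (a simplified sketch; the app name and paths are placeholders, and I'm assuming the .snappy extension is enough for the JSON source to pick the right codec):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class JsonToOrcConverter {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("json-to-orc")
                    .master("local[*]")
                    .getOrCreate();

            // Snappy-compressed JSON; the codec is inferred from the file extension
            Dataset<Row> dataSet = spark.read().json("/data/in/events.json.snappy");

            // Write ORC with ZLIB compression
            dataSet.write()
                   .option("compression", "zlib")
                   .orc("/data/out/events.orc");

            spark.stop();
        }
    }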
I've tried different ways to reduce memory usage, e.g. partitioning the data with dataSet.write().partitionBy("customer").save(filePath), or capping memory usage by setting spark.executor.memory=1G, but to no avail.

I am wondering: is there a way to avoid the OOM besides splitting the source JSON file into multiple smaller ones and processing them individually? Does Spark SQL have to read the JSON/Snappy (row-based) file in its entirety before converting it to ORC (columnar)? If so, would it make sense to create a custom receiver that reads the Snappy file and use Spark Streaming for the ORC conversion?
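To make that last question concrete, here is roughly the receiver I have in mind (a sketch only; the class name is made up, and I'm assuming the file uses Snappy framing that snappy-java's SnappyFramedInputStream can decode, rather than Hadoop's block-codec variant):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.spark.storage.StorageLevel;
    import org.apache.spark.streaming.receiver.Receiver;
    import org.xerial.snappy.SnappyFramedInputStream;

    // Hypothetical receiver that streams a Snappy-framed JSON file line by line
    public class SnappyJsonReceiver extends Receiver<String> {
        private final String path;

        public SnappyJsonReceiver(String path) {
            super(StorageLevel.MEMORY_AND_DISK());
            this.path = path;
        }

        @Override
        public void onStart() {
            // Read on a separate thread so onStart() returns immediately
            new Thread(this::readFile).start();
        }

        @Override
        public void onStop() {
            // Nothing to clean up; readFile() checks isStopped()
        }

        private void readFile() {
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new SnappyFramedInputStream(Files.newInputStream(Paths.get(path))),
                    StandardCharsets.UTF_8))) {
                String line;
                while (!isStopped() && (line = reader.readLine()) != null) {
                    store(line);  // hand one JSON record at a time to Spark Streaming
                }
            } catch (Exception e) {
                restart("Error reading " + path, e);
            }
        }
    }

Thanks, Alec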