Paul,
Did you try writing to disk rather than keeping everything in memory? When files are large, depending on whether you want quality (performance) or quantity, writing to disk would take the load off the executors and push it to the stage where you format your data in app2.
Other options are to use Kafka s
Hi Paul,
From what you're describing, it seems that stream1 is possibly generating
tons of small files and stream2 is OOMing because it tries to maintain an
in-memory list of files. Some notes/questions:
1. Parquet files are splittable, therefore having large parquet files
shouldn't be a problem.
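If stream1 is indeed producing lots of small files, one common mitigation is to coalesce each micro-batch down to a few partitions before the write, so each trigger emits fewer, larger Parquet files. A minimal sketch (the function name, paths, and partition count are placeholders, not from this thread):

```python
def write_fewer_files(df, out_path, checkpoint_path, partitions=8):
    # Hedged sketch: coalesce the streaming DataFrame so each micro-batch
    # writes at most `partitions` Parquet files instead of one per task.
    # out_path / checkpoint_path are placeholder S3 locations.
    return (
        df.coalesce(partitions)          # fewer output files per trigger
          .writeStream
          .format("parquet")
          .option("path", out_path)
          .option("checkpointLocation", checkpoint_path)
          .start()
    )
```

With fewer files per trigger, the downstream stream also has a much smaller file listing to keep in memory.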
I have a Spark Structured Streaming process that is implemented as 2 separate
streaming apps.
The first app reads .gz files, which range in size from 1GB to 9GB compressed,
from S3, filters out invalid records, repartitions the data, and outputs
Parquet to S3, partitioned the same as the stre