Re: Structured Streaming from Parquet

2017-05-25 Thread upendra 1991
Paul, did you try writing to disk rather than keeping it in memory? When files are large, then depending on whether you are optimizing for quality (performance) or for quantity, writing to disk would take the load off the executors and hand the data on to the stage where app2 formats it. Other options are to use Kafka…
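
[A minimal sketch of the suggestion above: have the first app write its output to a durable parquet sink on S3 instead of holding it in memory, so the second app can stream from that path. The bucket paths and the two-column schema are hypothetical, not from the thread.]

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object App1DiskSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("app1-disk-sink").getOrCreate()

    // A streaming file source needs an explicit schema; this one is a placeholder.
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("payload", StringType)))

    val records = spark.readStream
      .schema(schema)
      .json("s3a://bucket/incoming/")           // hypothetical source path

    records.writeStream
      .format("parquet")                        // durable sink instead of a memory sink
      .option("path", "s3a://bucket/stage1/")   // hypothetical output path for app2 to read
      .option("checkpointLocation", "s3a://bucket/checkpoints/app1/")
      .start()
      .awaitTermination()
  }
}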

Re: Structured Streaming from Parquet

2017-05-25 Thread Burak Yavuz
Hi Paul, from what you're describing, it seems that stream1 is possibly generating tons of small files and stream2 is OOMing because it tries to maintain an in-memory list of files. Some notes/questions: 1. Parquet files are splittable, therefore having large parquet files shouldn't be a problem…
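
[A minimal sketch of two mitigations for the small-files problem described above, with hypothetical paths and sizes: bound how many files each micro-batch lists with maxFilesPerTrigger, and coalesce before writing so each batch emits fewer, larger parquet files for the downstream consumer to track.]

import org.apache.spark.sql.SparkSession

object SmallFilesMitigation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("small-files-sketch")
      // Let the parquet streaming source infer its schema for this sketch.
      .config("spark.sql.streaming.schemaInference", "true")
      .getOrCreate()

    val staged = spark.readStream
      .option("maxFilesPerTrigger", "100")      // list at most 100 files per micro-batch
      .parquet("s3a://bucket/stage1/")          // hypothetical path written by stream1

    staged.coalesce(8)                          // 8 is an arbitrary example value
      .writeStream
      .format("parquet")
      .option("path", "s3a://bucket/stage2/")
      .option("checkpointLocation", "s3a://bucket/checkpoints/stage2/")
      .start()
      .awaitTermination()
  }
}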

Structured Streaming from Parquet

2017-05-25 Thread Paul Corley
I have a Spark Structured Streaming process that is implemented as 2 separate streaming apps. The first app reads in .gz files from S3, which range in size from 1 GB to 9 GB compressed, filters out invalid records, repartitions the data, and outputs parquet to S3, partitioned the same as the stream…
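
[A minimal sketch of the first app as described: read gzipped files from S3, filter invalid records, repartition, and write parquet back to S3 partitioned the same way. The paths, the line-delimited JSON payload, the event_date partition column, and the validity filter are all assumptions for illustration.]

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object App1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("app1").getOrCreate()

    // Placeholder schema; the real record layout isn't given in the thread.
    val schema = StructType(Seq(
      StructField("event_date", StringType),
      StructField("record", StringType)))

    spark.readStream
      .schema(schema)
      .json("s3a://bucket/raw/*.gz")            // .gz input is decompressed transparently
      .filter(col("record").isNotNull)          // placeholder for the real validity check
      .repartition(col("event_date"))           // repartition before the partitioned write
      .writeStream
      .format("parquet")
      .option("path", "s3a://bucket/parquet/")
      .option("checkpointLocation", "s3a://bucket/checkpoints/app1/")
      .partitionBy("event_date")                // same partitioning as the source stream
      .start()
      .awaitTermination()
  }
}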