I need to write a Spark Structured Streaming pipeline that involves multiple aggregations and splits the data into several sub-pipelines that are later unioned. It also needs stateful aggregation with a timeout.
Spark Structured Streaming supports all of the required functionality, but not within a single stream. I built a proof of concept that divides the pipeline into 3 sub-streams cascaded through Kafka, and it seems to work. However, I was wondering whether it would be a good idea to skip Kafka and use HDFS files as the integration layer between the streams instead. Or is there another way to cascade streams without needing an extra service like Kafka? Thanks.
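
To make the HDFS option concrete, this is roughly the handoff I have in mind: each sub-stream writes Parquet files to its own directory with the file sink, and the next sub-stream reads that directory with the file source. It is only a minimal sketch, assuming PySpark and Parquet. The helper names (`stage_paths`, `write_stage`, `read_stage`) and the directory layout are made up for illustration, and the functions expect an existing `SparkSession` and streaming DataFrame to be passed in.

```python
def stage_paths(base_dir, stage):
    """Return the data and checkpoint directories for one sub-stream.

    The layout <base_dir>/<stage>/{data,checkpoint} is a hypothetical
    convention, not something Spark requires.
    """
    return {
        "data": f"{base_dir}/{stage}/data",
        "checkpoint": f"{base_dir}/{stage}/checkpoint",
    }


def write_stage(df, base_dir, stage):
    """Start a streaming query that writes `df` as Parquet files.

    The file sink only supports append output mode, so an aggregated
    `df` needs a watermark before it can be written this way.
    """
    paths = stage_paths(base_dir, stage)
    return (
        df.writeStream
        .format("parquet")
        .option("path", paths["data"])
        .option("checkpointLocation", paths["checkpoint"])
        .outputMode("append")
        .start()
    )


def read_stage(spark, base_dir, stage, schema):
    """Open the files written by an upstream sub-stream as a new stream.

    Streaming file sources need an explicit schema unless schema
    inference has been enabled in the Spark configuration.
    """
    paths = stage_paths(base_dir, stage)
    return (
        spark.readStream
        .schema(schema)
        .format("parquet")
        .load(paths["data"])
    )
```

With this layout, the second sub-stream would call `read_stage(spark, "hdfs:///pipeline", "stage1", schema)` on the directory that the first one fills through `write_stage(df, "hdfs:///pipeline", "stage1")`. What I am unsure about is how this compares with Kafka in terms of latency and the number of small files it produces.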