I need to write a Spark Structured Streaming pipeline that involves
multiple aggregations, splitting the data into multiple sub-pipes, and
unioning them back together. It also needs a stateful aggregation with a
timeout.
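
For the stateful part, what I have in mind is roughly the sketch below,
using flatMapGroupsWithState with a processing-time timeout. The record,
state and result types, the Kafka topic and the parsing are placeholders,
not the real job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Placeholder record/state/result types, just to show the shape of the stateful step.
case class Event(key: String, value: Long)
case class AggState(count: Long, sum: Long)
case class AggResult(key: String, count: Long, sum: Long)

object StatefulSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stateful-sketch").getOrCreate()
    import spark.implicits._

    // Source and parsing are placeholders; in my PoC this comes from a Kafka topic.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events-in")
      .load()
      .selectExpr("CAST(value AS STRING) AS raw")
      .as[String]
      .map(raw => Event(raw, 1L)) // real parsing omitted

    // Stateful aggregation with a per-key processing-time timeout.
    val results = events
      .groupByKey(_.key)
      .flatMapGroupsWithState[AggState, AggResult](
        OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout) {
        (key, rows, state: GroupState[AggState]) =>
          if (state.hasTimedOut) {
            // Key has been idle long enough: emit the aggregate and drop the state.
            val s = state.get
            state.remove()
            Iterator(AggResult(key, s.count, s.sum))
          } else {
            // Fold the new rows into the existing state and reset the timeout.
            val old = state.getOption.getOrElse(AggState(0L, 0L))
            val updated = rows.foldLeft(old)((s, e) => AggState(s.count + 1, s.sum + e.value))
            state.update(updated)
            state.setTimeoutDuration("10 minutes")
            Iterator.empty
          }
      }

    results.writeStream
      .format("console")
      .outputMode("append")
      .option("checkpointLocation", "/tmp/stateful-sketch-checkpoint")
      .start()
      .awaitTermination()
  }
}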

Spark Structured Streaming supports all of the required functionality, but
not within a single stream. I did a proof of concept that divides the
pipeline into 3 sub-streams cascaded using Kafka, and it seems to work. But
I was wondering whether it would be a good idea to skip Kafka and use HDFS
files as the integration layer, or whether there is another way to cascade
streams without needing an extra service like Kafka.
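
To make the question more concrete, the HDFS alternative I am thinking of
looks roughly like this: one query writes its intermediate output as
Parquet files, and the next query picks that directory up with a file
source. Paths, schema, checkpoint locations and the actual aggregations are
placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Rough sketch of cascading two streaming stages through HDFS files instead of Kafka.
object FileCascadeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("file-cascade-sketch").getOrCreate()

    val schema = new StructType()
      .add("key", StringType)
      .add("value", LongType)

    // Stage 1: whatever this stage computes (omitted here) is written as Parquet on HDFS.
    val stage1 = spark.readStream.schema(schema).json("hdfs:///pipeline/input")
    stage1.writeStream
      .format("parquet")
      .option("path", "hdfs:///pipeline/stage1-out")
      .option("checkpointLocation", "hdfs:///pipeline/stage1-chk")
      .start()

    // Stage 2: a separate query reads the same directory back with a file source
    // and continues with the next aggregation. This is the hand-off I would
    // otherwise do through a Kafka topic.
    val stage2 = spark.readStream.schema(schema).parquet("hdfs:///pipeline/stage1-out")
    stage2.groupBy("key").count()
      .writeStream
      .format("console")
      .outputMode("complete")
      .option("checkpointLocation", "hdfs:///pipeline/stage2-chk")
      .start()

    spark.streams.awaitAnyTermination()
  }
}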

Thanks,
