Hello Spark users,

I need to aggregate messages from Kafka and, at some fixed interval (say
every half hour), update a memory-persisted RDD and run a computation.
This computation uses the last one day of data. The steps are:

- Read from the real-time Kafka topic X in Spark Streaming batches of 5 seconds
- Filter the messages in the above DStream, keeping only some of them
- Create 30-minute windows on the above DStream and aggregate by key
- Merge this 30-minute window RDD with a memory-persisted RDD, say combinedRdd
- Maintain the last N such RDDs in a deque, persisting them on disk. While
adding a new RDD, subtract the oldest RDD from combinedRdd.
- As the final step, consider the last N such windows (of 30 minutes each)
and do the final aggregation (a rough sketch of this wiring follows the list)
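
For concreteness, below is a rough Scala sketch of how I am thinking of
wiring these steps together with the Kafka direct stream. The broker
address, topic name, group id, checkpoint path, filter predicate, N = 48,
and the additive per-key count are all placeholders for my real logic:

import scala.collection.mutable

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object RollingWindowAgg {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RollingWindowAgg")
    val ssc  = new StreamingContext(conf, Seconds(5))  // step 1: 5-second batches
    ssc.checkpoint("/tmp/rolling-agg-checkpoint")      // placeholder path; also used by RDD.checkpoint below

    // Placeholder Kafka settings -- broker list, group id and topic are assumptions.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "rolling-agg",
      "auto.offset.reset"  -> "latest")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("X"), kafkaParams))

    // Step 2: keep only the messages of interest (the predicate and the
    // (key, count) extraction below are stand-ins for the real logic).
    val pairs = stream
      .filter(r => r.value != null && r.value.nonEmpty)
      .map(r => (r.key, 1L))

    // Step 3: non-overlapping 30-minute windows, aggregated by key.
    val windowed =
      pairs.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(30), Minutes(30))

    // Steps 4-6: driver-side state -- the last N window RDDs and the running combined RDD.
    val maxWindows  = 48                               // e.g. one day of 30-minute windows
    val lastWindows = mutable.Queue.empty[RDD[(String, Long)]]
    var combined: RDD[(String, Long)] = ssc.sparkContext.emptyRDD[(String, Long)]

    windowed.foreachRDD { windowRdd =>
      val w = windowRdd.persist(StorageLevel.DISK_ONLY)
      lastWindows.enqueue(w)

      // Add the new window; once more than N windows are held, subtract the
      // oldest by unioning its negated values (assumes the aggregation is additive).
      val evicted = if (lastWindows.size > maxWindows) Some(lastWindows.dequeue()) else None
      val updated = evicted.fold(combined.union(w)) { oldest =>
        combined.union(w).union(oldest.mapValues(v => -v))
      }

      val newCombined = updated
        .reduceByKey(_ + _)
        .filter { case (_, total) => total != 0L }
        .persist(StorageLevel.MEMORY_ONLY)
      newCombined.checkpoint()                         // truncate the ever-growing lineage
      newCombined.count()                              // materialise before releasing inputs

      combined.unpersist()
      evicted.foreach(_.unpersist())
      combined = newCombined

      // The final aggregation over the last N windows would go here.
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The subtraction step assumes the per-key aggregation is invertible
(counts/sums); if it were not, I suppose the combined RDD would have to be
rebuilt by unioning all N window RDDs each time instead.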

Does the above way of using Spark Streaming look reasonable? Is there a
better way of doing this?

--
Thanks
Jatin
