Hello Spark users,

I need to aggregate messages from Kafka and, at some fixed interval (say every half hour), update a memory-persisted RDD and run some computation. This computation uses the last one day of data. The steps are:
- Read from the realtime Kafka topic X in Spark Streaming batches of 5 seconds
- Filter the messages in the above DStream and keep only some of them
- Create 30-minute windows on the filtered DStream and aggregate by key
- Merge each 30-minute RDD into a memory-persisted RDD, say combinedRdd
- Maintain the last N such RDDs in a deque, persisting them on disk; while adding a new RDD, subtract the oldest RDD from combinedRdd
- Final step: consider the last N such windows (of 30 minutes each) and do the final aggregation (a rough code sketch of these steps is appended below)

Does the above way of using Spark Streaming look reasonable? Is there a better way of doing this?

--
Thanks
Jatin
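P.S. Here is a rough Scala sketch of what I have in mind, assuming the spark-streaming-kafka-0-10 direct stream, String-valued messages and Long counts. keepMessage, toKeyValue, lastN and the broker/topic settings are placeholders for my actual logic, not anything prescribed by Spark:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

import scala.collection.mutable

object HalfHourlyAggregation {

  // Placeholders for my real logic -- not part of any Spark/Kafka API.
  def keepMessage(value: String): Boolean = value.nonEmpty
  def toKeyValue(value: String): (String, Long) = (value.split(",")(0), 1L)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("half-hourly-aggregation")
    val ssc = new StreamingContext(conf, Seconds(5))  // 5-second batches
    ssc.checkpoint("/tmp/agg-checkpoint")             // also used to truncate combinedRdd's lineage

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "half-hourly-agg",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    // Steps 1-2: read topic X in 5-second batches and keep only the messages I need.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("X"), kafkaParams))
    val filtered = stream.map(_.value).filter(keepMessage).map(toKeyValue)

    // Step 3: 30-minute tumbling windows, aggregated by key.
    val halfHourly =
      filtered.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(30), Minutes(30))

    // Steps 4-6: keep the last N half-hour RDDs in a deque plus a running combinedRdd.
    val lastN = 48  // one day of 30-minute windows
    val windows = mutable.Queue.empty[RDD[(String, Long)]]
    var combinedRdd: RDD[(String, Long)] = ssc.sparkContext.emptyRDD[(String, Long)]

    halfHourly.foreachRDD { rdd =>
      // Persist and materialize the new 30-minute window before tracking it.
      val window = rdd.persist(StorageLevel.MEMORY_AND_DISK)
      window.count()
      windows.enqueue(window)

      val previous = combinedRdd

      // Merge the new window's per-key totals into combinedRdd.
      var updated = previous.union(window).reduceByKey(_ + _)

      // Evict the oldest window by subtracting its per-key totals.
      val evicted = if (windows.size > lastN) Some(windows.dequeue()) else None
      evicted.foreach { oldest =>
        updated = updated
          .leftOuterJoin(oldest)
          .mapValues { case (total, old) => total - old.getOrElse(0L) }
          .filter { case (_, total) => total != 0L }
      }

      combinedRdd = updated.persist(StorageLevel.MEMORY_AND_DISK)
      if (windows.size % 10 == 0) combinedRdd.checkpoint() // truncate the growing lineage now and then
      combinedRdd.count()                                  // materialize before dropping old RDDs
      previous.unpersist()
      evicted.foreach(_.unpersist())

      // Final step: run the one-day computation over combinedRdd here.
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

One thing I am unsure about is that the lineage of combinedRdd keeps growing with every union/join, which is why the sketch checkpoints it occasionally.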
