W.r.t. the cleaner TTL, please see: [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0
FYI

On Sun, Feb 21, 2016 at 4:16 AM, Gerard Maas <gerard.m...@gmail.com> wrote:
>
> It sounds like another window operation on top of the 30-min window will
> achieve the desired objective.
> Just keep in mind that you'll need to set the cleaner TTL (spark.cleaner.ttl)
> to a long enough value, and you will require enough resources (mem & disk)
> to keep the required data.
>
> -kr, Gerard.
>
> On Sun, Feb 21, 2016 at 12:54 PM, Jatin Kumar
> <jku...@rocketfuelinc.com.invalid> wrote:
>
>> Hello Spark users,
>>
>> I have to aggregate messages from Kafka, and at some fixed interval (say
>> every half hour) update a memory-persisted RDD and run some computation.
>> This computation uses the last one day of data. The steps are:
>>
>> - Read from the realtime Kafka topic X in Spark Streaming batches of 5 seconds
>> - Filter the messages of the above DStream and keep some of them
>> - Create windows of 30 minutes on the above DStream and aggregate by key
>> - Merge this 30-minute RDD with a memory-persisted RDD, say combinedRdd
>> - Maintain the last N such RDDs in a deque, persisting them on disk. While
>>   adding a new RDD, subtract the oldest RDD from combinedRdd.
>> - As a final step, consider the last N such windows (of 30 minutes each)
>>   and do the final aggregation
>>
>> Does the above way of using Spark Streaming look reasonable? Is there a
>> better way of doing the above?
>>
>> --
>> Thanks
>> Jatin
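
For concreteness, below is a minimal sketch of the pipeline Jatin describes and of Gerard's "window on top of the 30-min window" suggestion, written against the Spark 1.6 DStream API and the Kafka 0.8 direct connector. The broker address, topic "X", checkpoint path, app name, and the filter/key logic are placeholder assumptions, not details from the thread.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWindowedAggregation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWindowedAggregation")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second batches
    ssc.checkpoint("hdfs:///tmp/checkpoint")            // required for windowed state (placeholder path)

    // Read from the realtime Kafka topic X (placeholder broker)
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("X"))

    // Filter the messages and key them for aggregation (placeholder logic)
    val keyed = stream
      .filter { case (_, msg) => msg.nonEmpty }         // "keep some of them"
      .map    { case (_, msg) => (msg, 1L) }

    // 30-minute non-overlapping windows, aggregated by key
    val halfHourly = keyed.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,
      Minutes(30), Minutes(30))                         // slide == length: no overlap
    halfHourly.persist(StorageLevel.MEMORY_AND_DISK)

    // Gerard's suggestion: rather than hand-managing a deque of N RDDs and
    // subtracting the oldest from combinedRdd, put a one-day window over the
    // same keyed stream. The inverse function lets Spark incrementally
    // subtract the data falling out of the window.
    val daily = keyed.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,                      // data entering the window
      (a: Long, b: Long) => a - b,                      // data leaving the window
      Minutes(24 * 60),                                 // "last one day data"
      Minutes(30))                                      // recompute every half hour

    daily.foreachRDD { rdd =>
      // run the final computation over the last day's aggregates here
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that the inverse-reduce variant of reduceByKeyAndWindow requires checkpointing to be enabled, and it subsumes the manual "subtract the oldest RDD" bookkeeping. On Spark 1.x the long-lived windowed state is why Gerard warns about spark.cleaner.ttl; per SPARK-7689 that TTL-based cleaner is removed in Spark 2.0, so there is nothing to tune there.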