W.r.t. the cleaner TTL, please see: [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0
FYI

On Sun, Feb 21, 2016 at 4:16 AM, Gerard Maas <gerard.m...@gmail.com> wrote:
>
> It sounds like another window operation on top of the 30-min window will
> achieve the desired objective.
> Just keep in mind that you'll need to set the cleaner TTL (spark.cleaner.ttl)
> to a long enough value, and you will require enough resources (mem & disk)
> to keep the required data.
>
> -kr, Gerard.
>
> On Sun, Feb 21, 2016 at 12:54 PM, Jatin Kumar
> <jku...@rocketfuelinc.com.invalid> wrote:
>
>> Hello Spark users,
>>
>> I have to aggregate messages from Kafka, and at some fixed interval (say
>> every half hour) update a memory-persisted RDD and run some computation.
>> This computation uses the last one day of data. The steps are:
>>
>> - Read from the realtime Kafka topic X in Spark Streaming batches of 5 seconds
>> - Filter the messages of the above DStream and keep some of them
>> - Create windows of 30 minutes on the above DStream and aggregate by key
>> - Merge this 30-minute RDD with a memory-persisted RDD, say combinedRdd
>> - Maintain the last N such RDDs in a deque, persisting them on disk. While
>>   adding a new RDD, subtract the oldest RDD from combinedRdd.
>> - As a final step, consider the last N such windows (of 30 minutes each)
>>   and do the final aggregation
>>
>> Does the above way of using Spark Streaming look reasonable? Is there a
>> better way of doing the above?
>>
>> --
>> Thanks
>> Jatin
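
For concreteness, below is a minimal sketch of the pipeline Jatin describes and of Gerard's "window on top of the 30-min window" suggestion, written against the Spark 1.6 DStream API and the Kafka 0.8 direct connector. The broker address, topic "X", checkpoint path, app name, and the filter/key logic are placeholder assumptions, not details from the thread.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWindowedAggregation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWindowedAggregation")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second batches
    ssc.checkpoint("hdfs:///tmp/checkpoint")            // required for windowed state (placeholder path)

    // Read from the realtime Kafka topic X (placeholder broker)
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("X"))

    // Filter the messages and key them for aggregation (placeholder logic)
    val keyed = stream
      .filter { case (_, msg) => msg.nonEmpty }         // "keep some of them"
      .map    { case (_, msg) => (msg, 1L) }

    // 30-minute non-overlapping windows, aggregated by key
    val halfHourly = keyed.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,
      Minutes(30), Minutes(30))                         // slide == length: no overlap
    halfHourly.persist(StorageLevel.MEMORY_AND_DISK)

    // Gerard's suggestion: rather than hand-managing a deque of N RDDs and
    // subtracting the oldest from combinedRdd, put a one-day window over the
    // same keyed stream. The inverse function lets Spark incrementally
    // subtract the data falling out of the window.
    val daily = keyed.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,                      // data entering the window
      (a: Long, b: Long) => a - b,                      // data leaving the window
      Minutes(24 * 60),                                 // "last one day data"
      Minutes(30))                                      // recompute every half hour

    daily.foreachRDD { rdd =>
      // run the final computation over the last day's aggregates here
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that the inverse-reduce variant of reduceByKeyAndWindow requires checkpointing to be enabled, and it subsumes the manual "subtract the oldest RDD" bookkeeping. On Spark 1.x the long-lived windowed state is why Gerard warns about spark.cleaner.ttl; per SPARK-7689 that TTL-based cleaner is removed in Spark 2.0, so there is nothing to tune there.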