w.r.t. the new mapWithState(), there have been some bug fixes since the
release of 1.6.0
e.g.

SPARK-13121 java mapWithState mishandles scala Option

Looks like 1.6.1 RC should come out next week.

FYI

On Sun, Feb 21, 2016 at 10:47 AM, Chris Fregly <ch...@fregly.com> wrote:

> good catch on the cleaner.ttl
>
> @jatin-  when you say "memory-persisted RDD", what do you mean exactly?
>  and how much data are you talking about?  remember that spark can evict
> these memory-persisted RDDs at any time.  they can be recovered from Kafka,
> but this is not a good situation to be in.
>
> also, is this spark 1.6 with the new mapWithState() or the old
> updateStateByKey()?  you definitely want the newer 1.6 mapWithState().
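>
> for reference, a minimal mapWithState() sketch (spark 1.6, scala; the
> key/state types and the socket source here are just placeholders for your
> keyed kafka stream):
>
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}
>
> val conf = new SparkConf().setAppName("map-with-state-sketch")
> val ssc = new StreamingContext(conf, Seconds(5))
> ssc.checkpoint("/tmp/checkpoint")   // mapWithState requires checkpointing
>
> // placeholder source; in practice this would be your keyed kafka stream
> val keyedStream = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))
>
> // keep a running count per key
> def updateCount(key: String, value: Option[Long], state: State[Long]): (String, Long) = {
>   val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0L)
>   state.update(newCount)
>   (key, newCount)
> }
>
> val spec = StateSpec.function(updateCount _).timeout(Minutes(24 * 60))
> val counts = keyedStream.mapWithState(spec)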
>
> and is there any other way to store and aggregate this data outside of
> spark?  I get a bit nervous when I see people treat spark/streaming like an
> in-memory database.
>
> perhaps redis or another type of in-memory store is more appropriate.  or
> just write to long-term storage using parquet.
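>
> e.g. something like this to flush each 30-minute window out to parquet
> (rough sketch; the paths, column names, and the windowed stream are
> placeholders):
>
> import org.apache.spark.sql.SQLContext
>
> // windowed: DStream[(String, Long)] holding the 30-min aggregates
> windowed.foreachRDD { (rdd, time) =>
>   if (!rdd.isEmpty()) {
>     val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
>     import sqlContext.implicits._
>     rdd.toDF("key", "count")
>       .write.mode("append")
>       .parquet(s"hdfs:///data/aggregates/window_ts=${time.milliseconds}")
>   }
> }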
>
> if this is a lot of data, you may want to use approximate probabilistic
> data structures like CountMin Sketch or HyperLogLog.  here are some
> relevant links with more info - including how to use these with redis
> (plus a quick sketch after the links):
>
>
> https://www.slideshare.net/cfregly/advanced-apache-spark-meetup-approximations-and-probabilistic-data-structures-jan-28-2016-galvanize
>
>
> https://github.com/fluxcapacitor/pipeline/tree/master/myapps/streaming/src/main/scala/com/advancedspark/streaming/rating/approx
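>
> side note: if all you need is approximate distinct counts, the built-in
> countApproxDistinct() on RDDs is already HyperLogLog-based.  rough sketch,
> assuming a DStream[String] of user ids called userIds:
>
> import org.apache.spark.streaming.Minutes
>
> // approximate distinct users per 30-minute window, ~1% relative error
> userIds
>   .window(Minutes(30), Minutes(30))
>   .foreachRDD { rdd =>
>     val approxUsers = rdd.countApproxDistinct(relativeSD = 0.01)
>     println(s"approx distinct users in window: $approxUsers")
>   }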
>
> you can then set up a cron (or airflow) spark job to do the compute and
> aggregate against either redis or long-term storage.
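>
> the batch side can be a plain spark job that cron/airflow simply
> spark-submits on a schedule, e.g. (sketch; paths and column names are
> placeholders):
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.functions.sum
>
> object DailyAggregate {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("daily-aggregate"))
>     val sqlContext = new SQLContext(sc)
>
>     // roll the streamed 30-min aggregates up into one daily view
>     sqlContext.read.parquet("hdfs:///data/aggregates/")
>       .groupBy("key")
>       .agg(sum("count").as("total"))
>       .write.mode("overwrite")
>       .parquet("hdfs:///data/daily/")
>
>     sc.stop()
>   }
> }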
>
> this reference pipeline contains the latest airflow workflow scheduler:
> https://github.com/fluxcapacitor/pipeline/wiki
>
> my advice with spark streaming is to get the data out of spark streaming
> as quickly as possible - and into a more durable format better suited for
> aggregation and compute.
>
> this greatly simplifies your operational concerns, in my
> opinion.
>
> good question.  very common use case.
>
> On Feb 21, 2016, at 12:22 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> w.r.t. cleaner TTL, please see:
>
> [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0
>
> FYI
>
> On Sun, Feb 21, 2016 at 4:16 AM, Gerard Maas <gerard.m...@gmail.com>
> wrote:
>
>>
>> It sounds like another window operation on top of the 30-min window will
>> achieve the desired objective.
>> Just keep in mind that you'll need to set the cleaner TTL
>> (spark.cleaner.ttl) to a long enough value, and you will require enough
>> resources (mem & disk) to keep the required data.
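>>
>> For example (untested sketch; the stream names are placeholders, and the
>> TTL must comfortably exceed your longest window):
>>
>> import org.apache.spark.SparkConf
>> import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
>>
>> val conf = new SparkConf()
>>   .setAppName("windowed-aggregation")
>>   .set("spark.cleaner.ttl", (26 * 3600).toString)  // seconds; > 24h of windows
>>
>> val ssc = new StreamingContext(conf, Seconds(5))
>>
>> // keyed: DStream[(String, Long)] built from the filtered kafka stream
>> val halfHourly = keyed.reduceByKeyAndWindow(_ + _, Minutes(30), Minutes(30))
>> // a second, day-long window on top of the 30-min aggregates
>> val daily = halfHourly.reduceByKeyAndWindow(_ + _, Minutes(24 * 60), Minutes(30))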
>>
>> -kr, Gerard.
>>
>> On Sun, Feb 21, 2016 at 12:54 PM, Jatin Kumar <
>> jku...@rocketfuelinc.com.invalid> wrote:
>>
>>> Hello Spark users,
>>>
>>> I have to aggregate messages from kafka and at some fixed interval (say
>>> every half hour) update a memory-persisted RDD and run some computation.
>>> This computation uses the last one day of data. The steps are:
>>>
>>> - Read from the realtime Kafka topic X in spark streaming batches of 5
>>> seconds
>>> - Filter the above DStream messages and keep some of them
>>> - Create windows of 30 minutes on the above DStream and aggregate by key
>>> - Merge this 30-minute RDD with a memory-persisted RDD, say combinedRdd
>>> - Maintain the last N such RDDs in a deque, persisting them on disk. While
>>> adding a new RDD, subtract the oldest RDD from the combinedRdd.
>>> - As a final step, consider the last N such windows (of 30 minutes each)
>>> and do the final aggregation (see the rough sketch below)
>>>
>>> Does the above way of using spark streaming look reasonable? Is there a
>>> better way of doing this?
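>>>
>>> Rough, simplified sketch of the streaming side (broker/topic names are
>>> placeholders, and isInteresting/extractKey stand in for my actual filter
>>> and key-extraction logic):
>>>
>>> import kafka.serializer.StringDecoder
>>> import org.apache.spark.SparkConf
>>> import org.apache.spark.storage.StorageLevel
>>> import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
>>> import org.apache.spark.streaming.kafka.KafkaUtils
>>>
>>> val ssc = new StreamingContext(new SparkConf().setAppName("agg"), Seconds(5))
>>>
>>> // 1. read from Kafka topic X in 5-second batches
>>> val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
>>> val raw = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
>>>   ssc, kafkaParams, Set("X"))
>>>
>>> // 2. filter the messages and key them (placeholder helpers)
>>> val keyed = raw.map(_._2).filter(isInteresting).map(m => (extractKey(m), 1L))
>>>
>>> // 3. aggregate by key over 30-minute windows
>>> val halfHourly = keyed.reduceByKeyAndWindow(_ + _, Minutes(30), Minutes(30))
>>>   .persist(StorageLevel.MEMORY_AND_DISK)
>>>
>>> // 4-6. each 30-min RDD is then merged into combinedRdd and pushed onto a
>>> //      deque of the last N RDDs inside foreachRDD, dropping the oldest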
>>>
>>> --
>>> Thanks
>>> Jatin
>>>
>>>
>>
>
