w.r.t. the new mapWithState(), there have been some bug fixes since the release of 1.6.0, e.g.

SPARK-13121: java mapWithState mishandles scala Option

Looks like the 1.6.1 RC should come out next week.

FYI
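For anyone who hasn't switched over yet, here is a rough sketch of what the 1.6 mapWithState() API looks like in Scala. The socket source, key type, and running count are just placeholders for illustration, not anyone's actual pipeline:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}

object MapWithStateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("map-with-state-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/map-with-state-checkpoint") // mapWithState needs a checkpoint dir

    // Placeholder source; in the use case below this would be the filtered Kafka DStream.
    val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

    // (key, new value if any, running state) => record emitted downstream
    def updateCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
      if (!state.isTimingOut()) state.update(newCount) // cannot update a key that is timing out
      (key, newCount)
    }

    val counts = events.mapWithState(
      StateSpec.function(updateCount _).timeout(Minutes(30)))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}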
On Sun, Feb 21, 2016 at 10:47 AM, Chris Fregly <ch...@fregly.com> wrote:

> good catch on the cleaner.ttl
>
> @jatin- when you say "memory-persisted RDD", what do you mean exactly? and
> how much data are you talking about? remember that spark can evict these
> memory-persisted RDDs at any time. they can be recovered from Kafka, but
> this is not a good situation to be in.
>
> also, is this spark 1.6 with the new mapWithState() or the old
> updateStateByKey()? you definitely want the newer 1.6 mapWithState().
>
> and is there any other way to store and aggregate this data outside of
> spark? I get a bit nervous when I see people treat spark/streaming like an
> in-memory database.
>
> perhaps redis or another type of in-memory store is more appropriate. or
> just write to long-term storage using parquet.
>
> if this is a lot of data, you may want to use approximate probabilistic
> data structures like CountMin Sketch or HyperLogLog. here are some
> relevant links with more info - including how to use these with redis:
>
> https://www.slideshare.net/cfregly/advanced-apache-spark-meetup-approximations-and-probabilistic-data-structures-jan-28-2016-galvanize
>
> https://github.com/fluxcapacitor/pipeline/tree/master/myapps/streaming/src/main/scala/com/advancedspark/streaming/rating/approx
>
> you can then set up a cron (or airflow) spark job to do the compute and
> aggregate against either redis or long-term storage.
>
> this reference pipeline contains the latest airflow workflow scheduler:
> https://github.com/fluxcapacitor/pipeline/wiki
>
> my advice with spark streaming is to get the data out of spark streaming
> as quickly as possible - and into a more durable format more suitable for
> aggregation and compute.
>
> this greatly simplifies your operational concerns, in my opinion.
>
> good question. very common use case.
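As a rough illustration of the HyperLogLog-in-redis idea above, something along these lines works with the Jedis client; the host, port, key prefix, and the shape of the DStream are all placeholders (see the links above for fuller examples):

import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.Jedis

// Track approximate distinct users per key in redis HyperLogLog structures.
// `userEvents` is assumed to be a DStream of (key, userId) pairs.
def trackDistinctUsers(userEvents: DStream[(String, String)]): Unit = {
  userEvents.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val jedis = new Jedis("localhost", 6379)   // placeholder host/port
      partition.foreach { case (key, userId) =>
        jedis.pfadd(s"hll:$key", userId)         // PFADD: add the element to the HLL
      }
      jedis.close()
    }
  }
}

// A separate (cron/airflow) job can later read the approximate cardinality:
//   new Jedis("localhost", 6379).pfcount("hll:some-key")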
> On Feb 21, 2016, at 12:22 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> w.r.t. cleaner TTL, please see:
>
> [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0
>
> FYI
>
> On Sun, Feb 21, 2016 at 4:16 AM, Gerard Maas <gerard.m...@gmail.com> wrote:
>
>> It sounds like another window operation on top of the 30-min window will
>> achieve the desired objective.
>> Just keep in mind that you'll need to set the cleaner TTL
>> (spark.cleaner.ttl) to a long enough value, and you will require enough
>> resources (mem & disk) to keep the required data.
>>
>> -kr, Gerard.
>>
>> On Sun, Feb 21, 2016 at 12:54 PM, Jatin Kumar <
>> jku...@rocketfuelinc.com.invalid> wrote:
>>
>>> Hello Spark users,
>>>
>>> I have to aggregate messages from kafka and, at some fixed interval (say
>>> every half hour), update a memory-persisted RDD and run some computation.
>>> This computation uses the last one day of data. The steps are:
>>>
>>> - Read from the realtime Kafka topic X in spark streaming batches of 5
>>> seconds
>>> - Filter the above DStream messages and keep some of them
>>> - Create windows of 30 minutes on the above DStream and aggregate by key
>>> - Merge this 30-minute RDD with a memory-persisted RDD, say combinedRdd
>>> - Maintain the last N such RDDs in a deque, persisting them on disk.
>>> While adding a new RDD, subtract the oldest RDD from combinedRdd.
>>> - As a final step, consider the last N such windows (of 30 minutes each)
>>> and do the final aggregation
>>>
>>> Does the above way of using spark streaming look reasonable? Is there a
>>> better way of doing the above?
>>>
>>> --
>>> Thanks
>>> Jatin
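To make Gerard's suggestion concrete, here is a rough end-to-end sketch of the pipeline Jatin describes, with a second, day-long window layered on top of the 30-minute one instead of a hand-maintained deque of RDDs. The broker list, topic name, key extraction, and the simple sum aggregation are all placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object WindowedAggregationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed-aggregation-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))         // 5-second batches, as in step 1
    ssc.checkpoint("/tmp/windowed-aggregation-checkpoint")

    // Direct Kafka stream (placeholder broker list; topic "X" from the question).
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("X"))

    // Step 2: keep only the interesting messages, then key them (placeholder logic).
    val keyed = messages
      .filter { case (_, value) => value.nonEmpty }
      .map { case (_, value) => (value.split(",")(0), 1L) }

    // Step 3: 30-minute tumbling window, aggregated by key.
    val halfHourly = keyed.reduceByKeyAndWindow(_ + _, Minutes(30), Minutes(30))

    // Gerard's suggestion: a second window covering the last day, sliding every
    // 30 minutes, instead of merging/subtracting RDDs in a deque by hand.
    val daily = halfHourly.reduceByKeyAndWindow(_ + _, Minutes(24 * 60), Minutes(30))

    daily.foreachRDD { rdd =>
      // final aggregation / persistence would go here
      println(s"daily keys: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}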
