Spark Streaming, external windowing?

2014-07-16 Thread Sargun Dhillon
Does anyone here have a way to do Spark Streaming with external timing
for windows? Right now, Spark Streaming relies on the driver's wall
clock to determine how long each batch read lasts.

We have Kafka and HDFS ingress into our Spark Streaming pipeline, where
events are annotated with the timestamps at which they actually
happened (in real time). We would like to base our windows on those
timestamps, as opposed to driver time.

Does anyone have any ideas how to do this?
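
The closest workaround I can sketch (this is not something Spark
provides out of the box; the Event type and its field names below are
made up, and updateStateByKey needs a checkpoint directory set on the
StreamingContext) is to bucket each event by its embedded timestamp and
aggregate per (key, event-time bucket), bypassing Spark's own
windowing:

import org.apache.spark.streaming.StreamingContext._   // pair DStream ops (Spark 1.x)
import org.apache.spark.streaming.dstream.DStream

// Hypothetical event type; eventTimeMs is the timestamp embedded in the record.
case class Event(key: String, eventTimeMs: Long, value: Long)

def eventTimeCounts(events: DStream[Event],
                    windowMs: Long = 1000L): DStream[((String, Long), Long)] = {
  // Assign each event to the event-time bucket its own timestamp falls in,
  // instead of the wall-clock window the driver happens to be processing.
  val byBucket = events.map { e =>
    val bucketStart = (e.eventTimeMs / windowMs) * windowMs
    ((e.key, bucketStart), e.value)
  }
  // Accumulate per (key, bucket) across micro-batches, so late or
  // out-of-order events still land in the bucket their timestamp implies.
  byBucket.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + newValues.sum)
  }
}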


Stateful RDDs?

2014-07-10 Thread Sargun Dhillon
So, one portion of our Spark Streaming application requires some
state. Our application takes a bunch of application events (i.e.
user_session_started, user_session_ended, etc.), calculates metrics
from them, and writes those metrics to a serving layer (see: Lambda
Architecture). Two related events can be ingested into the streaming
context moments apart, or an indeterminate time apart. Given this, and
the fact that our normal windows emit data every 500 ms to 1 s, with a
step of 500 ms, you can end up with two related pieces of data split
across two windows. To work around this, we use updateStateByKey to
persist state, as opposed to persisting per-key intermediate state in
some external system, since building a system that handles the
complexities of (concurrent, idempotent) updates and also scales is
non-trivial.
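
For context, the updateStateByKey piece looks roughly like the sketch
below (SessionEvent and SessionState are simplified stand-ins for our
real types, and a checkpoint directory is assumed to be set on the
StreamingContext):

import org.apache.spark.streaming.StreamingContext._   // pair DStream ops (Spark 1.x)
import org.apache.spark.streaming.dstream.DStream

// Simplified stand-ins for our real event and state types.
case class SessionEvent(sessionId: String, kind: String, timestampMs: Long)
case class SessionState(startMs: Option[Long] = None, endMs: Option[Long] = None)

def sessionState(events: DStream[SessionEvent]): DStream[(String, SessionState)] =
  events.map(e => (e.sessionId, e)).updateStateByKey[SessionState] {
    (newEvents: Seq[SessionEvent], state: Option[SessionState]) =>
      // Fold new events into whatever partial state we already have, so two
      // related events that arrive in different windows still meet up here.
      val merged = newEvents.foldLeft(state.getOrElse(SessionState())) { (s, e) =>
        e.kind match {
          case "user_session_started" => s.copy(startMs = Some(e.timestampMs))
          case "user_session_ended"   => s.copy(endMs = Some(e.timestampMs))
          case _                      => s
        }
      }
      Some(merged)  // keep the (possibly still partial) session around
  }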

The questions I have around this: even with a highly partitionable
dataset, what are the upper scalability limits of stateful DStreams? If
I have a dataset starting at around 10 million keys, growing at that
rate monthly, what complexities am I going to run into? Most of the
data is cold. I realize that I can remove data from the stateful
DStream by sending (key, null) to it, but there is not necessarily an
easy way of knowing when the last update for a key is coming in (unless
there is some way in Spark of saying "wait N windows, then send this
tuple", or "wait until all keys in the upstream DStreams smaller than M
are processed before sending such a tuple").

Additionally, given that my data is partitionable by datetime, does it
make sense to have a custom datetime partitioner and just persist the
DStream to disk, to ensure that its RDDs are only pulled off of disk
(into memory) occasionally? What is the cost of keeping a bunch of
relatively large, stateful RDDs around on disk? Does Spark have to load
/ deserialize the entire RDD to update one key?
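
To make the eviction question concrete, the kind of thing we would end
up writing is a TTL measured in batches, roughly like the sketch below
(the Tracked wrapper and the idle-batch threshold are made up for
illustration): track how many batches a key has gone without new data
and return None to drop it.

import scala.reflect.ClassTag
import org.apache.spark.streaming.StreamingContext._   // pair DStream ops (Spark 1.x)
import org.apache.spark.streaming.dstream.DStream

// Wrap the real state with a counter of how many batches it has sat idle.
case class Tracked[S](value: S, idleBatches: Int)

def withTtl[S: ClassTag](updates: DStream[(String, S)],
                         maxIdleBatches: Int = 20): DStream[(String, Tracked[S])] =
  updates.updateStateByKey[Tracked[S]] { (newValues: Seq[S], state: Option[Tracked[S]]) =>
    newValues.lastOption match {
      case Some(v) => Some(Tracked(v, 0))  // new data this batch: keep the latest value, reset the counter
      case None =>
        state.flatMap { t =>
          if (t.idleBatches + 1 >= maxIdleBatches) None   // evict the cold key entirely
          else Some(t.copy(idleBatches = t.idleBatches + 1))
        }
    }
  }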