The TwitterPopularTags example works great: the Twitter firehose keeps
messages pretty well in order by timestamp, and so to get the most popular
hashtags over the last 60 seconds, reduceByKeyAndWindow works well.

My stream pulls Apache weblogs from Kafka, so it's not as simple:
messages can arrive out of order, and if I take down my streaming
process and start it up again, the Kafka consumer offset stays where it
was, so on restart I might be consuming 10x my normal rate in order to
catch up to the current time.  In this case, reduceByKeyAndWindow won't work.

I'd like my bucket size to be 5 seconds, and I'd like to do the same thing
TwitterPopularTags is doing, except instead of hashtags I have "row types",
and instead of aggregating by 60 seconds of clock time I'd like to aggregate
over all rows of that row type with a timestamp within 60 seconds of the
current time.
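Concretely, the aggregation I'm after might look like this outside of Spark (plain Python just to pin down the semantics; the helper names are my own, not any Spark API):

```python
from collections import defaultdict

BUCKET_SECONDS = 5   # my desired bucket size
WINDOW_SECONDS = 60  # window measured in log (event) time, not clock time

def bucket_of(ts):
    """Assign an event timestamp (epoch seconds) to its 5-second bucket."""
    return ts - (ts % BUCKET_SECONDS)

def windowed_counts(rows):
    """rows: iterable of (row_type, event_ts) pairs.  Returns counts per
    row type over the 60 seconds of *event time* ending at the newest
    timestamp seen, regardless of arrival order."""
    buckets = defaultdict(int)          # (row_type, bucket_start) -> count
    newest = 0
    for row_type, ts in rows:
        buckets[(row_type, bucket_of(ts))] += 1
        newest = max(newest, ts)
    cutoff = newest - WINDOW_SECONDS
    totals = defaultdict(int)
    for (row_type, bucket_start), n in buckets.items():
        if bucket_start > cutoff:       # keep only buckets inside the window
            totals[row_type] += n
    return dict(totals)
```

The point is that rows are keyed by their own timestamp's bucket, so out-of-order arrival and catch-up replays land in the right buckets instead of being smeared across clock-time windows.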

My thinking is to maintain state in an RDD and update and persist it on
each 2-second pass, but this also seems like it could get messy.  Any
thoughts or examples that might help me?
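The state-update step I have in mind could be sketched like this (again plain Python as a stand-in for something like Spark's updateStateByKey; the function and parameter names are just illustrative):

```python
WINDOW_SECONDS = 60  # window measured in log (event) time

def update_state(state, new_bucket_counts, current_max_ts):
    """Merge one micro-batch's per-bucket counts into the running state,
    then expire buckets that have fallen out of the 60-second event-time
    window.  Both dicts map (row_type, bucket_start) -> count."""
    merged = dict(state)
    for key, n in new_bucket_counts.items():
        merged[key] = merged.get(key, 0) + n
    cutoff = current_max_ts - WINDOW_SECONDS
    # Drop buckets whose start time is at or before the cutoff.
    return {k: v for k, v in merged.items() if k[1] > cutoff}
```

Each pass would fold the new batch's bucket counts into the persisted state and prune expired buckets, so the state stays bounded at roughly (window / bucket size) buckets per row type.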



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-tp21405.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
