The TwitterPopularTags example works great: the Twitter firehose keeps messages pretty well in order by timestamp, so reduceByKeyAndWindow is all you need to get the most popular hashtags over the last 60 seconds.
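For reference, the heart of that example is roughly the following, where hashTags is a DStream[String] of extracted hashtags and Seconds comes from org.apache.spark.streaming:

// Roughly the relevant lines from TwitterPopularTags: count each hashtag
// over a sliding 60-second window of wall-clock time, then sort by count.
val topCounts60 = hashTags.map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60))
  .map { case (topic, count) => (count, topic) }
  .transform(_.sortByKey(false))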
My stream pulls Apache weblogs from Kafka, so it's not as simple: messages can arrive out of order, and if I take down my streaming process and start it up again, the Kafka offset stays in place, and now I might be consuming at 10x my previous rate in order to catch up to the current time. In this case, reduceByKeyAndWindow won't work.

I'd like my bucket size to be 5 seconds, and I'd like to do the same thing TwitterPopularTags is doing, except instead of hashtags I have "row types", and instead of aggregating over 60 seconds of clock time I'd like to aggregate over all rows of a given row type whose timestamp is within 60 seconds of the current time.

My thinking is to maintain state in an RDD and update and persist it with each 2-second pass, but this also seems like it could get messy. Something like the sketch below is what I have in mind. Any thoughts or examples that might help me?
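Here's a minimal sketch of the stateful approach, using updateStateByKey to keep, per row type, the event timestamps seen so far and prune anything more than 60 seconds older than the newest timestamp seen, i.e. "current time" in event time rather than clock time. Everything here is a placeholder: parsedLogs stands in for my actual Kafka stream, and the comma-split parsing is fake.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object EventTimeWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EventTimeWindowSketch")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second buckets
    ssc.checkpoint("/tmp/event-time-window")         // required by updateStateByKey

    // Placeholder source: in my job this would be KafkaUtils.createStream(...)
    // parsed into (rowType, eventTimestampMillis) pairs.
    val parsedLogs: DStream[(String, Long)] =
      ssc.socketTextStream("localhost", 9999).map { line =>
        val Array(rowType, ts) = line.split(",") // fake parsing
        (rowType, ts.toLong)
      }

    // State per row type: all event timestamps currently inside the window.
    // Each batch, append the new timestamps and drop any more than 60s older
    // than the newest one seen -- "current time" in event time, so a restart
    // that replays a Kafka backlog doesn't inflate the window.
    val counts = parsedLogs.updateStateByKey[Seq[Long]] {
      (newStamps: Seq[Long], state: Option[Seq[Long]]) =>
        val all = state.getOrElse(Seq.empty[Long]) ++ newStamps
        if (all.isEmpty) None // returning None drops the key entirely
        else Some(all.filter(_ >= all.max - 60 * 1000L))
    }.mapValues(_.size) // rows of this type in the last 60s of event time

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

One wrinkle I already see: the horizon here is per key, so a quiet row type keeps its stale timestamps until new rows of that type arrive; a global watermark across all keys would be cleaner, which is part of why this feels like it could get messy.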
