I'm not quite sure if I understood it correctly, but can you not create a key from the timestamps and do the reduceByKeyAndWindow over it?
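A rough, untested sketch of what I mean (parseTimestamp and parseRowType are hypothetical placeholders for whatever pulls the log timestamp and row type out of your weblog lines; the rest is the plain Spark Streaming API): round the event timestamp down to a 5-second bucket, make (bucket, rowType) the key, and let reduceByKeyAndWindow count over that composite key instead of over arrival time alone.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits on older Spark
import org.apache.spark.streaming.dstream.DStream

// Hypothetical parsers -- replace with whatever extracts these fields from your log format.
def parseTimestamp(line: String): Long = ???   // event time from the log line, as epoch millis
def parseRowType(line: String): String = ???   // e.g. status code, URL class, ...

def countsByEventTime(lines: DStream[String]): DStream[((Long, String), Int)] = {
  // Round the log timestamp down to a 5-second bucket and use (bucket, rowType)
  // as the key, so the reduce groups rows by event time rather than by when they
  // happened to arrive from Kafka.
  val keyed = lines.map { line =>
    val bucket = (parseTimestamp(line) / 5000L) * 5000L
    ((bucket, parseRowType(line)), 1)
  }
  // 60-second window, sliding every 5 seconds, reduced over the composite key.
  keyed.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(5))
}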
Thanks
Best Regards

On Wed, Jan 28, 2015 at 10:24 PM, YaoPau <jonrgr...@gmail.com> wrote:

> The TwitterPopularTags example works great: the Twitter firehose keeps
> messages pretty well in order by timestamp, and so to get the most popular
> hashtags over the last 60 seconds, reduceByKeyAndWindow works well.
>
> My stream pulls Apache weblogs from Kafka, so it's not as simple: messages
> can arrive out of order, and if I take down my streaming process and start
> it up again, the Kafka offset stays in place and I might now be consuming
> 10x what I was consuming before in order to catch up to the current time.
> In this case, reduceByKeyAndWindow won't work.
>
> I'd like my bucket size to be 5 seconds, and I'd like to do the same thing
> TwitterPopularTags is doing, except instead of hashtags I have "row types",
> and instead of aggregating over 60 seconds of clock time I'd like to
> aggregate over all rows of that row type with a timestamp within 60 seconds
> of the current time.
>
> My thinking is to maintain state in an RDD and update it and persist it
> with each 2-second pass, but this also seems like it could get messy. Any
> thoughts or examples that might help me?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-tp21405.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
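For the "maintain state and update it with each pass" idea in the quoted message, one rough sketch (same hypothetical parse helpers as above; assumes timestamps in epoch millis and that the StreamingContext has a checkpoint directory set, which updateStateByKey requires) could be:

def eventTimeCounts(lines: DStream[String]): DStream[(String, Int)] = {
  val events: DStream[(String, Long)] =
    lines.map(line => (parseRowType(line), parseTimestamp(line)))

  // Keep, per row type, the event timestamps seen so far, dropping anything more
  // than 60 seconds older than the newest timestamp seen for that key.
  val recent = events.updateStateByKey[Seq[Long]] { (newTs: Seq[Long], state: Option[Seq[Long]]) =>
    val all = state.getOrElse(Seq.empty[Long]) ++ newTs
    if (all.isEmpty) None else Some(all.filter(_ >= all.max - 60000L))
  }

  // Count of rows of each type whose log timestamp falls within 60 seconds of the
  // newest event seen for that row type.
  recent.mapValues(_.size)
}

That keeps the state bounded by pruning old timestamps on every batch, but whether it stays manageable depends on how many rows per row type land inside the 60-second window.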