I'm not quite sure if I understood it correctly, but could you not create a
key from the timestamps and do the reduceByKeyAndWindow over it?
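
Something roughly like this, as a minimal sketch -- here `lines`, `parseLine`
and the field positions are placeholders for whatever your Kafka DStream and
weblog format actually look like:

  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._  // pair DStream ops on older Spark versions

  case class LogEvent(rowType: String, timestampSec: Long)

  // Hypothetical parser: pull the row type and epoch-second timestamp out of a raw log line.
  def parseLine(line: String): LogEvent = {
    val fields = line.split("\t")            // adjust to the real log format
    LogEvent(fields(0), fields(1).toLong)
  }

  val bucketSize = 5L                         // 5-second buckets based on the *log* timestamp

  // `lines` is assumed to be the DStream[String] you already get from Kafka.
  val countsPerTypeAndBucket = lines
    .map(parseLine)
    // Key by (row type, timestamp bucket), so the grouping follows log time
    // rather than the wall-clock time at which the record happened to arrive.
    .map(e => ((e.rowType, e.timestampSec / bucketSize * bucketSize), 1L))
    .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(60))

  countsPerTypeAndBucket.print()

Since the bucket is part of the key, records that arrive late or in a catch-up
burst still land in the bucket their own timestamp points to; you would then
filter or rank the buckets relative to the newest timestamp you have seen,
rather than relying on the window's clock time.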

Thanks
Best Regards

On Wed, Jan 28, 2015 at 10:24 PM, YaoPau <jonrgr...@gmail.com> wrote:

> The TwitterPopularTags example works great: the Twitter firehose keeps
> messages pretty well in order by timestamp, and so to get the most popular
> hashtags over the last 60 seconds, reduceByKeyAndWindow works well.
>
> My stream pulls Apache weblogs from Kafka, and so it's not as simple:
> messages can pass through out-of-order, and if I take down my streaming
> process and start it up again, the Kafka index stays in place and now I
> might be consuming 10x of what I was consuming before in order to catch up
> to the current time.  In this case, reduceByKeyAndWindow won't work.
>
> I'd like my bucket size to be 5 seconds, and I'd like to do the same thing
> TwitterPopularTags is doing, except instead of hashtags I have "row types",
> and instead of aggregating by 60 seconds of clock time I'd like to aggregate
> over all rows of that row type with a timestamp within 60 seconds of the
> current time.
>
> My thinking is to maintain state in an RDD and update it and persist it with
> each 2-second pass, but this also seems like it could get messy.  Any
> thoughts or examples that might help me?
>
