Hi,

My Spark app needs to run a sliding window over a time series dataset (I'm not using Spark Streaming) and then run different types of aggregations on a per-window basis. Right now I'm using groupByKey(), which gives me an Iterable for each window.
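A simplified sketch of what I'm doing now (the record shape, window parameters, and names here are placeholders, not my real schema):

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SlidingWindowSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sliding-window").setMaster("local[*]"))

    val windowMs = 60 * 1000L  // window length
    val stepMs   = 10 * 1000L  // slide interval

    // placeholder time series: (timestampMillis, value)
    val events: RDD[(Long, Double)] = sc.parallelize(Seq(
      (0L, 1.0), (15000L, 2.0), (42000L, 3.0), (75000L, 4.0)))

    // Assign each record to every sliding window that covers it,
    // then collect each window's values with groupByKey().
    val windows: RDD[(Long, Iterable[Double])] = events
      .flatMap { case (ts, v) =>
        val last  = ts / stepMs                                // last window index covering ts
        val first = math.max(0L, (ts - windowMs) / stepMs + 1) // first window index covering ts
        (first to last).map(w => (w, v))
      }
      .groupByKey()  // one in-memory Iterable per window

    // aggregations then run over plain Scala Iterables, not RDDs
    windows.mapValues(vs => (vs.max, vs.sum / vs.size))
      .collect().foreach(println)

    sc.stop()
  }
}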
There are a few concerns I have with this approach:

1. groupByKey() could fail for a key whose values don't fit in memory, since every value for a window is pulled into a single Iterable.
2. I'd like to run aggregations like max() and mean() on each of the groups; it would be nice to have RDD functionality at that point instead of plain Iterables.
3. I can't use reduceByKey() or aggregateByKey(), as some of my aggregations need a view of the entire window.

The only other way I could think of is splitting the keyed RDD into multiple RDDs, one per window (sketched below). Is this a sensible approach? Or is there a better way of going about this?
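To make the per-window split concrete, here's roughly what I have in mind (a sketch reusing the placeholder names from above):

// the keyed records from the sketch above, before the groupByKey()
val keyed: RDD[(Long, Double)] = events.flatMap { case (ts, v) =>
  val last  = ts / stepMs
  val first = math.max(0L, (ts - windowMs) / stepMs + 1)
  (first to last).map(w => (w, v))
}

// one RDD per window, carved out with filter()
val windowIds: Array[Long] = keyed.keys.distinct().collect()
val perWindow: Map[Long, RDD[Double]] = windowIds.map { w =>
  w -> keyed.filter { case (k, _) => k == w }.values
}.toMap

// full RDD functionality per window, e.g. the built-in max() and mean()
perWindow.foreach { case (w, rdd) =>
  println(s"window $w: max=${rdd.max()}, mean=${rdd.mean()}")
}

That gives me RDD semantics per window, but each filter() rescans the full dataset and launches a separate job per window, which is what makes me doubt it scales to many windows.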