Hi,

My Spark app needs to run a sliding window over a time series dataset (I'm not using Spark Streaming) and then run different types of aggregations on a per-window basis. Right now I'm using groupByKey(), which gives me an Iterable for each window.
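A simplified sketch of what I'm doing now (the record shape, window parameters, and names here are placeholders, not my real schema):

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SlidingWindowSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sliding-window").setMaster("local[*]"))

    val windowMs = 60 * 1000L  // window length
    val stepMs   = 10 * 1000L  // slide interval

    // placeholder time series: (timestampMillis, value)
    val events: RDD[(Long, Double)] = sc.parallelize(Seq(
      (0L, 1.0), (15000L, 2.0), (42000L, 3.0), (75000L, 4.0)))

    // Assign each record to every sliding window that covers it,
    // then collect each window's values with groupByKey().
    val windows: RDD[(Long, Iterable[Double])] = events
      .flatMap { case (ts, v) =>
        val last  = ts / stepMs                                // last window index covering ts
        val first = math.max(0L, (ts - windowMs) / stepMs + 1) // first window index covering ts
        (first to last).map(w => (w, v))
      }
      .groupByKey()  // one in-memory Iterable per window

    // aggregations then run over plain Scala Iterables, not RDDs
    windows.mapValues(vs => (vs.max, vs.sum / vs.size))
      .collect().foreach(println)

    sc.stop()
  }
}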
There are a few concerns I have with this approach:

1. groupByKey() could fail for a key whose values don't fit in memory, since every value for a window is pulled into a single Iterable.
2. I'd like to run aggregations like max() and mean() on each of the groups; it would be nice to have RDD functionality at that point instead of plain Iterables.
3. I can't use reduceByKey() or aggregateByKey(), as some of my aggregations need a view of the entire window.

The only other way I could think of is splitting the keyed RDD into multiple RDDs, one per window (sketched below). Is this a sensible approach? Or is there a better way of going about this?
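To make the per-window split concrete, here's roughly what I have in mind (a sketch reusing the placeholder names from above):

// the keyed records from the sketch above, before the groupByKey()
val keyed: RDD[(Long, Double)] = events.flatMap { case (ts, v) =>
  val last  = ts / stepMs
  val first = math.max(0L, (ts - windowMs) / stepMs + 1)
  (first to last).map(w => (w, v))
}

// one RDD per window, carved out with filter()
val windowIds: Array[Long] = keyed.keys.distinct().collect()
val perWindow: Map[Long, RDD[Double]] = windowIds.map { w =>
  w -> keyed.filter { case (k, _) => k == w }.values
}.toMap

// full RDD functionality per window, e.g. the built-in max() and mean()
perWindow.foreach { case (w, rdd) =>
  println(s"window $w: max=${rdd.max()}, mean=${rdd.mean()}")
}

That gives me RDD semantics per window, but each filter() rescans the full dataset and launches a separate job per window, which is what makes me doubt it scales to many windows.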