I've been trying to figure out how to use Spark to do a simple aggregation without repartitioning and without materializing fully instantiated intermediate RDDs, and it seems virtually impossible.
I've now gone as far as writing my own single-partition RDD that wraps an Iterator[String] and calling aggregate() on it. Before any of my aggregation code executes, the entire Iterator is unwound and multiple partitions are created to be handed to my aggregation. The task execution call stack includes:

  ShuffleMapTask.runTask
  SortShuffleWriter.write
  ExternalSorter.insertAll
  ...

which iterates over my entire RDD, repartitions it, and collects it into spill files. How do I prevent this from happening? There's no need for it.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-stop-the-automatic-partitioning-of-my-RDD-tp20732.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
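For reference, the behaviour I'm after is just a single streaming pass over the iterator, which can be sketched in plain Scala without Spark at all (the line/character counting here is a hypothetical stand-in for the real aggregation logic):

```scala
// A single-pass aggregation over an Iterator[String]: no shuffle, no
// spill files, no intermediate collection -- each element is consumed
// exactly once as the fold advances.
object SinglePassAggregate {
  // Accumulates (lineCount, charCount) in one pass over the iterator.
  def aggregate(lines: Iterator[String]): (Long, Long) =
    lines.foldLeft((0L, 0L)) { case ((n, chars), line) =>
      (n + 1, chars + line.length)
    }

  def main(args: Array[String]): Unit = {
    val it = Iterator("foo", "barbaz", "qux")
    val (n, chars) = aggregate(it)
    println(s"$n lines, $chars chars")
  }
}
```

This consumes the iterator lazily and keeps only the accumulator in memory, which is what I'd expect aggregate() on a one-partition RDD to do.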