I've been trying to figure out how to use Spark to do a simple aggregation without repartitioning and without materializing fully instantiated intermediate RDDs, and it seems virtually impossible.
I've now gone as far as writing my own single-partition RDD that wraps an Iterator[String] and calling aggregate() on it. Before any of my aggregation code executes, the entire Iterator is unwound and multiple partitions are created to be handed to my aggregation. The task execution call stack includes:

  ShuffleMapTask.runTask
  SortShuffleWriter.write
  ExternalSorter.insertAll
  ...

which iterates over my entire RDD, repartitions it, and collects it into spill files. How do I prevent this from happening? There's no need for it.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-stop-the-automatic-partitioning-of-my-RDD-tp20732.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
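For reference, the behaviour I'm after is just a single streaming pass over the iterator, which can be sketched in plain Scala without Spark at all (the line/character counting here is a hypothetical stand-in for the real aggregation logic):

```scala
// A single-pass aggregation over an Iterator[String]: no shuffle, no
// spill files, no intermediate collection -- each element is consumed
// exactly once as the fold advances.
object SinglePassAggregate {
  // Accumulates (lineCount, charCount) in one pass over the iterator.
  def aggregate(lines: Iterator[String]): (Long, Long) =
    lines.foldLeft((0L, 0L)) { case ((n, chars), line) =>
      (n + 1, chars + line.length)
    }

  def main(args: Array[String]): Unit = {
    val it = Iterator("foo", "barbaz", "qux")
    val (n, chars) = aggregate(it)
    println(s"$n lines, $chars chars")
  }
}
```

This consumes the iterator lazily and keeps only the accumulator in memory, which is what I'd expect aggregate() on a one-partition RDD to do.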