Note: I’m assuming you were going for the size of your RDD, meaning that in the ‘collect’ alternative you would call size() on the result right after the collect().
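For concreteness, here is a minimal sketch of the two alternatives (written against the Spark 1.x Java API from your snippet; stringJavaRDD stands for the RDD inside your foreachRDD callback):

    // Alternative 1: collect() ships every element to the driver and
    // materializes the whole RDD in driver memory; size() then merely
    // measures the resulting local List<String>.
    long sizeViaCollect = stringJavaRDD.collect().size();

    // Alternative 2: count() counts each partition on the executors and
    // ships only one long per partition back to the driver, which sums them.
    long sizeViaCount = stringJavaRDD.count();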
If you were simply trying to materialize your RDD, Sean’s answer is more complete.

—
FG

On Thu, Feb 26, 2015 at 2:33 PM, Emre Sevinc <emre.sev...@gmail.com> wrote:

> Francois,
>
> Thank you for quickly verifying.
>
> Kind regards,
>
> Emre Sevinç
>
> On Thu, Feb 26, 2015 at 2:32 PM, <francois.garil...@typesafe.com> wrote:
>
>> The short answer:
>> count(), as the sum can be partially aggregated on the mappers.
>>
>> The long answer:
>>
>> http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html
>>
>> —
>> FG
>>
>> On Thu, Feb 26, 2015 at 2:28 PM, Emre Sevinc <emre.sev...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I have a piece of code that forces the materialization of RDDs in my
>>> Spark Streaming program, and I'm trying to understand which method is
>>> faster and consumes less memory:
>>>
>>> javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>>   @Override
>>>   public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
>>>     // stringJavaRDD.collect();
>>>     // or count?
>>>     // stringJavaRDD.count();
>>>     return null;
>>>   }
>>> });
>>>
>>> I've checked the source code of Spark at
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
>>> and see that collect() is defined as:
>>>
>>>   def collect(): Array[T] = {
>>>     val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
>>>     Array.concat(results: _*)
>>>   }
>>>
>>> and count() is defined as:
>>>
>>>   def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
>>>
>>> I therefore think that calling count() is faster and/or consumes less
>>> memory, but I wanted to be sure.
>>>
>>> Does anyone care to comment?
>>>
>>> --
>>> Emre Sevinç
>>> http://www.bigindustries.be/
>>>
>>
>
> --
> Emre Sevinc
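One more detail that backs this up: per partition, collect() runs (iter: Iterator[T]) => iter.toArray, which buffers every element, while count() runs Utils.getIteratorSize, which only walks the iterator. Roughly (a Java sketch of that counting idea, not Spark's actual Scala code):

    // Walk the iterator and count; no element is ever retained,
    // so per-partition memory use stays constant.
    static long iteratorSize(java.util.Iterator<String> it) {
      long n = 0L;
      while (it.hasNext()) {
        it.next(); // consume and discard the element
        n++;
      }
      return n;
    }

So count() does strictly less buffering on the executors and never moves the data to the driver at all.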