Francois,

Thank you for verifying so quickly.
Kind regards,

Emre Sevinç


On Thu, Feb 26, 2015 at 2:32 PM, <francois.garil...@typesafe.com> wrote:

> The short answer: count(), as the sum can be partially aggregated on the
> mappers.
>
> The long answer:
> http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html
>
> —
> FG
>
>
> On Thu, Feb 26, 2015 at 2:28 PM, Emre Sevinc <emre.sev...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I have a piece of code that forces the materialization of RDDs in my
>> Spark Streaming program, and I'm trying to understand which method is
>> faster and consumes less memory:
>>
>>     javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>       @Override
>>       public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
>>         //stringJavaRDD.collect();
>>         // or count?
>>         //stringJavaRDD.count();
>>         return null;
>>       }
>>     });
>>
>> I've checked the Spark source code at
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
>> and see that collect() is defined as:
>>
>>     def collect(): Array[T] = {
>>       val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
>>       Array.concat(results: _*)
>>     }
>>
>> and count() is defined as:
>>
>>     def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
>>
>> Therefore I think calling count() is faster and/or consumes less
>> memory, but I wanted to be sure.
>>
>> Does anyone care to comment?
>>
>> --
>> Emre Sevinç
>> http://www.bigindustries.be/

--
Emre Sevinc
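
For context, here is a minimal, self-contained sketch of the pattern
discussed above, using the same Spark 1.x Java streaming API as the snippet
in the thread. The class name, socket source, host/port, and batch interval
are illustrative assumptions, not taken from the original program:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class ForceMaterialization {
      public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
            .setAppName("force-materialization-sketch")
            .setMaster("local[2]");  // local mode, for illustration only
        JavaStreamingContext jssc =
            new JavaStreamingContext(conf, Durations.seconds(5));

        // Stand-in source; any DStream behaves the same way here.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
          @Override
          public Void call(JavaRDD<String> rdd) throws Exception {
            // count() evaluates each partition on the executors and ships
            // only one Long per partition back to the driver, then sums
            // them. collect() would serialize every element to the driver.
            long n = rdd.count();
            System.out.println("materialized " + n + " records in this batch");
            return null;
          }
        });

        jssc.start();
        jssc.awaitTermination();
      }
    }

As the quoted definitions show, count() runs Utils.getIteratorSize on each
partition and only moves the per-partition sizes to the driver, whereas
collect() concatenates every element there, so count() is the cheaper way
to force materialization.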