Francois,

Thank you for verifying this so quickly.

Kind regards,
Emre Sevinç

On Thu, Feb 26, 2015 at 2:32 PM, <francois.garil...@typesafe.com> wrote:

> The short answer:
> count(), as the sum can be partially aggregated on the mappers.
>
> The long answer:
>
> http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html
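>
> A minimal sketch of the difference, assuming an existing JavaSparkContext
> named sc and a hypothetical input path (not from your actual job):
>
>   // count(): each executor first reduces its own partition to a single
>   // long, so only one number per partition travels back to the driver.
>   JavaRDD<String> lines = sc.textFile("hdfs:///some/large/input");
>   long n = lines.count();
>
>   // collect(): every element of every partition is copied to the driver,
>   // which can exhaust driver memory on a large RDD.
>   // List<String> all = lines.collect();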
>
> —
> FG
>
>
> On Thu, Feb 26, 2015 at 2:28 PM, Emre Sevinc <emre.sev...@gmail.com>
> wrote:
>
>>  Hello,
>>
>> I have a piece of code to force the materialization of RDDs in my Spark
>> Streaming program, and I'm trying to understand which method is faster and
>> uses less memory:
>>
>>   javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>       @Override
>>       public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
>>
>>         //stringJavaRDD.collect();
>>
>>         // or count?
>>
>>         //stringJavaRDD.count();
>>
>>         return null;
>>       }
>>     });
>>
>>
>> I've checked the source code of Spark at
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala,
>> and I see that collect() is defined as:
>>
>>   def collect(): Array[T] = {
>>     val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
>>     Array.concat(results: _*)
>>   }
>>
>> and count() defined as:
>>
>>   def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
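>>
>> As a sketch of the idea behind Utils.getIteratorSize (not the actual Spark
>> source), the iterator is simply walked and a counter incremented, so no
>> per-partition buffer is built, unlike the iter.toArray used by collect():
>>
>>   static long iteratorSize(java.util.Iterator<?> iter) {
>>     long size = 0L;
>>     while (iter.hasNext()) {
>>       iter.next();
>>       size++;
>>     }
>>     return size;
>>   }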
>>
>> Therefore I think calling count() is faster and consumes less memory, since
>> collect() gathers every element into an array on the driver while count()
>> only sums per-partition sizes, but I wanted to be sure.
>>
>> Does anyone care to comment?
>>
>>
>> --
>> Emre Sevinç
>> http://www.bigindustries.be/
>>
>>
>


-- 
Emre Sevinc
