Note: I’m assuming you were going for the size of your RDD, meaning that in the ‘collect’ alternative you would call size() on the result right after the collect().
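For concreteness, here is a minimal sketch of the two alternatives (written against the Spark 1.x Java API from your snippet; stringJavaRDD stands for the RDD inside your foreachRDD callback):

    // Alternative 1: collect() ships every element to the driver and
    // materializes the whole RDD in driver memory; size() then merely
    // measures the resulting local List<String>.
    long sizeViaCollect = stringJavaRDD.collect().size();

    // Alternative 2: count() counts each partition on the executors and
    // ships only one long per partition back to the driver, which sums them.
    long sizeViaCount = stringJavaRDD.count();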
If you were simply trying to materialize your RDD, Sean’s answer is more complete.

—
FG

On Thu, Feb 26, 2015 at 2:33 PM, Emre Sevinc <emre.sev...@gmail.com> wrote:

> Francois,
>
> Thank you for quickly verifying.
>
> Kind regards,
>
> Emre Sevinç
>
> On Thu, Feb 26, 2015 at 2:32 PM, <francois.garil...@typesafe.com> wrote:
>
>> The short answer:
>> count(), as the sum can be partially aggregated on the mappers.
>>
>> The long answer:
>>
>> http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html
>>
>> —
>> FG
>>
>> On Thu, Feb 26, 2015 at 2:28 PM, Emre Sevinc <emre.sev...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I have a piece of code that forces the materialization of RDDs in my
>>> Spark Streaming program, and I'm trying to understand which method is
>>> faster and consumes less memory:
>>>
>>> javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>>   @Override
>>>   public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
>>>     // stringJavaRDD.collect();
>>>     // or count?
>>>     // stringJavaRDD.count();
>>>     return null;
>>>   }
>>> });
>>>
>>> I've checked the source code of Spark at
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
>>> and see that collect() is defined as:
>>>
>>>   def collect(): Array[T] = {
>>>     val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
>>>     Array.concat(results: _*)
>>>   }
>>>
>>> and count() is defined as:
>>>
>>>   def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
>>>
>>> I therefore think that calling count() is faster and/or consumes less
>>> memory, but I wanted to be sure.
>>>
>>> Does anyone care to comment?
>>>
>>> --
>>> Emre Sevinç
>>> http://www.bigindustries.be/
>>>
>>
>
> --
> Emre Sevinc
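One more detail that backs this up: per partition, collect() runs (iter: Iterator[T]) => iter.toArray, which buffers every element, while count() runs Utils.getIteratorSize, which only walks the iterator. Roughly (a Java sketch of that counting idea, not Spark's actual Scala code):

    // Walk the iterator and count; no element is ever retained,
    // so per-partition memory use stays constant.
    static long iteratorSize(java.util.Iterator<String> it) {
      long n = 0L;
      while (it.hasNext()) {
        it.next(); // consume and discard the element
        n++;
      }
      return n;
    }

So count() does strictly less buffering on the executors and never moves the data to the driver at all.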