Thank you, Meisam! But I have found something interesting (for me, as a novice in Spark). Working with 400k elements, count() takes 30 seconds while .take(Int.MaxValue).size runs in less than a second! The problem appears when working with 1400k elements: .take(Int.MaxValue).size is not so quick anymore. Best regards, Valentin
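(A side note on why take() can look faster than count(): take(n) only evaluates as many elements/partitions as it needs to produce n results, while count() must process everything. The sketch below is a plain-Python analogy, not Spark code; the `expensive` function and the counters are hypothetical names used only to make the early-exit behaviour visible.)

```python
# Plain-Python analogy: "take(n)" on a lazy pipeline stops after n
# elements, whereas counting the whole pipeline touches every element.
from itertools import islice

calls = 0

def expensive(x):
    """Stand-in for a costly per-element computation; counts invocations."""
    global calls
    calls += 1
    return x * 2

# A lazy pipeline over 100,000 elements; nothing runs yet.
data = (expensive(x) for x in range(100_000))

# Like take(5): pulls only 5 elements through the pipeline, then stops.
first_five = list(islice(data, 5))

print(calls)       # only 5 calls were made
print(first_five)  # [0, 2, 4, 6, 8]
```

With 400k elements and Int.MaxValue as the argument, take() may still return early in some configurations, which would explain the sub-second timing; once the dataset grows, take() has to materialize everything and the difference to count() disappears.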
2013/11/14 Meisam Fathi <[email protected]>:
> Hi Valentin,
>
> data.filter() and rdd.map() do not actually do the computation. When
> you call count() or collect(), your RDD first does the filter(), then
> the map(), and then the count() or collect().
> See this for more info:
> https://github.com/mesos/spark/wiki/Spark-Programming-Guide#transformations
>
> Thanks,
> Meisam
>
> On Thu, Nov 14, 2013 at 2:02 PM, Valentin Michajlenko
> <[email protected]> wrote:
>> Hi!
>> I load data from a list ( sc.parallelize() ) with about 1,400,000
>> items. After that I run data.filter(func1).map(func2). This operation
>> runs in less than a second. But after that, count() (or
>> collect()) takes about 30 seconds. Please help me to reduce this
>> time!
>> Best Regards,
>> Valentin
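(The lazy-evaluation point Meisam describes can be illustrated without Spark at all. The sketch below uses plain-Python lazy iterators as an analogy: building the filter/map pipeline does no work, and only the final "action" that consumes the iterator triggers the computation. The `work` counter and `is_even` function are hypothetical names for illustration.)

```python
# Plain-Python analogy for Spark's transformations vs. actions:
# filter()/map() over an iterator are lazy; consuming the result
# (the "action") is what actually runs the functions.
work = 0

def is_even(x):
    """Stand-in for func1; counts how many times it actually runs."""
    global work
    work += 1
    return x % 2 == 0

nums = range(10)

# Build the pipeline: like data.filter(func1).map(func2) in Spark.
pipeline = map(lambda x: x + 1, filter(is_even, nums))

print(work)  # 0 -- nothing has executed yet

# The "action": consuming the iterator triggers filter, then map.
result = list(pipeline)

print(work)    # 10 -- is_even ran once per element, only now
print(result)  # [1, 3, 5, 7, 9]
```

This is why the filter/map step appears to run in under a second: it only builds the pipeline. The 30 seconds observed at count() or collect() is the cost of the whole computation, deferred to the first action.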
