I am a bit confused now. Are you calling count() on the same rdd that you have called take() on? Or are you calling it on a different rdd that you create from scratch with 400K/1400K elements? Can you share your code and your logs?
As a Spark application runs, you can see details about the progress of each stage in the web UI.

Thanks,
Meisam

On Thu, Nov 14, 2013 at 3:22 PM, Valentin Michajlenko <[email protected]> wrote:
> Thank you, Meisam! But I have found something interesting (for me, as a
> novice in Spark). Working with 400k elements, count() takes 30 secs
> and .take(Int.MaxValue).size takes less than a second!
> The problem comes when working with 1400k elements:
> .take(Int.MaxValue).size is not so quick.
> Best regards,
> Valentin
>
> 2013/11/14 Meisam Fathi <[email protected]>:
>> Hi Valentin,
>>
>> data.filter() and rdd.map() do not actually do the computation. When
>> you call count() or collect(), your RDD first does the filter(), then
>> the map(), and then the count() or collect().
>> See this for more info:
>> https://github.com/mesos/spark/wiki/Spark-Programming-Guide#transformations
>>
>> Thanks,
>> Meisam
>>
>> On Thu, Nov 14, 2013 at 2:02 PM, Valentin Michajlenko
>> <[email protected]> wrote:
>>> Hi!
>>> I load data from a list ( sc.parallelize() ) with about 1400000
>>> items. After that I run data.filter(func1).map(func2). This operation
>>> runs in less than a second. But after that, count() (or
>>> collect()) takes about 30 seconds. Please help me to reduce this
>>> time!
>>> Best Regards,
>>> Valentin
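The lazy-evaluation point above can be sketched as follows. This is a minimal spark-shell example, not Valentin's actual code: `sc` is assumed to be the SparkContext the shell provides, and `func1`/`func2` are stand-ins for the filter and map functions mentioned in the thread.

```scala
// Hypothetical stand-ins for the func1/func2 from the thread.
val func1 = (x: Int) => x % 2 == 0
val func2 = (x: Int) => x * 2

// Roughly Valentin's setup: an RDD of about 1.4M elements.
val data = sc.parallelize(1 to 1400000)

// Transformations are lazy: this line only records the lineage
// (filter, then map) and returns almost instantly, regardless of size.
val result = data.filter(func1).map(func2)

// Actions trigger the real work. count() and collect() must evaluate
// every partition, so the filter/map cost shows up here -- this is
// where the ~30 seconds is actually spent.
val n = result.count()

// take(k) is also an action, but it can stop as soon as it has k
// elements, often scanning only a few partitions. That is why take()
// can look much faster than count() on the same RDD.
val first10 = result.take(10)
```

So the timings in the thread are consistent with lazy evaluation: the filter/map line is cheap because nothing runs yet, and the first action pays for the whole pipeline.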
