Thank you, Meisam!

> Are you calling count() on the same rdd that you have called take() on? Or
> are you calling it on a different rdd that you create from scratch with
> 400K/1400K elements?

Yes, on the same; the 400K elements were a subset of the 1400K. I have found out that the data was not stored in memory. Now I store the data in memory first of all by calling cache().count() - thank you for the advice. Now everything (both count() and take()) runs fast.

While my task is running, unfortunately, I don't see anything in the web UI (the table of RDDs is empty), but maybe that is because I run a local server (and not a cluster with a master)?

Best regards,
Valentin
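The cache().count() pattern described above can be sketched as follows. This is a minimal sketch assuming a local SparkContext; the data values and the filter/map functions are placeholders, not the actual code from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheThenCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CacheThenCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 1400000)

    // cache() is itself lazy: it only marks the RDD for in-memory storage.
    // The count() action forces evaluation, so the partitions are
    // materialized in memory now rather than on the first later action.
    val cached = data.filter(_ % 2 == 0).map(_ * 2).cache()
    cached.count()

    // Subsequent actions read the cached partitions and run fast.
    cached.count()
    cached.take(10)

    sc.stop()
  }
}
```

Without the initial count(), the first "real" action would pay the full cost of filter() and map(), which matches the 30-second count() reported earlier in the thread.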
2013/11/15 Meisam Fathi <[email protected]>:
> I am a bit confused now. Are you calling count() on the same rdd that
> you have called take() on? Or are you calling it on a different rdd
> that you create from scratch with 400K/1400K elements? Can you share
> your code and your logs?
>
> As a Spark application runs, you can see details about the progress of
> each stage in the web UI.
>
> Thanks,
> Meisam
>
>
> On Thu, Nov 14, 2013 at 3:22 PM, Valentin Michajlenko
> <[email protected]> wrote:
>> Thank you, Meisam! But I have found something interesting (for me, as a
>> novice in Spark). Working with 400K elements, count() takes 30 seconds
>> and .take(Int.MaxValue).size takes less than a second!
>> The problem comes when working with 1400K elements -
>> .take(Int.MaxValue).size is not so quick.
>> Best regards,
>> Valentin
>>
>> 2013/11/14 Meisam Fathi <[email protected]>:
>>> Hi Valentin,
>>>
>>> data.filter() and rdd map() do not actually do the computation. When
>>> you call count() or collect(), your RDD first does the filter(), then
>>> the map(), and then the count() or collect().
>>> See this for more info:
>>> https://github.com/mesos/spark/wiki/Spark-Programming-Guide#transformations
>>>
>>> Thanks,
>>> Meisam
>>>
>>> On Thu, Nov 14, 2013 at 2:02 PM, Valentin Michajlenko
>>> <[email protected]> wrote:
>>>> Hi!
>>>> I load data from a list ( sc.parallelize() ) with about 1,400,000
>>>> items. After that I run data.filter(func1).map(func2). This operation
>>>> runs in less than a second. But after that, count() (or
>>>> collect() ) takes about 30 seconds. Please help me to reduce this
>>>> time!
>>>> Best regards,
>>>> Valentin
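Meisam's point about lazy evaluation can be illustrated with a small sketch. This assumes a local SparkContext, and the predicate/mapping functions are placeholders for the thread's func1 and func2:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LazyEvalSketch").setMaster("local[*]"))

    val data = sc.parallelize(1 to 1400000)

    // Transformations only build the RDD lineage; no work happens here,
    // which is why this line appears to finish in under a second.
    val transformed = data.filter(_ % 3 == 0).map(_ + 1)

    // An action triggers the whole pipeline: filter, then map, then count.
    // This is where the 30 seconds of the original question were spent.
    val n = transformed.count()

    // take(k) can be cheaper than count(): it may stop after scanning
    // only as many partitions as are needed to collect k elements.
    // With take(Int.MaxValue), however, every partition must be computed,
    // so the shortcut disappears on larger datasets.
    val firstTen = transformed.take(10)

    sc.stop()
  }
}
```

This also suggests why .take(Int.MaxValue).size looked fast on 400K elements but not on 1400K: the apparent speed of take() depends on how much of the dataset it actually has to compute, not on a fundamentally cheaper code path.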
