I am a bit confused now. Are you calling count() on the same rdd that you have called take() on? Or are you calling it on a different rdd that you create from scratch with 400K/1400K elements? Can you share your code and your logs?
As a Spark application runs, you can see details about the progress of each stage in the web UI.

Thanks,
Meisam

On Thu, Nov 14, 2013 at 3:22 PM, Valentin Michajlenko <[email protected]> wrote:
> Thank you, Meisam! But I have found something interesting (for me, as a
> novice in Spark). Working with 400k elements, count() takes 30 secs
> and .take(Int.MaxValue).size takes less than a second!
> The problem comes when working with 1400k elements:
> .take(Int.MaxValue).size is not so quick.
> Best regards,
> Valentin
>
> 2013/11/14 Meisam Fathi <[email protected]>:
>> Hi Valentin,
>>
>> data.filter() and rdd.map() do not actually do the computation. When
>> you call count() or collect(), your RDD first does the filter(), then
>> the map(), and then the count() or collect().
>> See this for more info:
>> https://github.com/mesos/spark/wiki/Spark-Programming-Guide#transformations
>>
>> Thanks,
>> Meisam
>>
>> On Thu, Nov 14, 2013 at 2:02 PM, Valentin Michajlenko
>> <[email protected]> wrote:
>>> Hi!
>>> I load data from a list ( sc.parallelize() ) with about 1400000
>>> items. After that I run data.filter(func1).map(func2). This operation
>>> runs in less than a second. But after that, count() (or
>>> collect()) takes about 30 seconds. Please help me to reduce this
>>> time!
>>> Best Regards,
>>> Valentin
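The lazy-evaluation point above can be sketched as follows. This is a minimal spark-shell example, not Valentin's actual code: `sc` is assumed to be the SparkContext the shell provides, and `func1`/`func2` are stand-ins for the filter and map functions mentioned in the thread.

```scala
// Hypothetical stand-ins for the func1/func2 from the thread.
val func1 = (x: Int) => x % 2 == 0
val func2 = (x: Int) => x * 2

// Roughly Valentin's setup: an RDD of about 1.4M elements.
val data = sc.parallelize(1 to 1400000)

// Transformations are lazy: this line only records the lineage
// (filter, then map) and returns almost instantly, regardless of size.
val result = data.filter(func1).map(func2)

// Actions trigger the real work. count() and collect() must evaluate
// every partition, so the filter/map cost shows up here -- this is
// where the ~30 seconds is actually spent.
val n = result.count()

// take(k) is also an action, but it can stop as soon as it has k
// elements, often scanning only a few partitions. That is why take()
// can look much faster than count() on the same RDD.
val first10 = result.take(10)
```

So the timings in the thread are consistent with lazy evaluation: the filter/map line is cheap because nothing runs yet, and the first action pays for the whole pipeline.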
