Thank you, Meisam!

> Are you calling count() on the same rdd that you have called take() on? Or
> are you calling it on a different rdd that you create from scratch with
> 400K/1400K elements?

Yes, on the same; the 400K elements were a subset of the 1400K. I have found out that the data was not stored in memory. Now I store the data in memory first of all by calling cache().count() - thank you for the advice. Now everything (both count() and take()) runs fast.

While my task is running, unfortunately, I don't see anything in the web UI (the table of RDDs is empty), but maybe that is because I run a local server (and not a cluster with a master)?

Best regards,
Valentin
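The cache().count() pattern described above can be sketched as follows. This is a minimal sketch assuming a local SparkContext; the data values and the filter/map functions are placeholders, not the actual code from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheThenCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CacheThenCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 1400000)

    // cache() is itself lazy: it only marks the RDD for in-memory storage.
    // The count() action forces evaluation, so the partitions are
    // materialized in memory now rather than on the first later action.
    val cached = data.filter(_ % 2 == 0).map(_ * 2).cache()
    cached.count()

    // Subsequent actions read the cached partitions and run fast.
    cached.count()
    cached.take(10)

    sc.stop()
  }
}
```

Without the initial count(), the first "real" action would pay the full cost of filter() and map(), which matches the 30-second count() reported earlier in the thread.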
2013/11/15 Meisam Fathi <[email protected]>:
> I am a bit confused now. Are you calling count() on the same rdd that
> you have called take() on? Or are you calling it on a different rdd
> that you create from scratch with 400K/1400K elements? Can you share
> your code and your logs?
>
> As a Spark application runs, you can see details about the progress of
> each stage in the web UI.
>
> Thanks,
> Meisam
>
>
> On Thu, Nov 14, 2013 at 3:22 PM, Valentin Michajlenko
> <[email protected]> wrote:
>> Thank you, Meisam! But I have found something interesting (for me, as a
>> novice in Spark). Working with 400K elements, count() takes 30 seconds
>> and .take(Int.MaxValue).size takes less than a second!
>> The problem comes when working with 1400K elements -
>> .take(Int.MaxValue).size is not so quick.
>> Best regards,
>> Valentin
>>
>> 2013/11/14 Meisam Fathi <[email protected]>:
>>> Hi Valentin,
>>>
>>> data.filter() and rdd map() do not actually do the computation. When
>>> you call count() or collect(), your RDD first does the filter(), then
>>> the map(), and then the count() or collect().
>>> See this for more info:
>>> https://github.com/mesos/spark/wiki/Spark-Programming-Guide#transformations
>>>
>>> Thanks,
>>> Meisam
>>>
>>> On Thu, Nov 14, 2013 at 2:02 PM, Valentin Michajlenko
>>> <[email protected]> wrote:
>>>> Hi!
>>>> I load data from a list ( sc.parallelize() ) with about 1,400,000
>>>> items. After that I run data.filter(func1).map(func2). This operation
>>>> runs in less than a second. But after that, count() (or
>>>> collect() ) takes about 30 seconds. Please help me to reduce this
>>>> time!
>>>> Best regards,
>>>> Valentin
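Meisam's point about lazy evaluation can be illustrated with a small sketch. This assumes a local SparkContext, and the predicate/mapping functions are placeholders for the thread's func1 and func2:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LazyEvalSketch").setMaster("local[*]"))

    val data = sc.parallelize(1 to 1400000)

    // Transformations only build the RDD lineage; no work happens here,
    // which is why this line appears to finish in under a second.
    val transformed = data.filter(_ % 3 == 0).map(_ + 1)

    // An action triggers the whole pipeline: filter, then map, then count.
    // This is where the 30 seconds of the original question were spent.
    val n = transformed.count()

    // take(k) can be cheaper than count(): it may stop after scanning
    // only as many partitions as are needed to collect k elements.
    // With take(Int.MaxValue), however, every partition must be computed,
    // so the shortcut disappears on larger datasets.
    val firstTen = transformed.take(10)

    sc.stop()
  }
}
```

This also suggests why .take(Int.MaxValue).size looked fast on 400K elements but not on 1400K: the apparent speed of take() depends on how much of the dataset it actually has to compute, not on a fundamentally cheaper code path.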
