Hi Matei,

Could you enlighten us on this, please?
Thanks,
Pierre

On 11 Apr 2014, at 14:49, Jérémy Subtil <jeremy.sub...@gmail.com> wrote:

> Hi Xusen,
>
> I was convinced the cache() method would involve in-memory-only operations
> and have nothing to do with disk, since the underlying default cache
> strategy is MEMORY_ONLY. Am I missing something?
>
>
> 2014-04-11 11:44 GMT+02:00 尹绪森 <yinxu...@gmail.com>:
> Hi Pierre,
>
> 1. cache() costs time to move data from disk into memory, so please do not
> use cache() if your job is not an iterative one.
>
> 2. If your dataset is larger than the available memory, a replacement
> strategy will exchange data between memory and disk.
>
>
> 2014-04-11 0:07 GMT+08:00 Pierre Borckmans
> <pierre.borckm...@realimpactanalytics.com>:
>
> Hi there,
>
> Just playing around in the Spark shell, I am now a bit confused by the
> performance I observe when the dataset does not fit into memory:
>
> - I load a dataset with roughly 500 million rows.
> - I do a count; it takes about 20 seconds.
> - Now if I cache the RDD and do a count again (which will try to cache the
>   data again), it takes roughly 90 seconds (the fraction cached is only 25%).
>   => Is this expected? To be roughly 5 times slower when caching and not
>   enough RAM is available?
> - The subsequent calls to count are also really slow: about 90 seconds as
>   well.
>   => I can see that the first 25% of tasks are fast (the ones dealing with
>   data in memory), but then it gets really slow…
>
> Am I missing something?
> I thought performance would decrease roughly linearly with the amount of
> data that fits into memory…
>
> Thanks for your help!
>
> Cheers
>
>
> Pierre Borckmans
>
> RealImpact Analytics | Brussels Office
> www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com
>
> FR +32 485 91 87 31 | Skype pierre.borckmans
>
>
> --
> Best Regards
> -----------------------------------
> Xusen Yin 尹绪森
> Intel Labs China
> Homepage: http://yinxusen.github.io/
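
For readers following along: below is a minimal Spark-shell sketch of the two storage levels being discussed (the input path and variable names are hypothetical). cache() is shorthand for persist(StorageLevel.MEMORY_ONLY): partitions that do not fit in memory are simply not cached, and are recomputed from the lineage on every subsequent action, which is consistent with the slow counts Pierre observes at 25% cached. persist(StorageLevel.MEMORY_AND_DISK) instead spills those partitions to local disk.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical input path -- replace with your own dataset.
val rdd = sc.textFile("hdfs:///data/big-dataset.txt")

// cache() is equivalent to persist(StorageLevel.MEMORY_ONLY):
// partitions that do not fit in memory are dropped, and each later
// action recomputes them from the source file.
rdd.cache()
rdd.count() // first count: scans the file and fills what memory it can
rdd.count() // later counts: cached fraction from memory, rest recomputed

// Alternative: spill partitions that do not fit to local disk,
// so later actions read them back from disk instead of recomputing.
val rdd2 = sc.textFile("hdfs:///data/big-dataset.txt")
  .persist(StorageLevel.MEMORY_AND_DISK)
rdd2.count()
```

Whether MEMORY_AND_DISK is actually faster than recomputation depends on how expensive the lineage is to recompute versus local disk throughput; for a plain textFile-plus-count job the two can be comparable.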