Hi Xusen, I was convinced that cache() involves only in-memory operations and has nothing to do with disk, since the underlying default storage level is MEMORY_ONLY. Am I missing something?
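For reference, this is the kind of comparison I had in mind in the shell (a minimal sketch; the dataset path is a placeholder, and `sc` is the SparkContext the shell provides):

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical dataset path, just for illustration.
val rdd = sc.textFile("hdfs:///path/to/big-dataset")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
// partitions that do not fit in memory are simply not cached, and are
// recomputed from the source on every subsequent action.
rdd.cache()
rdd.count()

// MEMORY_AND_DISK instead spills partitions that do not fit to local disk,
// so later counts read them back rather than recomputing them.
val rdd2 = sc.textFile("hdfs:///path/to/big-dataset")
  .persist(StorageLevel.MEMORY_AND_DISK)
rdd2.count()
```

My understanding was that with the MEMORY_ONLY default there is no disk involved at all, which is why the slowdown surprised me.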
2014-04-11 11:44 GMT+02:00 尹绪森 <yinxu...@gmail.com>:

> Hi Pierre,
>
> 1. cache() would cost time to carry stuff from disk to memory, so please
> do not use cache() if your job is not an iterative one.
>
> 2. If your dataset is larger than the memory amount, then there will be a
> replacement strategy to exchange data between memory and disk.
>
>
> 2014-04-11 0:07 GMT+08:00 Pierre Borckmans <
> pierre.borckm...@realimpactanalytics.com>:
>
>> Hi there,
>>
>> Just playing around in the Spark shell, I am now a bit confused by the
>> performance I observe when the dataset does not fit into memory:
>>
>> - I load a dataset with roughly 500 million rows.
>> - I do a count; it takes about 20 seconds.
>> - Now if I cache the RDD and do a count again (which will try to cache
>> the data again), it takes roughly 90 seconds (the fraction cached is
>> only 25%).
>> => Is this expected? To be roughly 5 times slower when caching and not
>> enough RAM is available?
>> - The subsequent calls to count are also really slow: about 90 seconds
>> as well.
>> => I can see that the first 25% of tasks are fast (the ones dealing with
>> data in memory), but then it gets really slow…
>>
>> Am I missing something?
>> I thought performance would decrease kind of linearly with the amount of
>> data that fits into memory…
>>
>> Thanks for your help!
>>
>> Cheers
>>
>> *Pierre Borckmans*
>>
>> *Real**Impact* Analytics *| *Brussels Office
>> www.realimpactanalytics.com *| *pierre.borckm...@realimpactanalytics.com
>>
>> *FR *+32 485 91 87 31 *| **Skype* pierre.borckmans
>>
>
> --
> Best Regards
> -----------------------------------
> Xusen Yin 尹绪森
> Intel Labs China
> Homepage: *http://yinxusen.github.io/ <http://yinxusen.github.io/>*