One reason could be that Spark uses scratch disk space for intermediate calculations, so as you perform calculations that data needs to be flushed before memory can be used for other operations. A second issue could be that large intermediate data pushes more of the RDD onto disk (something I see a lot in warehouse use cases). Can you check in the Storage tab how much of the RDD is in memory on each subsequent count, and how much intermediate data is generated each time?

On Apr 11, 2014 9:22 AM, "Pierre Borckmans" <pierre.borckm...@realimpactanalytics.com> wrote:
> Hi Matei,
>
> Could you enlighten us on this please?
>
> Thanks
>
> Pierre
>
> On 11 Apr 2014, at 14:49, Jérémy Subtil <jeremy.sub...@gmail.com> wrote:
>
> Hi Xusen,
>
> I was convinced the cache() method would involve in-memory-only operations
> and have nothing to do with disks, as the underlying default cache strategy
> is MEMORY_ONLY. Am I missing something?
>
>
> 2014-04-11 11:44 GMT+02:00 尹绪森 <yinxu...@gmail.com>:
>
>> Hi Pierre,
>>
>> 1. cache() costs time to carry data from disk to memory, so please do not
>> use cache() if your job is not an iterative one.
>>
>> 2. If your dataset is larger than the available memory, there will be a
>> replacement strategy to exchange data between memory and disk.
>>
>>
>> 2014-04-11 0:07 GMT+08:00 Pierre Borckmans <pierre.borckm...@realimpactanalytics.com>:
>>
>>> Hi there,
>>>
>>> Just playing around in the Spark shell, I am now a bit confused by the
>>> performance I observe when the dataset does not fit into memory:
>>>
>>> - I load a dataset with roughly 500 million rows.
>>> - I do a count; it takes about 20 seconds.
>>> - Now if I cache the RDD and do a count again (which will try to cache the
>>> data again), it takes roughly 90 seconds (the fraction cached is only 25%).
>>> => Is this expected? To be roughly 5 times slower when caching and not
>>> enough RAM is available?
>>> - The subsequent calls to count are also really slow: about 90 seconds
>>> as well.
>>> => I can see that the first 25% of tasks are fast (the ones dealing with
>>> data in memory), but then it gets really slow…
>>>
>>> Am I missing something?
>>> I thought performance would decrease roughly linearly with the amount of
>>> data that fits into memory…
>>>
>>> Thanks for your help!
>>>
>>> Cheers
>>>
>>> Pierre Borckmans
>>>
>>> RealImpact Analytics | Brussels Office
>>> www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com
>>>
>>> FR +32 485 91 87 31 | Skype pierre.borckmans
>>
>>
>> --
>> Best Regards
>> -----------------------------------
>> Xusen Yin 尹绪森
>> Intel Labs China
>> Homepage: http://yinxusen.github.io/
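
For anyone reproducing this in the Spark shell, here is a minimal sketch along the lines discussed above. The input path is hypothetical, and MEMORY_AND_DISK is shown only as an alternative to the MEMORY_ONLY level that cache() uses by default; with MEMORY_ONLY, partitions that do not fit are not cached and are recomputed on every count, whereas MEMORY_AND_DISK spills them to disk instead.

    import org.apache.spark.storage.StorageLevel

    // Hypothetical input path, standing in for the ~500 million row dataset.
    val rdd = sc.textFile("/path/to/large-dataset")

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
    // MEMORY_AND_DISK spills partitions that don't fit in memory to disk
    // rather than dropping them and recomputing on each action.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    rdd.count()   // first count materialises the cache

    // Programmatic equivalent of the Storage tab: how much of each RDD is
    // held in memory versus on disk after each count.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
              s"memory=${info.memSize} bytes, disk=${info.diskSize} bytes")
    }

Running the getRDDStorageInfo snippet after each count should show whether the cached fraction grows between counts and how much ends up on disk, which would help confirm the explanation above.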