Hi Matei,

Could you enlighten us on this please?

Thanks

Pierre

On 11 Apr 2014, at 14:49, Jérémy Subtil <jeremy.sub...@gmail.com> wrote:

> Hi Xusen,
> 
> I was convinced the cache() method would only involve in-memory operations
> and have nothing to do with disk, since the underlying default storage level
> is MEMORY_ONLY. Am I missing something?
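> 
> If it helps, here is a minimal sketch in the Scala shell (the RDD below is
> just a placeholder) showing that cache() is simply shorthand for
> persist(StorageLevel.MEMORY_ONLY):
> 
>     import org.apache.spark.storage.StorageLevel
> 
>     val rdd = sc.parallelize(1 to 1000000)   // placeholder RDD
>     rdd.persist(StorageLevel.MEMORY_ONLY)    // what cache() does under the hood
>     println(rdd.getStorageLevel)             // confirms: memory only, no disk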
> 
> 
> 2014-04-11 11:44 GMT+02:00 尹绪森 <yinxu...@gmail.com>:
> Hi Pierre,
> 
> 1. cache() takes time to move data from disk into memory, so please do not
> use cache() if your job is not an iterative one.
> 
> 2. If your dataset is larger than the available memory, a replacement strategy
> will exchange data between memory and disk (see the sketch below).
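> 
> For example, here is a minimal sketch (the input path is hypothetical) of
> persisting with MEMORY_AND_DISK, so that partitions which do not fit in
> memory are spilled to local disk instead of being dropped and recomputed:
> 
>     import org.apache.spark.storage.StorageLevel
> 
>     val data = sc.textFile("hdfs:///path/to/dataset")  // hypothetical path
>     data.persist(StorageLevel.MEMORY_AND_DISK)
>     data.count()  // first action materializes the cache (memory + local disk)
>     data.count()  // later actions read cached blocks instead of recomputing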
> 
> 
> 2014-04-11 0:07 GMT+08:00 Pierre Borckmans 
> <pierre.borckm...@realimpactanalytics.com>:
> 
> Hi there,
> 
> Just playing around in the Spark shell, I am now a bit confused by the
> performance I observe when the dataset does not fit into memory:
> 
> - I load a dataset with roughly 500 million rows.
> - I do a count; it takes about 20 seconds.
> - Now if I cache the RDD and do a count again (which will try to cache the
> data), it takes roughly 90 seconds (the fraction cached is only 25%).
>       => Is this expected? Roughly 5 times slower when caching without enough
> RAM available?
> - The subsequent calls to count are also really slow: about 90 seconds as
> well.
>       => I can see that the first 25% of tasks are fast (the ones dealing with
> data in memory), but then it gets really slow…
> 
> Am I missing something?
> I thought performance would degrade roughly linearly with the amount of data
> that fits into memory…
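> 
> For reference, the shell session described above boils down to something like
> this (the path is just a placeholder; the comments restate the timings I
> observed):
> 
>     val rows = sc.textFile("hdfs:///path/to/500M-row-dataset")  // placeholder path
>     rows.count()  // ~20 seconds, reading straight from disk
>     rows.cache()  // default storage level for an RDD, i.e. MEMORY_ONLY
>     rows.count()  // ~90 seconds, and only ~25% of the data ends up cached
>     rows.count()  // still ~90 seconds on subsequent calls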
> 
> Thanks for your help!
> 
> Cheers
> 
> 
> Pierre Borckmans
> 
> RealImpact Analytics | Brussels Office
> www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com
> 
> FR +32 485 91 87 31 | Skype pierre.borckmans
> 
> 
> -- 
> Best Regards
> -----------------------------------
> Xusen Yin    尹绪森
> Intel Labs China
> Homepage: http://yinxusen.github.io/
> 
