Hi Pierre,

1. cache() costs time to move data from disk into memory, so please do not
use cache() if your job is not an iterative one.

2. If your dataset is larger than the available memory, a replacement
strategy is used to exchange data between memory and disk (see the sketch
below).
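
To make both points concrete, here is a minimal Spark-shell sketch in Scala.
The input path and variable names are only placeholders; sc is the
SparkContext the shell already provides.

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///path/to/events")  // placeholder path, ~500M rows

// Single pass over the data: no cache() needed, it only adds the cost
// of materialising the blocks in memory.
rdd.count()

// Iterative / repeated use: with the default cache() (MEMORY_ONLY),
// partitions that do not fit in memory are simply not cached and are
// recomputed from the source on every later action.
rdd.cache()
rdd.count()   // pays the extra cost of filling the cache
rdd.count()   // reuses only the fraction that actually fit in memory

// If the dataset is larger than memory, MEMORY_AND_DISK spills the
// remaining partitions to local disk instead of recomputing them.
rdd.unpersist()   // the storage level cannot be changed while cached
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()

You can check the Storage tab of the web UI to see which fraction of the RDD
ended up in memory versus on disk.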


2014-04-11 0:07 GMT+08:00 Pierre Borckmans <
pierre.borckm...@realimpactanalytics.com>:

> Hi there,
>
> Just playing around in the Spark shell, I am now a bit confused by the
> performance I observe when the dataset does not fit into memory:
>
> - I load a dataset with roughly 500 million rows.
> - I do a count; it takes about 20 seconds.
> - Now if I cache the RDD and do a count again (which will try to cache the
> data), it takes roughly 90 seconds (the fraction cached is only 25%).
>  => Is this expected? To be roughly 5 times slower when caching and not
> enough RAM is available?
> - The subsequent calls to count are also really slow: about 90 seconds as
> well.
>  => I can see that the first 25% of tasks are fast (the ones dealing with
> data in memory), but then it gets really slow…
>
> Am I missing something?
> I thought performance would degrade roughly linearly with the amount of
> data that fits into memory…
>
> Thanks for your help!
>
> Cheers
>
>
>
>
>
>  *Pierre Borckmans*
>
> *Real**Impact* Analytics *|* Brussels Office
> www.realimpactanalytics.com *|* pierre.borckm...@realimpactanalytics.com
>
> *FR* +32 485 91 87 31 *|* *Skype* pierre.borckmans


-- 
Best Regards
-----------------------------------
Xusen Yin    尹绪森
Intel Labs China
Homepage: http://yinxusen.github.io/
