Hi Xusen,

I was under the impression that the cache() method involves in-memory
operations only and has nothing to do with disk, since the default storage
level is MEMORY_ONLY. Am I missing something?
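For reference, here is a minimal sketch of what I mean (sc is the usual shell SparkContext, and the path is made up):

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical input path, just to illustrate the point.
val rdd = sc.textFile("hdfs:///some/large/dataset")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
// partitions that do not fit in memory are simply not cached, and
// are recomputed from the lineage on later actions -- no disk spill.
rdd.cache()

// To actually spill overflowing partitions to local disk instead of
// recomputing them, one would have to ask for it explicitly:
// rdd.persist(StorageLevel.MEMORY_AND_DISK)
```

So as far as I can tell, with the default level the "missing" 75% should be recomputed, not read back from a Spark-managed disk cache.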


2014-04-11 11:44 GMT+02:00 尹绪森 <yinxu...@gmail.com>:

> Hi Pierre,
>
> 1. cache() costs time to move data from disk into memory, so please do
> not use cache() unless your job is iterative.
>
> 2. If your dataset is larger than the available memory, a replacement
> strategy will exchange data between memory and disk.
>
>
> 2014-04-11 0:07 GMT+08:00 Pierre Borckmans <
> pierre.borckm...@realimpactanalytics.com>:
>
> Hi there,
>>
>> Just playing around in the Spark shell, I am now a bit confused by the
>> performance I observe when the dataset does not fit into memory:
>>
>> - I load a dataset with roughly 500 million rows
>> - I do a count; it takes about 20 seconds
>> - now if I cache the RDD and do a count again (which will try to cache
>> the data again), it takes roughly 90 seconds (the fraction cached is only 25%).
>>  => is this expected? To be roughly 5 times slower when caching and not
>> enough RAM is available?
>> - the subsequent calls to count are also really slow: about 90 seconds
>> as well.
>>  => I can see that the first 25% of tasks are fast (the ones dealing with
>> data in memory), but then it gets really slow…
>>
>> Am I missing something?
>> I thought performance would degrade roughly linearly with the amount of
>> data that fits into memory…
>>
>> Thanks for your help!
>>
>> Cheers
>>
>>  *Pierre Borckmans*
>>
>> *Real**Impact* Analytics *| *Brussels Office
>> www.realimpactanalytics.com *| *pierre.borckm...@realimpactanalytics.com
>>
>> *FR *+32 485 91 87 31 *| **Skype* pierre.borckmans
>>
>
>
> --
> Best Regards
> -----------------------------------
> Xusen Yin    尹绪森
> Intel Labs China
> Homepage: http://yinxusen.github.io/
>
