Re: Question about Serialization in Storage Level

2015-05-27 Thread Imran Rashid
Hi Zhipeng,

Yes, your understanding is correct.  The "SER" portion just refers to how
it's stored in memory.  On disk, the data always has to be serialized.
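
A rough sketch of the distinction in practice (assuming a spark-shell with a
SparkContext named sc; illustrative only, not a prescription):

  import org.apache.spark.storage.StorageLevel

  val rdd = sc.parallelize(1 to 1000000)

  // MEMORY_AND_DISK: the in-memory copy is deserialized Java objects;
  // partitions that spill to disk are still written in serialized form.
  rdd.persist(StorageLevel.MEMORY_AND_DISK)

  // MEMORY_AND_DISK_SER: the in-memory copy is also kept serialized
  // (one byte array per partition), trading CPU time for a smaller footprint.
  // (An RDD can only have one storage level, so this is shown commented out.)
  // rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)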


On Fri, May 22, 2015 at 10:40 PM, Jiang, Zhipeng wrote:

>  Hi Todd, Howard,
>
>
>
> Thanks for your reply; I might not have presented my question clearly.
>
>
>
> What I mean is, if I call *rdd.persist(StorageLevel.MEMORY_AND_DISK)*,
> the BlockManager will cache the RDD in the MemoryStore, and partitions will
> be migrated to the DiskStore when they cannot fit in memory. I think this
> migration does require data serialization and compression (if
> spark.rdd.compress is set to true). So the data on disk is serialized, even
> if I didn't choose a serialized storage level, am I right?
>
>
>
> Thanks,
>
> Zhipeng
>
>
>
>
>
> *From:* Todd Nist [mailto:tsind...@gmail.com]
> *Sent:* Thursday, May 21, 2015 8:49 PM
> *To:* Jiang, Zhipeng
> *Cc:* user@spark.apache.org
> *Subject:* Re: Question about Serialization in Storage Level
>
>
>
> From the docs,
> https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
> :
>
>
>
> *Storage Level*
>
> *Meaning*
>
> MEMORY_ONLY
>
> Store RDD as deserialized Java objects in the JVM. If the RDD does not fit
> in memory, some partitions will not be cached and will be recomputed on the
> fly each time they're needed. This is the default level.
>
> MEMORY_AND_DISK
>
> Store RDD as *deserialized* Java objects in the JVM. If the RDD does not
> fit in memory, store the partitions that don't fit on disk, and read them
> from there when they're needed.
>
> MEMORY_ONLY_SER
>
> Store RDD as *serialized* Java objects (one byte array per partition).
> This is generally more space-efficient than deserialized objects,
> especially when using a fast serializer
> <https://spark.apache.org/docs/latest/tuning.html>, but more
> CPU-intensive to read.
>
> MEMORY_AND_DISK_SER
>
> Similar to *MEMORY_ONLY_SER*, but spill partitions that don't fit in
> memory to disk instead of recomputing them on the fly each time they're
> needed.
>
>
>
> On Thu, May 21, 2015 at 3:52 AM, Jiang, Zhipeng wrote:
>
>  Hi there,
>
>
>
> This question may seem to be kind of naïve, but what’s the difference
> between *MEMORY_AND_DISK* and *MEMORY_AND_DISK_SER*?
>
>
>
> If I call *rdd.persist(StorageLevel.MEMORY_AND_DISK)*, the BlockManager
> won’t serialize the *rdd*?
>
>
>
> Thanks,
>
> Zhipeng
>
>
>


RE: Question about Serialization in Storage Level

2015-05-22 Thread Jiang, Zhipeng
Hi Todd, Howard,

Thanks for your reply; I might not have presented my question clearly.

What I mean is, if I call rdd.persist(StorageLevel.MEMORY_AND_DISK), the
BlockManager will cache the RDD in the MemoryStore, and partitions will be
migrated to the DiskStore when they cannot fit in memory. I think this
migration does require data serialization and compression (if
spark.rdd.compress is set to true). So the data on disk is serialized, even
if I didn't choose a serialized storage level, am I right?
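
For reference, a minimal sketch of the setup I have in mind (assuming the
property name spark.rdd.compress and a standalone driver; adjust to your
deployment):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  // spark.rdd.compress (default false) asks Spark to compress serialized
  // RDD blocks, at the cost of extra CPU time.
  val conf = new SparkConf()
    .setAppName("persist-example")
    .set("spark.rdd.compress", "true")
  val sc = new SparkContext(conf)

  val rdd = sc.parallelize(1 to 1000000)
  rdd.persist(StorageLevel.MEMORY_AND_DISK)
  rdd.count()  // materialize the cache so blocks are actually stored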

Thanks,
Zhipeng


From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Thursday, May 21, 2015 8:49 PM
To: Jiang, Zhipeng
Cc: user@spark.apache.org
Subject: Re: Question about Serialization in Storage Level

From the docs,  
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence:

Storage Level

Meaning

MEMORY_ONLY

Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in 
memory, some partitions will not be cached and will be recomputed on the fly 
each time they're needed. This is the default level.

MEMORY_AND_DISK

Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in 
memory, store the partitions that don't fit on disk, and read them from there 
when they're needed.

MEMORY_ONLY_SER

Store RDD as serialized Java objects (one byte array per partition). This is 
generally more space-efficient than deserialized objects, especially when using 
a fast serializer<https://spark.apache.org/docs/latest/tuning.html>, but more 
CPU-intensive to read.

MEMORY_AND_DISK_SER

Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to 
disk instead of recomputing them on the fly each time they're needed.


On Thu, May 21, 2015 at 3:52 AM, Jiang, Zhipeng <zhipeng.ji...@intel.com> wrote:
Hi there,

This question may seem to be kind of naïve, but what’s the difference between 
MEMORY_AND_DISK and MEMORY_AND_DISK_SER?

If I call rdd.persist(StorageLevel.MEMORY_AND_DISK), the BlockManager won’t 
serialize the rdd?

Thanks,
Zhipeng



Re: Question about Serialization in Storage Level

2015-05-21 Thread Todd Nist
From the docs,
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence:

MEMORY_ONLY

Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in
memory, some partitions will not be cached and will be recomputed on the fly
each time they're needed. This is the default level.

MEMORY_AND_DISK

Store RDD as *deserialized* Java objects in the JVM. If the RDD does not fit
in memory, store the partitions that don't fit on disk, and read them from
there when they're needed.

MEMORY_ONLY_SER

Store RDD as *serialized* Java objects (one byte array per partition). This
is generally more space-efficient than deserialized objects, especially when
using a fast serializer <https://spark.apache.org/docs/latest/tuning.html>,
but more CPU-intensive to read.

MEMORY_AND_DISK_SER

Similar to *MEMORY_ONLY_SER*, but spill partitions that don't fit in memory
to disk instead of recomputing them on the fly each time they're needed.
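
As a rough illustration of where the _SER suffix shows up in the API (field
names from org.apache.spark.storage.StorageLevel in Spark 1.x), the constants
differ only in their deserialized flag:

  import org.apache.spark.storage.StorageLevel

  // Both levels use memory and spill to disk in the same way...
  StorageLevel.MEMORY_AND_DISK.useDisk          // true
  StorageLevel.MEMORY_AND_DISK_SER.useDisk      // true

  // ...but only the plain level keeps the in-memory copy as Java objects.
  StorageLevel.MEMORY_AND_DISK.deserialized     // true
  StorageLevel.MEMORY_AND_DISK_SER.deserialized // false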

On Thu, May 21, 2015 at 3:52 AM, Jiang, Zhipeng wrote:

>  Hi there,
>
>
>
> This question may seem to be kind of naïve, but what’s the difference
> between *MEMORY_AND_DISK* and *MEMORY_AND_DISK_SER*?
>
>
>
> If I call *rdd.persist(StorageLevel.MEMORY_AND_DISK)*, the BlockManager
> won’t serialize the *rdd*?
>
>
>
> Thanks,
>
> Zhipeng
>