Running a simple test - here is the Stack Overflow code snippet using .count() 
as the action.  You can see the differences between the storage levels.

print(spark.version)
2.4.3

# id 3 => using default storage level for df (memory_and_disk); unsure why the 
# storage level is not serialized since I am using PySpark
df = spark.range(10)
print(type(df))
df.cache().count()
print(df.storageLevel)

# id 15 => using default storage level for rdd (memory_only) and makes sense 
# why it is serialized
rdd = df.rdd
print(type(rdd))
rdd.cache().collect()

# id 19 => manually configuring to (memory_and_disk) which makes the storage 
# level serialized
from pyspark import StorageLevel
df2 = spark.range(100)
print(type(df2))
df2.persist(StorageLevel.MEMORY_AND_DISK).count()
print(df2.storageLevel)

<class 'pyspark.sql.dataframe.DataFrame'>
Disk Memory Deserialized 1x Replicated
<class 'pyspark.rdd.RDD'>
<class 'pyspark.sql.dataframe.DataFrame'>
Disk Memory Serialized 1x Replicated
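
For reference, the mismatch seems to come down to which MEMORY_AND_DISK is in 
play. Here is a minimal sketch against the same session as above; the note 
about cache() delegating to the JVM is my reading of the 2.4 source, so treat 
it as an assumption:

from pyspark import StorageLevel

# PySpark builds its constants as
# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1);
# MEMORY_AND_DISK is defined with deserialized=False, i.e. serialized.
lvl = StorageLevel.MEMORY_AND_DISK
print(lvl.useDisk, lvl.useMemory, lvl.deserialized)  # expected: True True False

# df.cache() never touches this Python constant: it appears to delegate to
# the JVM Dataset, whose default MEMORY_AND_DISK has deserialized=true,
# which would explain the "Deserialized" label in the Storage tab above.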

> On Sep 16, 2019, at 2:02 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> I don’t know your full source code, but you may be missing an action, which 
> is needed so that it is indeed persisted.
> 
>> Am 16.09.2019 um 02:07 schrieb grp <gpete...@villanova.edu>:
>> 
>> Hi There Spark Users,
>> 
>> Curious what is going on here.  Not sure if this is a possible bug or if I 
>> am missing something.  Extra eyes are much appreciated.
>> 
>> The Spark UI (Python API 2.4.3) by default reports persisted DataFrames as 
>> deserialized MEMORY_AND_DISK; however, I always thought they were serialized 
>> by default for Python, according to the official documentation.
>> When I explicitly set the storage level to that same default … ex => 
>> df.persist(StorageLevel.MEMORY_AND_DISK) … the Spark UI shows the expected 
>> serialized DataFrame under the Storage tab, but not when just calling … 
>> df.cache().
>> 
>> Do we have to explicitly set to … StorageLevel.MEMORY_AND_DISK … to get the 
>> serialized benefit in Python (which I thought was automatic)?  Or is the 
>> Spark UI incorrect?
>> 
>> SO post with specific example/details => 
>> https://stackoverflow.com/questions/56926337/conflicting-pyspark-storage-level-defaults
>> 
>> Thank you for your time and research!
>> 
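
Regarding the original question: a quick check on the RDD side (a sketch 
against the same session; the rdd2 name is just for illustration) suggests 
the Python RDD default really is serialized:

rdd2 = spark.sparkContext.parallelize(range(10))
rdd2.cache().count()
# getStorageLevel() reads back the level that cache() actually set
print(rdd2.getStorageLevel())  # expected: Memory Serialized 1x Replicated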
