Running a simple test - here is the Stack Overflow code snippet, using .count() as the action. You can see the differences between the storage levels.
print(spark.version)
2.4.3

# id 3 => using the default storage level for the df (memory_and_disk);
# unsure why the storage level is not serialized, since I am using PySpark
df = spark.range(10)
print(type(df))
df.cache().count()
print(df.storageLevel)

# id 15 => using the default storage level for the rdd (memory_only);
# makes sense why it is serialized
rdd = df.rdd
print(type(rdd))
rdd.cache().collect()

# id 19 => manually configuring (memory_and_disk), which makes the storage level serialized
from pyspark import StorageLevel
df2 = spark.range(100)
print(type(df2))
df2.persist(StorageLevel.MEMORY_AND_DISK).count()
print(df2.storageLevel)

Output:

<class 'pyspark.sql.dataframe.DataFrame'>
Disk Memory Deserialized 1x Replicated
<class 'pyspark.rdd.RDD'>
<class 'pyspark.sql.dataframe.DataFrame'>
Disk Memory Serialized 1x Replicated

(A self-contained version of this repro is sketched below the quoted thread.)

> On Sep 16, 2019, at 2:02 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> I don't know your full source code, but you may be missing an action, so that it is indeed persisted.
>
>> On Sep 16, 2019, at 02:07, grp <gpete...@villanova.edu> wrote:
>>
>> Hi there Spark users,
>>
>> Curious what is going on here. Not sure if this is a possible bug or if I am missing something. Extra eyes are much appreciated.
>>
>> The Spark UI (Python API 2.4.3) by default reports persisted data-frames as deserialized MEMORY_AND_DISK; however, I always thought they were serialized for Python by default, according to the official documentation. When I explicitly set the storage level to the default, e.g. df.persist(StorageLevel.MEMORY_AND_DISK), the Spark UI shows the expected serialized data-frame under the Storage tab, but not when just calling df.cache().
>>
>> Do we have to explicitly set StorageLevel.MEMORY_AND_DISK to get the serialized benefit in Python (which I thought was automatic)? Or is the Spark UI incorrect?
>>
>> SO post with the specific example/details =>
>> https://stackoverflow.com/questions/56926337/conflicting-pyspark-storage-level-defaults
>>
>> Thank you for your time and research!
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
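
For completeness, here is a self-contained version of the repro above - a minimal sketch, assuming a local PySpark 2.4.x environment (the SparkSession setup and app name are mine, not from the original snippet). It compares the level that df.cache() reports against PySpark's own StorageLevel.MEMORY_AND_DISK constant:

# Minimal repro sketch, assuming local PySpark 2.4.x.
from pyspark.sql import SparkSession
from pyspark import StorageLevel

# Hypothetical local session just for the repro; any existing session works too.
spark = (SparkSession.builder
         .master("local[1]")
         .appName("storage-level-repro")
         .getOrCreate())

df = spark.range(10)
df.cache().count()  # cache() delegates to the JVM-side Dataset default

# A StorageLevel carries five fields:
# useDisk, useMemory, useOffHeap, deserialized, replication
lvl = df.storageLevel
print(lvl)               # Disk Memory Deserialized 1x Replicated
print(lvl.deserialized)  # True -> what the Storage tab shows for cache()

# PySpark's own constant sets deserialized=False, because the StorageLevel
# constants were written for RDDs, whose data is always serialized
# (pickled) on the Python side:
print(StorageLevel.MEMORY_AND_DISK)               # Disk Memory Serialized 1x Replicated
print(StorageLevel.MEMORY_AND_DISK.deserialized)  # False

spark.stop()

If that reading is right, the two UI results are at least internally consistent: DataFrame.cache() goes through the JVM Dataset default (deserialized, since the data lives in the JVM), while passing the Python StorageLevel.MEMORY_AND_DISK constant explicitly requests the serialized variant.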