[ https://issues.apache.org/jira/browse/SPARK-30580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-30580.
----------------------------------
    Resolution: Invalid

> Why can PySpark persist data only in serialised format?
> -------------------------------------------------------
>
>                 Key: SPARK-30580
>                 URL: https://issues.apache.org/jira/browse/SPARK-30580
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Francesco Cavrini
>            Priority: Minor
>              Labels: performance
>
> The storage levels in PySpark allow persisting data only in serialised
> format. There is also [a comment|https://github.com/apache/spark/blob/master/python/pyspark/storagelevel.py#L28]
> explicitly stating that "Since the data is always serialized on the Python
> side, all the constants use the serialized formats." While that makes total
> sense for RDDs, it is not clear to me why it is not possible to persist data
> without serialisation when using the DataFrame/Dataset APIs. In theory, in
> such cases, persist would only be a directive and the data would never leave
> the JVM, thus allowing for unserialised persistence, correct? Many thanks
> for the feedback!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
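The trade-off the question turns on is that PySpark RDD caching stores pickled bytes on the Python side, so every access pays a deserialisation cost, whereas a deserialised cache just keeps live object references. A minimal pure-Python sketch of that distinction, assuming nothing about Spark internals (the `SerializedCache`/`RawCache` classes and the sample data are hypothetical, for illustration only):

```python
import pickle

class SerializedCache:
    """Stores pickled bytes, mimicking PySpark's serialized-only
    storage levels: every read pays a deserialisation cost and
    returns a fresh copy of the data."""
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = pickle.dumps(value)  # serialise on write

    def get(self, key):
        return pickle.loads(self._store[key])   # deserialise on every read

class RawCache:
    """Keeps live object references, analogous to a deserialised
    (e.g. MEMORY_ONLY) storage level inside the JVM."""
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store[key]

# Hypothetical partition data, standing in for cached rows.
data = [{"id": i, "value": i * 2} for i in range(3)]

ser, raw = SerializedCache(), RawCache()
ser.put("part-0", data)
raw.put("part-0", data)

assert ser.get("part-0") == raw.get("part-0")  # same logical content
assert ser.get("part-0") is not data           # serialized cache yields a copy
assert raw.get("part-0") is data               # raw cache yields the same object
```

The questioner's point is that for DataFrames the cached rows live in the JVM either way, so the Python-side serialisation constraint that motivates the serialized-only constants for RDDs would not seem to apply.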