[jira] [Comment Edited] (SPARK-12635) More efficient (column batch) serialization for Python/R
[ https://issues.apache.org/jira/browse/SPARK-12635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095494#comment-15095494 ]

Sun Rui edited comment on SPARK-12635 at 1/13/16 2:35 AM:
--

[~dselivanov] PySpark uses pickle and CloudPickle on the Python side and net.razorvine.pickle on the JVM side to serialize and deserialize data between Python and the JVM. However, there is no library similar to net.razorvine.pickle that can deserialize from and serialize to the R serialization format. So currently SparkR depends on readBin()/writeBin() on the R side and Java DataInputStream/DataOutputStream on the JVM side for serialization/deserialization between R and the JVM, based on the fact that simple types such as integer, double, and byte array share the same format on both sides.

For collect(), the serialization/deserialization happens together with the communication over the socket. I suspect there is a lot of communication overhead from the many socket reads/writes. Maybe we can change this to work in batches, that is, serialize part of the collected result into an in-memory buffer and transfer it back. Would you be interested in doing a prototype to see whether there is any performance improvement?

Another idea would be to introduce something like net.razorvine.pickle, but that sounds like a lot of effort.

was (Author: sunrui):
[~dselivanov] PySpark uses pickle and CloudPickle on the Python side and net.razorvine.pickle on the JVM side to serialize and deserialize data between Python and the JVM. However, there is no library similar to net.razorvine.pickle that can deserialize from and serialize to the R serialization format. So currently SparkR depends on readBin()/writeBin() on the R side and DataInputStream/DataOutputStream on the JVM side for serialization/deserialization between R and the JVM, based on the fact that simple types such as integer, double, and byte array share the same format on both sides.

For collect(), the serialization/deserialization happens together with the communication over the socket. I suspect there is a lot of communication overhead from the many socket reads/writes. Maybe we can change this to work in batches, that is, serialize part of the collected result into an in-memory buffer and transfer it back. Would you be interested in doing a prototype to see whether there is any performance improvement?

Another idea would be to introduce something like net.razorvine.pickle, but that sounds like a lot of effort.

> More efficient (column batch) serialization for Python/R
>
> Key: SPARK-12635
> URL: https://issues.apache.org/jira/browse/SPARK-12635
> Project: Spark
> Issue Type: New Feature
> Components: PySpark, SparkR, SQL
> Reporter: Reynold Xin
>
> Serialization between Scala / Python / R is pretty slow. Python and R both
> work pretty well with column batch interface (e.g. numpy arrays). Technically
> we should be able to just pass column batches around with minimal
> serialization (maybe even zero copy memory).
> Note that this depends on some internal refactoring to use a column batch
> interface in Spark SQL.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
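To make the batching idea from the comment above concrete, here is a minimal sketch in Python. It is not SparkR or Spark code: `CountingSink` is a hypothetical stand-in for a socket, and the length-prefixed framing in `send_batched` is an assumption made only for illustration. The one grounded detail is the byte layout: `struct.pack(">d", ...)` produces the same big-endian 8-byte doubles that Java's DataOutputStream.writeDouble writes.

```python
import io
import struct


class CountingSink:
    """Hypothetical stand-in for a socket: records how many writes it receives."""

    def __init__(self):
        self.writes = 0
        self.data = bytearray()

    def write(self, payload):
        self.writes += 1
        self.data += payload
        return len(payload)


def send_per_value(sink, values):
    """One write per double, big-endian (same layout as DataOutputStream.writeDouble)."""
    for v in values:
        sink.write(struct.pack(">d", v))


def send_batched(sink, values, batch_size):
    """Serialize up to batch_size doubles into an in-memory buffer, then write once per batch."""
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        buf = io.BytesIO()
        # Length prefix so a reader knows how many doubles follow (assumed framing).
        buf.write(struct.pack(">i", len(chunk)))
        buf.write(struct.pack(">%dd" % len(chunk), *chunk))
        sink.write(buf.getvalue())


values = [float(i) for i in range(1000)]
per_value, batched = CountingSink(), CountingSink()
send_per_value(per_value, values)
send_batched(batched, values, batch_size=250)
print(per_value.writes, batched.writes)
```

With 1000 values and a batch size of 250, the per-value path issues 1000 writes and the batched path issues 4, at the cost of a 4-byte length prefix per batch. Whether that translates into a real speedup for collect() is exactly what the proposed prototype would have to measure.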
[jira] [Comment Edited] (SPARK-12635) More efficient (column batch) serialization for Python/R
[ https://issues.apache.org/jira/browse/SPARK-12635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093987#comment-15093987 ]

Dmitriy Selivanov edited comment on SPARK-12635 at 1/12/16 2:48 PM:

Hi!
First, thanks to all SparkR and Spark developers. I have just started to evaluate SparkR. I tried it several times (since it was in AMPLab), but before 1.6 there were too many rough edges, so I used the Scala API.
For now I see two main limiting issues (and they are interconnected):
1. Lack of UDFs in the R interface. I saw SPARK-6817.
2. And, I think more important, lack of fast serialization / deserialization.

I believe it is impossible to develop a useful R UDF interface without fast serialization / deserialization.

Consider the following example. I have a tiny cached Spark DF with nrow=300k, ncol=25, and I want to collect it to a local R session:
{code:R}
df_local <- collect(df)
{code}
The resulting R data.frame is only ~70 MB!!, but it takes **120 sec** to collect it to R (compared to **7 sec** for df.toPandas() in PySpark).

I did some profiling. Almost all of the time is spent in these calls: collect -> callJStatic -> invokeJava -> readObject. readObject makes a lot of read* calls from [deserialize.R](https://github.com/apache/spark/blob/c3d505602de2fd2361633f90e4fff7e041849e28/R/pkg/R/deserialize.R).
So for now it is **much** faster to write the Spark data.frame to plain csv/json and then read it in R.
I haven't read the Python serialization code. Is it different from R's? Why is there such a dramatic difference between R and Python?

cc [~sunrui]

was (Author: dselivanov):
Hi!
First, thanks to all SparkR and Spark developers. I have just started to evaluate SparkR. I tried it several times (since it was in AMPLab), but before 1.6 there were too many rough edges, so I used the Scala API.
For now I see two main limiting issues (and they are interconnected):
1. Lack of UDFs in the R interface. I saw SPARK-6817.
2. And, I think more important, lack of fast serialization / deserialization.

I believe it is impossible to develop a useful R UDF interface without fast serialization / deserialization. cc [~sunrui]

Consider the following example. I have a tiny cached Spark DF with nrow=300k, ncol=25, and I want to collect it to a local R session:
{code:R}
df_local <- collect(df)
{code}
The resulting R data.frame is only ~70 MB!!, but it takes **120 sec** to collect it to R (compared to **7 sec** for df.toPandas() in PySpark).

I did some profiling. Almost all of the time is spent in these calls: collect -> callJStatic -> invokeJava -> readObject. readObject makes a lot of read* calls from [deserialize.R](https://github.com/apache/spark/blob/c3d505602de2fd2361633f90e4fff7e041849e28/R/pkg/R/deserialize.R).
So for now it is **much** faster to write the Spark data.frame to plain csv/json and then read it in R.
I haven't read the Python serialization code. Is it different from R's? Why is there such a dramatic difference between R and Python?
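The profile described above (many small read* calls, one per value) and the column-batch idea in the issue description can be contrasted with a small Python sketch. This is an illustration under assumptions, not the SparkR or PySpark wire format: the function names and the `(nrow, ncol)` header are hypothetical, and only double columns are handled. Values are encoded as big-endian doubles, the layout Java's DataOutputStream uses.

```python
import struct


def encode_row_wise(rows):
    """Each value written individually as a big-endian double, row after row."""
    return b"".join(struct.pack(">d", v) for row in rows for v in row)


def decode_row_wise(payload, ncol):
    """One unpack call per value: nrow * ncol calls in total."""
    values = [struct.unpack_from(">d", payload, off)[0]
              for off in range(0, len(payload), 8)]
    return [values[i:i + ncol] for i in range(0, len(values), ncol)]


def encode_column_batch(rows):
    """Each column written as one contiguous block of big-endian doubles."""
    nrow = len(rows)
    columns = list(zip(*rows))
    header = struct.pack(">ii", nrow, len(columns))  # assumed header: nrow, ncol
    return header + b"".join(struct.pack(">%dd" % nrow, *col) for col in columns)


def decode_column_batch(payload):
    """One unpack call per column: ncol calls in total."""
    nrow, ncol = struct.unpack_from(">ii", payload, 0)
    columns = []
    for c in range(ncol):
        offset = 8 + c * nrow * 8  # skip the 8-byte header, then c full columns
        columns.append(list(struct.unpack_from(">%dd" % nrow, payload, offset)))
    return columns


rows = [[float(r * 10 + c) for c in range(3)] for r in range(4)]
row_result = decode_row_wise(encode_row_wise(rows), ncol=3)
col_result = decode_column_batch(encode_column_batch(rows))
```

For a 300k x 25 frame, the row-wise path would make 7.5 million decode calls while the column-batch path would make 25. That difference in call count is the kind of overhead the profile points at, although the actual gain in R would need to be measured.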