[ https://issues.apache.org/jira/browse/SPARK-22367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221899#comment-16221899 ]
Apache Spark commented on SPARK-22367: -------------------------------------- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/19586 > Separate the serialization of class and object for iteraor > ---------------------------------------------------------- > > Key: SPARK-22367 > URL: https://issues.apache.org/jira/browse/SPARK-22367 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.2.0 > Reporter: Xianyang Liu > > Becuase they are all the same class for an iterator. So there is no need > write class information for every record in the iterator. We only need write > the class information once at the serialization beginning, also only need > read the class information once for deserialization. > In this patch, we separate the serialization of class and object for an > iterator serialized by Kryo. This can improve the performance of the > serialization and deserialization, and save the space. > Test case: > ```scala > val conf = new SparkConf().setAppName("Test for serialization") > val sc = new SparkContext(conf) > val random = new Random(1) > val data = sc.parallelize(1 to 1000000000).map { i => > Person("id-" + i, random.nextInt(Integer.MAX_VALUE)) > }.persist(StorageLevel.OFF_HEAP) > var start = System.currentTimeMillis() > data.count() > println("First time: " + (System.currentTimeMillis() - start)) > start = System.currentTimeMillis() > data.count() > println("Second time: " + (System.currentTimeMillis() - start)) > ``` > Test result: > The size of serialized: > before: 34.3GB > after: 17.5GB > | before(cal+serialization)| before(deserialization)| > after(cal+serialization)| after(deserialization) | > | ------| ------ | ------ | ------ | > | 63869| 21882| 45513| 15158| > | 59368| 21507| 51683| 15524| > | 66230| 21481| 62163| 14903| > | 62399| 22529| 52400| 16255| > | 137564.2 | 136990.8 | 1.004186 | -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org