Hi all,

SparkSQL usually creates DataFrames whose rows are GenericRowWithSchema (is that right?). Row is the common superclass of GenericRow and GenericRowWithSchema; the only difference is that GenericRowWithSchema also carries its schema as a StructType. But a DataFrame has only one schema, so each row should not have to carry it: StructType is heavy, and most RDDs contain many rows.

To test this, I did the following (a minimal sketch follows):

1) create a DataFrame and call .rdd (an RDD[Row]) => rows are GenericRowWithSchema
2) dataframe.map(row => Row.fromSeq(row.toSeq)) => plain GenericRow
3) dataframe.map(row => row.toSeq) => just the underlying Seq of each row
4) measure with saveAsObjectFile, or with org.apache.spark.util.SizeEstimator.estimate

My results (DataFrame with 5 columns):

GenericRowWithSchema => 13 GB
GenericRow => 8.2 GB
Seq => 7 GB
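For reference, here is roughly what I mean, as a minimal sketch; it assumes an existing DataFrame named df, and the sample size and output paths are just placeholders, not what I actually used:

import org.apache.spark.sql.Row
import org.apache.spark.util.SizeEstimator

// df.rdd gives an RDD[Row] whose rows are GenericRowWithSchema
val withSchema = df.rdd
// Rebuild each row as a plain GenericRow, dropping the schema reference
val plainRows  = df.rdd.map(row => Row.fromSeq(row.toSeq))
// Keep only the underlying values of each row
val seqsOnly   = df.rdd.map(row => row.toSeq)

// SizeEstimator walks the in-memory object graph, so estimate on a
// collected sample (sample size is arbitrary here):
println(SizeEstimator.estimate(withSchema.take(1000)))
println(SizeEstimator.estimate(plainRows.take(1000)))
println(SizeEstimator.estimate(seqsOnly.take(1000)))

// Alternatively, serialize each RDD and compare on-disk sizes:
withSchema.saveAsObjectFile("/tmp/withSchema")
plainRows.saveAsObjectFile("/tmp/plainRows")
seqsOnly.saveAsObjectFile("/tmp/seqsOnly")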
Best regards,
Kevin