Hi all,

SparkSQL usually creates DataFrames whose rows are GenericRowWithSchema (is that right?). Row is the common superclass of GenericRow and GenericRowWithSchema; the only difference is that GenericRowWithSchema also carries its schema as a StructType. But a DataFrame has only one schema, so each row should not have to carry it: StructType is heavy, and most RDDs contain many rows.

To test this, I did the following (a minimal sketch follows):

1) create a DataFrame and call .rdd (an RDD[Row]) => rows are GenericRowWithSchema
2) dataframe.map(row => Row.fromSeq(row.toSeq)) => plain GenericRow
3) dataframe.map(row => row.toSeq) => just the underlying Seq of each row
4) measure with saveAsObjectFile, or with org.apache.spark.util.SizeEstimator.estimate

My results (DataFrame with 5 columns):

GenericRowWithSchema => 13 GB
GenericRow => 8.2 GB
Seq => 7 GB
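For reference, here is roughly what I mean, as a minimal sketch; it assumes an existing DataFrame named df, and the sample size and output paths are just placeholders, not what I actually used:

import org.apache.spark.sql.Row
import org.apache.spark.util.SizeEstimator

// df.rdd gives an RDD[Row] whose rows are GenericRowWithSchema
val withSchema = df.rdd
// Rebuild each row as a plain GenericRow, dropping the schema reference
val plainRows  = df.rdd.map(row => Row.fromSeq(row.toSeq))
// Keep only the underlying values of each row
val seqsOnly   = df.rdd.map(row => row.toSeq)

// SizeEstimator walks the in-memory object graph, so estimate on a
// collected sample (sample size is arbitrary here):
println(SizeEstimator.estimate(withSchema.take(1000)))
println(SizeEstimator.estimate(plainRows.take(1000)))
println(SizeEstimator.estimate(seqsOnly.take(1000)))

// Alternatively, serialize each RDD and compare on-disk sizes:
withSchema.saveAsObjectFile("/tmp/withSchema")
plainRows.saveAsObjectFile("/tmp/plainRows")
seqsOnly.saveAsObjectFile("/tmp/seqsOnly")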
Best regards,
Kevin