Hi Takeshi,

Thank you for your comment. I changed it to an RDD and it's a lot better.
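In case it helps anyone else, below is roughly what the change looked like. The schema, field names, and paths are made up for illustration, not the real ones; the idea is just to key both sides by the join column and do the join at the RDD level (with Kryo enabled), so the nested values are shuffled as serialized case class instances instead of Spark's internal rows:

    import org.apache.spark.sql.SparkSession

    // Hypothetical nested schema standing in for the real one.
    case class Detail(score: Double, tags: List[String])
    case class Record(id: Long, attrs: Map[String, String], details: List[Detail])

    val spark = SparkSession.builder()
      .appName("rdd-join-sketch")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // often gives smaller shuffle payloads
      .getOrCreate()
    import spark.implicits._

    val left  = spark.read.parquet("/data/left").as[Record]
    val right = spark.read.parquet("/data/right").as[Record]

    // Key both sides by the join column and join at the RDD level.
    val joined = left.rdd.keyBy(_.id)
      .join(right.rdd.keyBy(_.id))        // RDD[(Long, (Record, Record))]
    joined.take(10).foreach(println)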
Zhuo

On Fri, Nov 25, 2016 at 7:04 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
> Hi,
>
> I think this is just the overhead of representing nested elements as
> internal rows at runtime (e.g., null bits are consumed for each nested
> element). Moreover, in Parquet format, nested data are stored columnar
> and highly compressed, so it becomes very compact.
>
> But I'm not sure about a better approach in this case.
>
> // maropu
>
> On Sat, Nov 26, 2016 at 11:16 AM, taozhuo <taoz...@gmail.com> wrote:
>> The Dataset is defined as a case class with many fields with nested
>> structure (Map, List of another case class, etc.).
>> The size of the Dataset is only 1T when saved to disk as a Parquet file.
>> But when joining it, the shuffle write size becomes as large as 12T.
>> Is there a way to cut it down without changing the schema? If not, what
>> is the best practice when designing complex schemas?
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-shuffle-write-size-so-large-when-joining-Dataset-with-nested-structure-tp28136.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> --
> ---
> Takeshi Yamamuro
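P.S. For anyone who finds this thread later, here is a minimal sketch of the setup being discussed, again with a made-up schema in place of the real one. The Parquet files stay compact because nested fields are stored columnar and compressed, while the join has to shuffle each nested element in Spark's internal row format, which is where the gap between on-disk size and shuffle write size comes from:

    import org.apache.spark.sql.SparkSession

    // Hypothetical schema standing in for the nested case class described above.
    case class Item(name: String, props: Map[String, String])
    case class Event(userId: Long, items: List[Item], meta: Map[String, Long])

    val spark = SparkSession.builder().appName("nested-shuffle-sketch").getOrCreate()
    import spark.implicits._

    // Compact on disk: Parquet stores the nested fields columnar and compressed.
    val events = spark.read.parquet("/data/events").as[Event]
    val other  = spark.read.parquet("/data/other").as[Event]

    // The join shuffles each record, including every nested element, in
    // Spark's internal row format, so the shuffle write can be far larger
    // than the Parquet footprint.
    val joined = events.joinWith(other, events("userId") === other("userId"))
    joined.count()   // check "Shuffle Write" for this stage in the Spark UI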