Hi Takeshi,

Thank you for your comment. I changed it to an RDD and it's a lot better.
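In case it helps anyone else, below is roughly what the change looked like. The schema, field names, and paths are made up for illustration, not the real ones; the idea is just to key both sides by the join column and do the join at the RDD level (with Kryo enabled), so the nested values are shuffled as serialized case class instances instead of Spark's internal rows:

    import org.apache.spark.sql.SparkSession

    // Hypothetical nested schema standing in for the real one.
    case class Detail(score: Double, tags: List[String])
    case class Record(id: Long, attrs: Map[String, String], details: List[Detail])

    val spark = SparkSession.builder()
      .appName("rdd-join-sketch")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // often gives smaller shuffle payloads
      .getOrCreate()
    import spark.implicits._

    val left  = spark.read.parquet("/data/left").as[Record]
    val right = spark.read.parquet("/data/right").as[Record]

    // Key both sides by the join column and join at the RDD level.
    val joined = left.rdd.keyBy(_.id)
      .join(right.rdd.keyBy(_.id))        // RDD[(Long, (Record, Record))]
    joined.take(10).foreach(println)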
Zhuo

On Fri, Nov 25, 2016 at 7:04 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
> Hi,
>
> I think this is just the overhead of representing nested elements as
> internal rows at runtime (e.g., null bits are consumed for each nested
> element). Moreover, in Parquet format, nested data are stored columnar
> and highly compressed, so it becomes very compact.
>
> But I'm not sure about a better approach in this case.
>
> // maropu
>
> On Sat, Nov 26, 2016 at 11:16 AM, taozhuo <taoz...@gmail.com> wrote:
>> The Dataset is defined as a case class with many fields with nested
>> structure (Map, List of another case class, etc.).
>> The size of the Dataset is only 1T when saved to disk as a Parquet file.
>> But when joining it, the shuffle write size becomes as large as 12T.
>> Is there a way to cut it down without changing the schema? If not, what
>> is the best practice when designing complex schemas?
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-shuffle-write-size-so-large-when-joining-Dataset-with-nested-structure-tp28136.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> --
> ---
> Takeshi Yamamuro
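P.S. For anyone who finds this thread later, here is a minimal sketch of the setup being discussed, again with a made-up schema in place of the real one. The Parquet files stay compact because nested fields are stored columnar and compressed, while the join has to shuffle each nested element in Spark's internal row format, which is where the gap between on-disk size and shuffle write size comes from:

    import org.apache.spark.sql.SparkSession

    // Hypothetical schema standing in for the nested case class described above.
    case class Item(name: String, props: Map[String, String])
    case class Event(userId: Long, items: List[Item], meta: Map[String, Long])

    val spark = SparkSession.builder().appName("nested-shuffle-sketch").getOrCreate()
    import spark.implicits._

    // Compact on disk: Parquet stores the nested fields columnar and compressed.
    val events = spark.read.parquet("/data/events").as[Event]
    val other  = spark.read.parquet("/data/other").as[Event]

    // The join shuffles each record, including every nested element, in
    // Spark's internal row format, so the shuffle write can be far larger
    // than the Parquet footprint.
    val joined = events.joinWith(other, events("userId") === other("userId"))
    joined.count()   // check "Shuffle Write" for this stage in the Spark UI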