Re: Why is shuffle write size so large when joining Dataset with nested structure?

2016-11-27 Thread Zhuo Tao
Hi Takeshi,

Thank you for your comment. I changed it to an RDD and it's a lot better.

Zhuo

On Fri, Nov 25, 2016 at 7:04 PM, Takeshi Yamamuro wrote:

> Hi,
>
> I think this is just the overhead of representing nested elements as internal
> rows at runtime
> (e.g., it consumes null bits for each nested element).
> Moreover, in the Parquet format, nested data is stored columnar and highly
> compressed,
> so it is very compact on disk.
>
> That said, I'm not sure of a better approach in this case.
>
> // maropu
>
>
> On Sat, Nov 26, 2016 at 11:16 AM, taozhuo  wrote:
>
>> The Dataset is defined as a case class with many fields that have nested
>> structure (Map, List of another case class, etc.).
>> The size of the Dataset is only 1T when saved to disk as a Parquet file,
>> but when joining it, the shuffle write size becomes as large as 12T.
>> Is there a way to cut it down without changing the schema? If not, what is
>> the best practice when designing complex schemas?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-shuffle-write-size-so-large-when-joining-Dataset-with-nested-structure-tp28136.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: spark-submit hangs forever after all tasks finish(spark 2.0.0 stable version on yarn)

2016-07-31 Thread Zhuo Tao
Yarn client

On Sunday, July 31, 2016, Pradeep  wrote:

> Hi,
>
> Are you running on yarn-client or cluster mode?
>
> Pradeep
>
> > On Jul 30, 2016, at 7:34 PM, taozhuo wrote:
> >
> > Below are the debug messages that seem to repeat infinitely:
> >
> >
> > 16/07/30 23:25:38 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
> > 16/07/30 23:25:39 DEBUG Client: IPC Client (1735131305) connection to /10.80.1.168:8032 from zhuotao sending #147247
> > 16/07/30 23:25:39 DEBUG Client: IPC Client (1735131305) connection to /10.80.1.168:8032 from zhuotao got value #147247
> > 16/07/30 23:25:39 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
> > 16/07/30 23:25:40 DEBUG Client: IPC Client (1735131305) connection to /10.80.1.168:8032 from zhuotao sending #147248
> > 16/07/30 23:25:40 DEBUG Client: IPC Client (1735131305) connection to /10.80.1.168:8032 from zhuotao got value #147248
> > 16/07/30 23:25:40 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms
> > [... the same sending / got value / getApplicationReport cycle repeats once per second with incrementing call ids (#147249 through #147256 in the original log) ...]
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-submit-hangs-forever-after-all-tasks-finish-spark-2-0-0-stable-version-on-yarn-tp27436.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> >
>
>
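[Archive editor's note: the thread ends here without a posted resolution. One commonly suggested mitigation, an assumption on this editor's part rather than anything Pradeep or Zhuo confirmed, follows from the log: in yarn-client mode the launcher process stays alive and polls the ResourceManager via getApplicationReport, which is exactly the repeating call above. Submitting in cluster mode, optionally with `spark.yarn.submit.waitAppCompletion=false`, lets the local process exit once the application is accepted.]

```shell
# Hedged sketch, not from the thread: run the driver inside YARN
# (cluster mode) and tell the launcher not to wait for completion,
# so spark-submit returns instead of polling getApplicationReport.
# "com.example.Main" and "my-app.jar" are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --class com.example.Main \
  my-app.jar
```

Note the trade-off: with cluster mode the driver logs live on the YARN node rather than in the local terminal, so retrieve them with `yarn logs -applicationId <appId>`.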