Re: how to save spark files as parquets efficiently
Great! Common sense is very uncommon.

On Fri, Jul 29, 2016 at 8:26 PM, Ewan Leith wrote:

> If you replace the df.write ….
>
> With
>
> df.count()
>
> in your code you'll see how much time is taken to process the full execution plan without the write output.
>
> That code below looks perfectly normal for writing a parquet file, yes; there shouldn't be any tuning needed for "normal" performance.
>
> Thanks,
> Ewan
>
> From: Sumit Khanna [mailto:sumit.kha...@askme.in]
> Sent: 29 July 2016 13:41
> To: Gourav Sengupta
> Cc: user
> Subject: Re: how to save spark files as parquets efficiently
>
> Hey Gourav,
>
> Well, I think it is my execution plan that is at fault. df.write shows up as a Spark job on localhost:4040/ and, being an action, will include the time taken for all the umpteen transformations before it, right? All I wanted to know is what env/config params are needed to simply read a dataframe from parquet and save it back as another parquet (a vanilla load/store, no transformations). Is it good enough to simply read and write in the form given in the Spark tutorial docs, i.e.
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ??
>
> Thanks,
>
> On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> The default write format in Spark is parquet, and I have never faced any issues writing over a billion records in Spark. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron?
>
> Regards,
> Gourav Sengupta
>
> On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna wrote:
>
> Hey,
>
> master=yarn
> mode=cluster
>
> spark.executor.memory=8g
> spark.rpc.netty.dispatcher.numThreads=2
>
> This is all a POC on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>
> No doubt the whole execution plan gets triggered on this write/save action, but is it the right command / set of params to save a dataframe?
>
> Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure if the write itself takes that much time or some optimization is needed for the upsert (I have asked that as another question altogether).
>
> Thanks,
> Sumit
RE: how to save spark files as parquets efficiently
If you replace the df.write ….

With

df.count()

in your code you'll see how much time is taken to process the full execution plan without the write output.

That code below looks perfectly normal for writing a parquet file, yes; there shouldn't be any tuning needed for "normal" performance.

Thanks,
Ewan

From: Sumit Khanna [mailto:sumit.kha...@askme.in]
Sent: 29 July 2016 13:41
To: Gourav Sengupta
Cc: user
Subject: Re: how to save spark files as parquets efficiently

Hey Gourav,

Well, I think it is my execution plan that is at fault. df.write shows up as a Spark job on localhost:4040/ and, being an action, will include the time taken for all the umpteen transformations before it, right? All I wanted to know is what env/config params are needed to simply read a dataframe from parquet and save it back as another parquet (a vanilla load/store, no transformations). Is it good enough to simply read and write in the form given in the Spark tutorial docs, i.e.

df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ??

Thanks,

On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi,

The default write format in Spark is parquet, and I have never faced any issues writing over a billion records in Spark. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron?

Regards,
Gourav Sengupta

On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna <sumit.kha...@askme.in> wrote:

Hey,

master=yarn
mode=cluster

spark.executor.memory=8g
spark.rpc.netty.dispatcher.numThreads=2

This is all a POC on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:

df.write.format("parquet").mode("overwrite").save(hdfspathTemp)

No doubt the whole execution plan gets triggered on this write/save action, but is it the right command / set of params to save a dataframe?

Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure if the write itself takes that much time or some optimization is needed for the upsert (I have asked that as another question altogether).

Thanks,
Sumit
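A minimal sketch of the measurement Ewan suggests, in Scala, assuming a Spark 2.x SparkSession; the paths, the df stand-in, and the timing helper are illustrative assumptions, not code from the thread:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("write-timing").getOrCreate()

  // Hypothetical paths; hdfspathTemp mirrors the name used in the thread.
  val hdfsPathIn   = "hdfs:///data/source"
  val hdfspathTemp = "hdfs:///data/tmp_out"

  // Small helper that prints how long a block takes to run.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
    result
  }

  // Stand-in for whatever DataFrame the real job builds before writing.
  val df = spark.read.parquet(hdfsPathIn)

  // 1) Force the full execution plan without producing any output.
  time("count only") { df.count() }

  // 2) Force the plan again, this time including the parquet write.
  time("parquet write") {
    df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
  }

If the count alone already takes most of the 1.8 hours, the cost is in the upstream transformations rather than in the parquet writer.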
Re: how to save spark files as parquets efficiently
Hey Gourav,

Well, I think it is my execution plan that is at fault. df.write shows up as a Spark job on localhost:4040/ and, being an action, will include the time taken for all the umpteen transformations before it, right? All I wanted to know is what env/config params are needed to simply read a dataframe from parquet and save it back as another parquet (a vanilla load/store, no transformations). Is it good enough to simply read and write in the form given in the Spark tutorial docs, i.e.

df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ??

Thanks,

On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta wrote:

> Hi,
>
> The default write format in Spark is parquet, and I have never faced any issues writing over a billion records in Spark. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron?
>
> Regards,
> Gourav Sengupta
>
> On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna wrote:
>
>> Hey,
>>
>> master=yarn
>> mode=cluster
>>
>> spark.executor.memory=8g
>> spark.rpc.netty.dispatcher.numThreads=2
>>
>> This is all a POC on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:
>>
>> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>>
>> No doubt the whole execution plan gets triggered on this write/save action, but is it the right command / set of params to save a dataframe?
>>
>> Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure if the write itself takes that much time or some optimization is needed for the upsert (I have asked that as another question altogether).
>>
>> Thanks,
>> Sumit
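For the vanilla load/store case Sumit asks about, the tutorial-style calls really are the whole job; a rough sketch in Scala with hypothetical paths, assuming nothing beyond a working Spark 2.x session:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("parquet-roundtrip").getOrCreate()

  // Plain read-then-write with no transformations in between; paths are hypothetical.
  val df = spark.read.parquet("hdfs:///data/source")
  df.write.format("parquet").mode("overwrite").save("hdfs:///data/tmp_out")

  // Shorthand that does the same thing:
  // df.write.mode("overwrite").parquet("hdfs:///data/tmp_out")

No special env/config parameters are needed for this; if a plain round trip like this is fast while the real job is not, the time is going into the transformations the write action triggers, not into the save itself.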
Re: how to save spark files as parquets efficiently
Hi,

The default write format in Spark is parquet, and I have never faced any issues writing over a billion records in Spark. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron?

Regards,
Gourav Sengupta

On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna wrote:

> Hey,
>
> master=yarn
> mode=cluster
>
> spark.executor.memory=8g
> spark.rpc.netty.dispatcher.numThreads=2
>
> This is all a POC on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>
> No doubt the whole execution plan gets triggered on this write/save action, but is it the right command / set of params to save a dataframe?
>
> Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure if the write itself takes that much time or some optimization is needed for the upsert (I have asked that as another question altogether).
>
> Thanks,
> Sumit
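To illustrate Gourav's point about the default format: save() with no explicit format falls back to spark.sql.sources.default, which is "parquet" out of the box, so the calls below are interchangeable. This is a sketch only; df and hdfspathTemp stand for whatever DataFrame and output path the job actually uses:

  // Equivalent ways of writing df to hdfspathTemp as parquet, assuming
  // spark.sql.sources.default has been left at its default value ("parquet").
  df.write.mode("overwrite").save(hdfspathTemp)
  df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
  df.write.mode("overwrite").parquet(hdfspathTemp)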
Re: how to save spark files as parquets efficiently
Hey,

So I believe this is the right way to save the file; if any optimization is needed, it is not in the write itself but in the head/body of my execution plan, isn't it?

Thanks,

On Fri, Jul 29, 2016 at 11:57 AM, Sumit Khanna wrote:

> Hey,
>
> master=yarn
> mode=cluster
>
> spark.executor.memory=8g
> spark.rpc.netty.dispatcher.numThreads=2
>
> This is all a POC on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>
> No doubt the whole execution plan gets triggered on this write/save action, but is it the right command / set of params to save a dataframe?
>
> Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure if the write itself takes that much time or some optimization is needed for the upsert (I have asked that as another question altogether).
>
> Thanks,
> Sumit
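Since the upsert is where most of the execution plan lives, here is one common way to express it, purely as a sketch: the key column "id", the paths, and the Spark 2.x session are assumptions, not Sumit's actual code. The idea is to keep every delta row plus only those existing rows whose key is absent from the delta, then write the result to a temporary path:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("upsert-sketch").getOrCreate()

  // Hypothetical locations and key column; both datasets are assumed to share
  // the same schema and column order.
  val existing = spark.read.parquet("hdfs:///data/current")  // snapshot already on HDFS
  val delta    = spark.read.parquet("hdfs:///data/delta")    // changes from the current run

  // Existing rows whose key does NOT appear in the delta, then all delta rows.
  val unchanged = existing.join(delta.select("id"), Seq("id"), "left_anti")
  val upserted  = delta.union(unchanged)

  // Write to a temporary location: overwriting the directory being read within
  // the same job is unsafe, which may be why the thread writes to hdfspathTemp.
  upserted.write.format("parquet").mode("overwrite").save("hdfs:///data/tmp_out")

On a single node, the shuffle for a join like this over the full snapshot is a far more likely home for the 1.8 hours than the parquet write that follows it.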