Re: how to save spark files as parquets efficiently
Great! Common sense is very uncommon.

On Fri, Jul 29, 2016 at 8:26 PM, Ewan Leith wrote:

> If you replace the df.write ….
>
> With
>
> df.count()
>
> in your code you'll see how much time is taken to process the full execution plan without the write output.
>
> That code below looks perfectly normal for writing a parquet file, yes; there shouldn't be any tuning needed for "normal" performance.
>
> Thanks,
> Ewan
>
> From: Sumit Khanna [mailto:sumit.kha...@askme.in]
> Sent: 29 July 2016 13:41
> To: Gourav Sengupta
> Cc: user
> Subject: Re: how to save spark files as parquets efficiently
>
> Hey Gourav,
>
> Well, I think it is my execution plan that is at fault. df.write shows up as a Spark job on localhost:4040/ and, being an action, will include the time taken for all the umpteen transformations before it, right? All I wanted to know is what env/config params are needed to simply read a dataframe from parquet and save it back as another parquet (a vanilla load/store, no transformations). Is it good enough to simply read and write in the form given in the Spark tutorial docs, i.e.
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ??
>
> Thanks,
>
> On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> The default write format in Spark is parquet, and I have never faced any issues writing over a billion records in Spark. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron?
>
> Regards,
> Gourav Sengupta
>
> On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna wrote:
>
> Hey,
>
> master=yarn
> mode=cluster
>
> spark.executor.memory=8g
> spark.rpc.netty.dispatcher.numThreads=2
>
> This is all a POC on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>
> No doubt the whole execution plan gets triggered on this write/save action, but is it the right command / set of params to save a dataframe?
>
> Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure if the write itself takes that much time or some optimization is needed for the upsert (I have asked that as another question altogether).
>
> Thanks,
> Sumit
RE: how to save spark files as parquets efficiently
If you replace the df.write ….

With

df.count()

in your code you'll see how much time is taken to process the full execution plan without the write output.

That code below looks perfectly normal for writing a parquet file, yes; there shouldn't be any tuning needed for "normal" performance.

Thanks,
Ewan

From: Sumit Khanna [mailto:sumit.kha...@askme.in]
Sent: 29 July 2016 13:41
To: Gourav Sengupta
Cc: user
Subject: Re: how to save spark files as parquets efficiently

Hey Gourav,

Well, I think it is my execution plan that is at fault. df.write shows up as a Spark job on localhost:4040/ and, being an action, will include the time taken for all the umpteen transformations before it, right? All I wanted to know is what env/config params are needed to simply read a dataframe from parquet and save it back as another parquet (a vanilla load/store, no transformations). Is it good enough to simply read and write in the form given in the Spark tutorial docs, i.e.

df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ??

Thanks,

On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi,

The default write format in Spark is parquet, and I have never faced any issues writing over a billion records in Spark. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron?

Regards,
Gourav Sengupta

On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna <sumit.kha...@askme.in> wrote:

Hey,

master=yarn
mode=cluster

spark.executor.memory=8g
spark.rpc.netty.dispatcher.numThreads=2

This is all a POC on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:

df.write.format("parquet").mode("overwrite").save(hdfspathTemp)

No doubt the whole execution plan gets triggered on this write/save action, but is it the right command / set of params to save a dataframe?

Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure if the write itself takes that much time or some optimization is needed for the upsert (I have asked that as another question altogether).

Thanks,
Sumit
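A minimal sketch of the measurement Ewan suggests, in Scala, assuming a Spark 2.x SparkSession; the paths, the df stand-in, and the timing helper are illustrative assumptions, not code from the thread:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("write-timing").getOrCreate()

  // Hypothetical paths; hdfspathTemp mirrors the name used in the thread.
  val hdfsPathIn   = "hdfs:///data/source"
  val hdfspathTemp = "hdfs:///data/tmp_out"

  // Small helper that prints how long a block takes to run.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
    result
  }

  // Stand-in for whatever DataFrame the real job builds before writing.
  val df = spark.read.parquet(hdfsPathIn)

  // 1) Force the full execution plan without producing any output.
  time("count only") { df.count() }

  // 2) Force the plan again, this time including the parquet write.
  time("parquet write") {
    df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
  }

If the count alone already takes most of the 1.8 hours, the cost is in the upstream transformations rather than in the parquet writer.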
Re: how to save spark files as parquets efficiently
Hey Gourav,

Well, I think it is my execution plan that is at fault. df.write shows up as a Spark job on localhost:4040/ and, being an action, will include the time taken for all the umpteen transformations before it, right? All I wanted to know is what env/config params are needed to simply read a dataframe from parquet and save it back as another parquet (a vanilla load/store, no transformations). Is it good enough to simply read and write in the form given in the Spark tutorial docs, i.e.

df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ??

Thanks,

On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta wrote:

> Hi,
>
> The default write format in Spark is parquet, and I have never faced any issues writing over a billion records in Spark. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron?
>
> Regards,
> Gourav Sengupta
>
> On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna wrote:
>
>> Hey,
>>
>> master=yarn
>> mode=cluster
>>
>> spark.executor.memory=8g
>> spark.rpc.netty.dispatcher.numThreads=2
>>
>> This is all a POC on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:
>>
>> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>>
>> No doubt the whole execution plan gets triggered on this write/save action, but is it the right command / set of params to save a dataframe?
>>
>> Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure if the write itself takes that much time or some optimization is needed for the upsert (I have asked that as another question altogether).
>>
>> Thanks,
>> Sumit
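For the vanilla load/store case Sumit asks about, the tutorial-style calls really are the whole job; a rough sketch in Scala with hypothetical paths, assuming nothing beyond a working Spark 2.x session:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("parquet-roundtrip").getOrCreate()

  // Plain read-then-write with no transformations in between; paths are hypothetical.
  val df = spark.read.parquet("hdfs:///data/source")
  df.write.format("parquet").mode("overwrite").save("hdfs:///data/tmp_out")

  // Shorthand that does the same thing:
  // df.write.mode("overwrite").parquet("hdfs:///data/tmp_out")

No special env/config parameters are needed for this; if a plain round trip like this is fast while the real job is not, the time is going into the transformations the write action triggers, not into the save itself.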
Re: how to save spark files as parquets efficiently
Hi,

The default write format in Spark is parquet, and I have never faced any issues writing over a billion records in Spark. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron?

Regards,
Gourav Sengupta

On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna wrote:

> Hey,
>
> master=yarn
> mode=cluster
>
> spark.executor.memory=8g
> spark.rpc.netty.dispatcher.numThreads=2
>
> This is all a POC on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>
> No doubt the whole execution plan gets triggered on this write/save action, but is it the right command / set of params to save a dataframe?
>
> Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure if the write itself takes that much time or some optimization is needed for the upsert (I have asked that as another question altogether).
>
> Thanks,
> Sumit
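To illustrate Gourav's point about the default format: save() with no explicit format falls back to spark.sql.sources.default, which is "parquet" out of the box, so the calls below are interchangeable. This is a sketch only; df and hdfspathTemp stand for whatever DataFrame and output path the job actually uses:

  // Equivalent ways of writing df to hdfspathTemp as parquet, assuming
  // spark.sql.sources.default has been left at its default value ("parquet").
  df.write.mode("overwrite").save(hdfspathTemp)
  df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
  df.write.mode("overwrite").parquet(hdfspathTemp)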
Re: how to save spark files as parquets efficiently
Hey,

So I believe this is the right way to save the file; if any optimization is needed, it is not in the write itself but in the head/body of my execution plan, isn't it?

Thanks,

On Fri, Jul 29, 2016 at 11:57 AM, Sumit Khanna wrote:

> Hey,
>
> master=yarn
> mode=cluster
>
> spark.executor.memory=8g
> spark.rpc.netty.dispatcher.numThreads=2
>
> This is all a POC on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>
> No doubt the whole execution plan gets triggered on this write/save action, but is it the right command / set of params to save a dataframe?
>
> Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure if the write itself takes that much time or some optimization is needed for the upsert (I have asked that as another question altogether).
>
> Thanks,
> Sumit
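Since the upsert is where most of the execution plan lives, here is one common way to express it, purely as a sketch: the key column "id", the paths, and the Spark 2.x session are assumptions, not Sumit's actual code. The idea is to keep every delta row plus only those existing rows whose key is absent from the delta, then write the result to a temporary path:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("upsert-sketch").getOrCreate()

  // Hypothetical locations and key column; both datasets are assumed to share
  // the same schema and column order.
  val existing = spark.read.parquet("hdfs:///data/current")  // snapshot already on HDFS
  val delta    = spark.read.parquet("hdfs:///data/delta")    // changes from the current run

  // Existing rows whose key does NOT appear in the delta, then all delta rows.
  val unchanged = existing.join(delta.select("id"), Seq("id"), "left_anti")
  val upserted  = delta.union(unchanged)

  // Write to a temporary location: overwriting the directory being read within
  // the same job is unsafe, which may be why the thread writes to hdfspathTemp.
  upserted.write.format("parquet").mode("overwrite").save("hdfs:///data/tmp_out")

On a single node, the shuffle for a join like this over the full snapshot is a far more likely home for the 1.8 hours than the parquet write that follows it.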