Great! Common sense is very uncommon.

On Fri, Jul 29, 2016 at 8:26 PM, Ewan Leith <ewan.le...@realitymine.com> wrote:
> If you replace the df.write with
>
> df.count()
>
> in your code, you'll see how much time is taken to process the full
> execution plan without the write output.
>
> That code below looks perfectly normal for writing a parquet file, yes;
> there shouldn't be any tuning needed for "normal" performance.
>
> Thanks,
> Ewan
>
> From: Sumit Khanna [mailto:sumit.kha...@askme.in]
> Sent: 29 July 2016 13:41
> To: Gourav Sengupta <gourav.sengu...@gmail.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: how to save spark files as parquets efficiently
>
> Hey Gourav,
>
> Well, I think it is my execution plan that is at fault. Since df.write is
> an action, the Spark job shown on localhost:4040 will include the time
> taken by all the umpteen transformations before it, right? All I wanted to
> know is what env/config params are needed for something simple: read a
> dataframe from parquet and save it back as another parquet (a vanilla
> load/store, no transformations). Is it good enough to simply read and
> write in the very format mentioned in the Spark tutorial docs, i.e.
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ??
>
> Thanks,
>
> On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> The default write format in Spark is parquet, and I have never faced any
> issues writing over a billion records in Spark. Are you using
> virtualization by any chance, or an obsolete hard disk, or an Intel
> Celeron maybe?
>
> Regards,
> Gourav Sengupta
>
> On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna <sumit.kha...@askme.in> wrote:
>
> Hey,
>
> master=yarn
> mode=cluster
>
> spark.executor.memory=8g
> spark.rpc.netty.dispatcher.numThreads=2
>
> All the POC is on a single-node cluster, the biggest bottleneck being:
>
> 1.8 hrs to save 500k records as a parquet file/dir when executing this command:
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>
> No doubt the whole execution plan gets triggered on this write/save
> action. But is it the right command / set of params to save a dataframe?
>
> Essentially I am doing an upsert: pulling in data from HDFS and then
> updating it with the delta changes of the current run. But I am not sure
> whether the write itself takes that much time, or whether some
> optimization is needed for the upsert. (I have asked that as another
> question altogether.)
>
> Thanks,
> Sumit
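For reference, the configuration Sumit lists would normally be passed on the
command line. A hypothetical spark-submit invocation for those settings; the
class and jar names are placeholders, not from the thread:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.executor.memory=8g \
      --conf spark.rpc.netty.dispatcher.numThreads=2 \
      --class com.example.UpsertJob \
      upsert-job.jar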
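The upsert Sumit describes (base data on HDFS updated with the current run's
delta) is commonly expressed as a left-anti join plus a union. A sketch under
the assumption of Spark 2.0+, a single key column named "id", and hypothetical
paths; none of these names come from the thread:

    // Rows from the base whose key does NOT appear in the delta.
    val base  = spark.read.parquet("hdfs:///data/base")   // hypothetical path
    val delta = spark.read.parquet("hdfs:///data/delta")  // hypothetical path

    val untouched = base.join(delta.select("id"), Seq("id"), "left_anti")
    val upserted  = untouched.union(delta)  // assumes identical column order

    // Write to a temporary path: Spark cannot overwrite a path
    // it is still reading from in the same job.
    upserted.write.format("parquet").mode("overwrite").save("hdfs:///data/base_tmp")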
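Ewan's suggestion can be made concrete with a small timing harness: count()
is an action, so it forces the full execution plan without paying any of the
parquet write cost, and the difference between the two timings approximates
the write cost itself. A minimal sketch, assuming a Spark 2.x SparkSession
and hypothetical HDFS paths:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("time-plan-vs-write").getOrCreate()
    val df = spark.read.parquet("hdfs:///tmp/input.parquet")  // hypothetical path

    def timed[A](label: String)(body: => A): A = {
      val t0 = System.nanoTime()
      val result = body
      println(s"$label: ${(System.nanoTime() - t0) / 1e9} s")
      result
    }

    // Forces the full execution plan, no output written.
    timed("plan only (count)") { df.count() }

    // Runs the same plan again, plus the parquet write.
    timed("plan + write") {
      df.write.format("parquet").mode("overwrite").save("hdfs:///tmp/out.parquet")  // hypothetical
    }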