Hey,

So I believe this is the right way to save the file, in that the optimization is never needed in the write step itself, but rather in the head / body of my execution plan, isn't it?
Thanks,

On Fri, Jul 29, 2016 at 11:57 AM, Sumit Khanna <sumit.kha...@askme.in> wrote:
> Hey,
>
> master=yarn
> mode=cluster
>
> spark.executor.memory=8g
> spark.rpc.netty.dispatcher.numThreads=2
>
> This is all a POC on a single-node cluster. The biggest bottleneck is:
>
> 1.8 hrs to save 500k records as a parquet file/dir when executing this command:
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>
> No doubt the whole execution plan gets triggered on this write/save
> action. But is this the right command / set of params to save a DataFrame?
>
> Essentially I am doing an upsert by pulling in the existing data from HDFS and
> then updating it with the delta changes of the current run. But I am not sure
> whether the write itself takes that much time or whether some optimization is
> needed for the upsert. (I have asked that as a separate question altogether.)
>
> Thanks,
> Sumit
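For context, a minimal sketch of the upsert-then-overwrite pattern described in the quoted message, assuming Spark 2.x and Scala; the HDFS paths, the "id" key column, and the SparkSession setup are illustrative placeholders, not details from the original job:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("upsert-poc").getOrCreate()

    // Existing snapshot already sitting on HDFS (hypothetical path).
    val existing = spark.read.parquet("hdfs:///data/snapshot")

    // Delta records produced by the current run (hypothetical path);
    // assumed to have the same schema as the existing snapshot.
    val delta = spark.read.parquet("hdfs:///data/delta")

    // Upsert: drop every existing row whose key also appears in the delta
    // (left_anti join, available in Spark 2.x), then union the delta rows in.
    // "id" is an assumed key column.
    val upserted = existing
      .join(delta.select("id"), Seq("id"), "left_anti")
      .union(delta)

    // Write to a temporary path, analogous to hdfspathTemp in the thread, so the
    // snapshot being read is not overwritten mid-job; the full plan (reads, join,
    // union) only executes when this action runs.
    upserted.write.format("parquet").mode("overwrite").save("hdfs:///data/snapshot_tmp")

As the sketch illustrates, the write/save action is simply what triggers the plan; the time is spent executing the upstream joins and shuffles, not in the parquet writer itself.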