Re: Spark Dataframe: Save to hdfs is taking long time
Try setting the number of partitions to (number of executors * number of
cores) while writing to the destination. Be careful when choosing this
number: an incorrect value can trigger an expensive shuffle.

On Fri, Dec 16, 2016 at 12:56 PM, KhajaAsmath Mohammed
<mdkhajaasm...@gmail.com> wrote:
> I am trying to save the files as Parquet.
>
> On Thu, Dec 15, 2016 at 10:41 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>> What is the format?
>>
>> From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>> Sent: Thursday, December 15, 2016 7:54:27 PM
>> To: user @spark
>> Subject: Spark Dataframe: Save to hdfs is taking long time
>>
>> Hi,
>>
>> I am facing an issue while saving the dataframe back to HDFS. It's
>> taking a long time to run.
>>
>> val results_dataframe = sqlContext.sql("select gt.*,ct.* from
>> PredictTempTable pt,ClusterTempTable ct,GamificationTempTable gt where
>> gt.vin=pt.vin and pt.cluster=ct.cluster")
>> results_dataframe.coalesce(numPartitions)
>> results_dataframe.persist(StorageLevel.MEMORY_AND_DISK)
>>
>> dataFrame.write.mode(saveMode).format(format)
>>   .option(Codec, compressCodec) // "org.apache.hadoop.io.compress.snappyCodec"
>>   .save(outputPath)
>>
>> It is taking a long time, and the total number of records in this
>> dataframe is 4,903,764.
>>
>> I even increased the number of partitions from 10 to 20, still no luck.
>> Can anyone help me resolve this performance issue?
>>
>> Thanks,
>> Asmath

--
Thanks,
Raju Bairishetti,
www.lazada.com
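The sizing advice above (partitions = executors * cores) can be sketched as
follows. This is a minimal illustration, not the poster's actual job: the
executor counts, the `spark` session, `resultsDataframe`, and `outputPath`
are all assumed names/values for the sake of the example.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Assumed cluster shape, for illustration only.
val numExecutors = 10
val coresPerExecutor = 4
// One write task per available core keeps every executor busy.
val numPartitions = numExecutors * coresPerExecutor // 40 write tasks

// repartition() shuffles to produce evenly sized partitions; coalesce()
// only merges partitions downward and avoids a full shuffle. Both return
// a NEW DataFrame -- they never modify the receiver in place.
val balanced: DataFrame = resultsDataframe.repartition(numPartitions)
balanced.write.mode(SaveMode.Overwrite).parquet(outputPath)
```

Whether `repartition` (even sizes, extra shuffle) or `coalesce` (cheaper,
possibly skewed) is the better choice depends on how skewed the join output
is.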
Re: Spark Dataframe: Save to hdfs is taking long time
I am trying to save the files as Parquet.

On Thu, Dec 15, 2016 at 10:41 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> What is the format?
>
> From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
> Sent: Thursday, December 15, 2016 7:54:27 PM
> To: user @spark
> Subject: Spark Dataframe: Save to hdfs is taking long time
>
> Hi,
>
> I am facing an issue while saving the dataframe back to HDFS. It's
> taking a long time to run.
>
> val results_dataframe = sqlContext.sql("select gt.*,ct.* from
> PredictTempTable pt,ClusterTempTable ct,GamificationTempTable gt where
> gt.vin=pt.vin and pt.cluster=ct.cluster")
> results_dataframe.coalesce(numPartitions)
> results_dataframe.persist(StorageLevel.MEMORY_AND_DISK)
>
> dataFrame.write.mode(saveMode).format(format)
>   .option(Codec, compressCodec) // "org.apache.hadoop.io.compress.snappyCodec"
>   .save(outputPath)
>
> It is taking a long time, and the total number of records in this
> dataframe is 4,903,764.
>
> I even increased the number of partitions from 10 to 20, still no luck.
> Can anyone help me resolve this performance issue?
>
> Thanks,
> Asmath
Re: Spark Dataframe: Save to hdfs is taking long time
What is the format?

From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
Sent: Thursday, December 15, 2016 7:54:27 PM
To: user @spark
Subject: Spark Dataframe: Save to hdfs is taking long time

Hi,

I am facing an issue while saving the dataframe back to HDFS. It's taking
a long time to run.

val results_dataframe = sqlContext.sql("select gt.*,ct.* from
PredictTempTable pt,ClusterTempTable ct,GamificationTempTable gt where
gt.vin=pt.vin and pt.cluster=ct.cluster")
results_dataframe.coalesce(numPartitions)
results_dataframe.persist(StorageLevel.MEMORY_AND_DISK)

dataFrame.write.mode(saveMode).format(format)
  .option(Codec, compressCodec) // "org.apache.hadoop.io.compress.snappyCodec"
  .save(outputPath)

It is taking a long time, and the total number of records in this
dataframe is 4,903,764.

I even increased the number of partitions from 10 to 20, still no luck.
Can anyone help me resolve this performance issue?

Thanks,
Asmath
Spark Dataframe: Save to hdfs is taking long time
Hi,

I am facing an issue while saving the dataframe back to HDFS. It's taking
a long time to run.

val results_dataframe = sqlContext.sql("select gt.*,ct.* from
PredictTempTable pt,ClusterTempTable ct,GamificationTempTable gt where
gt.vin=pt.vin and pt.cluster=ct.cluster")
results_dataframe.coalesce(numPartitions)
results_dataframe.persist(StorageLevel.MEMORY_AND_DISK)

dataFrame.write.mode(saveMode).format(format)
  .option(Codec, compressCodec) // "org.apache.hadoop.io.compress.snappyCodec"
  .save(outputPath)

It is taking a long time, and the total number of records in this
dataframe is 4,903,764.

I even increased the number of partitions from 10 to 20, still no luck.
Can anyone help me resolve this performance issue?

Thanks,
Asmath
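[Editorial note on the code above: the return value of coalesce is
discarded. DataFrames are immutable, so coalesce returns a new DataFrame
rather than modifying the receiver, and the subsequent write still runs
with the original partitioning. A minimal corrected sketch is below; the
`sqlContext`, `numPartitions`, and `outputPath` names come from the
original post, and writing snappy-compressed Parquet via the built-in
Parquet writer is an assumption based on the poster's stated format.]

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel

val results = sqlContext.sql(
  """select gt.*, ct.*
    |from PredictTempTable pt, ClusterTempTable ct, GamificationTempTable gt
    |where gt.vin = pt.vin and pt.cluster = ct.cluster""".stripMargin)

// coalesce returns a NEW DataFrame; capture it instead of discarding it.
val coalesced = results.coalesce(numPartitions)

// persist only pays off if the result is reused by more than one action;
// for a single write it just adds overhead.
coalesced.persist(StorageLevel.MEMORY_AND_DISK)

coalesced.write
  .mode(SaveMode.Overwrite)
  .option("compression", "snappy") // Parquet writer's compression option
  .parquet(outputPath)
```

With the coalesced DataFrame actually used, the write runs with
numPartitions tasks instead of the join's default shuffle partition count.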