Hi, DataFrames are more efficient if Tungsten is active as the underlying processing engine (it normally is by default). However, that only speeds up processing; saving, being an IO-bound operation, does not necessarily benefit.
What exactly is slow? The write? You could use myDF.write.save()... However, repartition(1) means that everything is dumped onto one executor, and if there is a lot of data this may lead to network congestion. Better (if the legacy application supports it) is to write each partition to its own file. If your processing is slow, then you need to provide a more concrete example.

Best regards

> On 14 Sep 2016, at 14:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> These intermediate files, what sort of files are they? Are they CSV-type
> files?
>
> I agree that a DF is more efficient than an RDD as it follows a tabular
> format (I assume that is what you mean by "columnar" format). So if you
> read these files in a batch process you may not need to worry too much
> about execution time.
>
> A textFile save is simply a one-to-one mapping from your DF to HDFS. I
> think it is pretty efficient.
>
> Myself, I would do something like the below:
>
> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>> On 14 September 2016 at 12:46, sanat kumar Patnaik
>> <patnaik.sa...@gmail.com> wrote:
>> Hi All,
>>
>> I am writing a batch application using Spark SQL and Dataframes. This
>> application has a bunch of file joins, and there are intermediate points
>> where I need to drop a file for downstream applications to consume.
>> The problem is that all these downstream applications are still on
>> legacy systems, so they still require us to drop them a text file. As
>> you all must know, a Dataframe stores its data in columnar format
>> internally.
>> The only way I found to do this, and it looks awfully slow, is this:
>>
>> myDF = sc.textFile("inputpath").toDF()
>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>
>> Is there any better way to do this?
>>
>> P.S.: The other workaround would be to use RDDs for all my operations.
>> But I am wary of using them, as the documentation says Dataframes are
>> way faster because of the Catalyst engine running behind the scenes.
>>
>> Please suggest if any of you might have tried something similar.