Yes, DataFrames are for much more than SQL, and I would recommend using them wherever possible. It is much easier for us to do optimizations when we have more information about the schema of your data, so most of our ongoing optimization effort will focus on making DataFrames faster.
On Thu, Jun 11, 2015 at 10:08 AM, Tom Hubregtsen <thubregt...@gmail.com> wrote:
> I've looked a bit into what DataFrames are, and it seems that most posts on
> the subject are related to SQL, but the API does seem to be very efficient. My
> main question is: are DataFrames also beneficial for non-SQL computations?
>
> For instance, I want to:
> - sort k/v pairs (in particular, the naive versus the cache-aware layout)
> - perform some arbitrary map-reduce instructions
>
> I am wondering this because I read about the *naive vs cache-aware layout*, and
> also read the following on the Databricks blog:
> "The first pieces will land in Spark 1.4, which includes explicitly managed
> memory for aggregation operations *in Spark's DataFrame API* as well as
> customized serializers. Expanded coverage of binary memory management and
> cache-aware data structures will appear in Spark 1.5."
> This leads me to believe that the cache-aware layout, which also seems
> beneficial for regular computation/sorting, is (currently?) only implemented in
> DataFrames, and it makes me wonder whether I should just use DataFrames in
> my "regular" computation.
>
> Thanks in advance,
>
> Tom
>
> P.S. I am currently using the master branch from GitHub.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/DataFrames-for-non-SQL-computation-tp23281.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.