I optimized a Spark SQL script but have come to the conclusion that the SQL API is not ideal here, since the tasks it generates are slow and require too much shuffling.
So the script should be converted to the RDD API: http://stackoverflow.com/q/41445571/2587904

How can I formulate this more efficiently using the RDD API? aggregateByKey seems like a good fit, but it is still not clear to me how to apply it here as a substitute for the window functions.

Cheers,
Georg
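P.S. To make the question more concrete, below is a rough sketch of the kind of aggregateByKey rewrite I have in mind for one of the window aggregates (a per-partition max). The (id, (timestamp, value)) layout, the toy data and all names are made up for illustration, not my actual schema:

import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("aggregateByKey-sketch"))

    // Toy rows standing in for my real data: (id, (timestamp, value)).
    val rows = sc.parallelize(Seq(
      ("a", (1L, 10.0)), ("a", (2L, 12.0)), ("a", (3L, 9.0)),
      ("b", (1L, 5.0)),  ("b", (2L, 7.0))
    ))

    // Something like "max(value) over (partition by id)" expressed with
    // aggregateByKey: the zero value, the in-partition fold and the
    // cross-partition merge each keep only one Double per key, so the
    // shuffle moves small partial aggregates instead of whole rows.
    val maxPerId = rows.aggregateByKey(Double.MinValue)(
      (acc, tsValue) => math.max(acc, tsValue._2), // fold rows within a partition
      (left, right)  => math.max(left, right)      // merge partial maxima across partitions
    )

    maxPerId.collect().foreach(println)
    sc.stop()
  }
}

For lag/lead-style window functions I assume the accumulator would instead have to collect the values per key (e.g. into a sorted buffer) before computing the result, which keeps a whole group in memory, so I am not sure that part actually wins over the SQL plan.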