> bq. In many cases, the current implementation of the Dataset API does not
> yet leverage the additional information it has and can be slower than RDDs.
>
> Are the characteristics of cases above known so that users can decide which
> API to use ?

Lots of back-to-back operations aren't great yet because we serialize and deserialize unnecessarily. For example: https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/DatasetPerformance.scala#L37

> For custom encoders, I did a quick search but didn't find the JIRA number.
> Can you share the JIRA number ?

This is probably the closest thing: https://issues.apache.org/jira/browse/SPARK-7768
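
To make the back-to-back case concrete, here is a minimal sketch of the kind of pipeline I mean. It is not the linked benchmark itself. The `Data` case class, the object name, the row count, and the use of `SparkSession` (which assumes Spark 2.x or later) are all assumptions for illustration. Whether the Dataset version is actually slower, and by how much, depends on the Spark version you run, since the optimizer may remove some of the redundant conversions.

```scala
import org.apache.spark.sql.SparkSession

object BackToBackMaps {
  // Hypothetical record type, used only for this illustration.
  // The field is named "id" to match the column produced by spark.range.
  case class Data(id: Long)

  def main(args: Array[String]): Unit = {
    // Assumes Spark 2.x or later; local mode is used so the sketch runs standalone.
    val spark = SparkSession.builder()
      .appName("back-to-back-maps")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val numRows = 1000000L // arbitrary size chosen for the example

    // Dataset version: each typed map consumes and produces Data objects,
    // so a chain of maps may convert between Spark's internal row format
    // and JVM objects at every step.
    val dsSum = spark.range(numRows).as[Data]
      .map(d => Data(d.id + 1))
      .map(d => Data(d.id + 1))
      .map(d => Data(d.id + 1))
      .map(_.id)
      .reduce(_ + _)

    // RDD version: the same chain works on JVM objects throughout,
    // with no encoder-based conversion between the maps.
    val rddSum = spark.sparkContext.range(0L, numRows)
      .map(Data(_))
      .map(d => Data(d.id + 1))
      .map(d => Data(d.id + 1))
      .map(d => Data(d.id + 1))
      .map(_.id)
      .reduce(_ + _)

    // Both pipelines compute the same value; only the execution path differs.
    println(s"dataset sum = $dsSum, rdd sum = $rddSum")

    spark.stop()
  }
}
```

Timing the two pipelines on your own data and Spark version is the most reliable way to decide which API to use for a chain like this.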
