Hi Antoaneta,

I believe the difference is not due to Datasets being slower (a DataFrame is just an alias for Dataset[Row] now), but rather to filtering with a user-defined function versus a Spark built-in expression. The built-in can use tricks from Project Tungsten, such as deserializing only the "event_type" column. The user-defined function, on the other hand, has to be called with a full case class, so the whole object needs to be deserialized for each row.
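If you want to keep the typed API, one workaround (just a sketch, assuming the `data`, `eventNames`, `session` and `CustomEvent` definitions from your message below, plus `import session.implicits._` for the encoder) is to do the filtering with a built-in Column expression first and only then switch to the typed view:

    import org.apache.spark.sql.functions.col

    // The Column-based filter stays inside Catalyst, so Tungsten can read
    // just the "event_type" column from the Parquet files; .as[CustomEvent]
    // only changes the typed view and does not by itself force full-object
    // deserialization for the count.
    val count = data
      .filter(col("event_type").isin(eventNames: _*))
      .as[CustomEvent]
      .count()

You can also compare the two physical plans with .explain(): the lambda version should show an extra object-deserialization step (DeserializeToObject) before the filter, while the Column version should not.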
Well, that is my understanding from reading some slides. Hopefully someone
with more direct knowledge of the code will correct me if I'm wrong.

On Mon, Oct 24, 2016 at 2:50 PM, Antoaneta Marinova <
antoaneta.vmarin...@gmail.com> wrote:

> Hello,
>
> I am using Spark 2.0 to perform filtering, grouping and counting
> operations on event data saved in Parquet files. As the events schema has
> a very nested structure, I wanted to read the events as Scala beans to
> simplify the code, but I noticed a severe performance degradation. Below
> is a simple comparison of the same operation using a DataFrame and a
> Dataset.
>
> val data = session.read.parquet("events_data/")
>
> // Using a Dataset with a custom class
>
> // Case class matching the events schema
> case class CustomEvent(event_id: Option[String],
>                        event_type: Option[String],
>                        context: Option[Context],
>                        ….
>                        time: Option[BigInt]) extends Serializable {}
>
> scala> val start = System.currentTimeMillis ;
> val count = data.as[CustomEvent].filter(event =>
>   eventNames.contains(event.event_type.get)).count ;
> val time = System.currentTimeMillis - start
>
> count: Long = 5545
> time: Long = 11262
>
> // Using a DataFrame
>
> scala> val start = System.currentTimeMillis ;
> val count = data.filter(col("event_type").isin(eventNames:_*)).count ;
> val time = System.currentTimeMillis - start
>
> count: Long = 5545
> time: Long = 147
>
> The schema of the events is something like this:
>
> // events schema
> root
>  |-- event_id: string (nullable = true)
>  |-- event_type: string (nullable = true)
>  |-- context: struct (nullable = true)
>  |    |-- environment_1: struct (nullable = true)
>  |    |    |-- filed1: integer (nullable = true)
>  |    |    |-- filed2: integer (nullable = true)
>  |    |    |-- filed3: integer (nullable = true)
>  |    |-- environment_2: struct (nullable = true)
>  |    |    |-- filed_1: string (nullable = true)
> ....
>  |    |    |-- filed_n: string (nullable = true)
>  |-- metadata: struct (nullable = true)
>  |    |-- elements: array (nullable = true)
>  |    |    |-- element: struct (containsNull = true)
>  |    |    |    |-- config: string (nullable = true)
>  |    |    |    |-- tree: array (nullable = true)
>  |    |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |    |-- path: array (nullable = true)
>  |    |    |    |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |    |    |    |-- key: string (nullable = true)
>  |    |    |    |    |    |    |    |-- name: string (nullable = true)
>  |    |    |    |    |    |    |    |-- level: integer (nullable = true)
>  |-- time: long (nullable = true)
>
> Could you please advise me on the usage of the different abstractions and
> help me understand why using a Dataset with a user-defined class is so
> much slower.
>
> Thank you,
> Antoaneta
>