Hi Antoaneta,
I believe the difference is not due to Datasets being slower (a DataFrame
is now just an alias for Dataset[Row]), but to filtering with a
user-defined function instead of Spark's built-in expressions. The
built-in expression can take advantage of Project Tungsten optimizations,
such as deserializing only the "event_type" column. The user-defined
function, on the other hand, is called with a full case class, so the
whole object has to be deserialized for every row.
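
If that is right, one workaround is to do the filtering with a built-in
Column expression first and only then switch to the typed view, so Catalyst
can still prune columns and push the predicate into the Parquet scan. A
rough sketch (untested; it reuses the data, eventNames and CustomEvent
names from your snippet below):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val session = SparkSession.builder.appName("events").getOrCreate()
    import session.implicits._  // provides the encoder for CustomEvent

    val data = session.read.parquet("events_data/")

    // The built-in predicate can be pushed down to the Parquet scan, so
    // only the "event_type" column is read at this stage. Only the rows
    // that survive the filter are deserialized into CustomEvent objects.
    val matching = data
      .filter(col("event_type").isin(eventNames: _*))
      .as[CustomEvent]

    println(matching.count())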

Well, that is my understanding from reading some slides. Hopefully someone
with more direct knowledge of the code will correct me if I'm wrong.
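
One way to check this is to compare the physical plans with explain();
something along these lines (again, just a sketch):

    // The untyped filter should list the predicate under PushedFilters
    // in the Parquet scan node of the plan.
    data.filter(col("event_type").isin(eventNames: _*)).explain()

    // The typed filter should instead show a DeserializeToObject step,
    // meaning every row is materialized as a full CustomEvent first.
    data.as[CustomEvent]
      .filter(event => eventNames.contains(event.event_type.get))
      .explain()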

On Mon, Oct 24, 2016 at 2:50 PM, Antoaneta Marinova <
antoaneta.vmarin...@gmail.com> wrote:

> Hello,
>
> I am using Spark 2.0 to perform filtering, grouping and counting
> operations on event data saved in Parquet files. As the events schema has
> a very nested structure, I wanted to read the events as Scala case
> classes to simplify the code, but I noticed a severe performance
> degradation. Below is a simple comparison of the same operation using a
> DataFrame and a Dataset.
>
> val data = session.read.parquet("events_data/")
>
> // Using Datasets with a custom class
>
> // Case class matching the events schema
> case class CustomEvent(event_id: Option[String],
>                        event_type: Option[String],
>                        context: Option[Context],
>                        ….
>                        time: Option[BigInt]) extends Serializable {}
>
> scala> val start = System.currentTimeMillis
>        val count = data.as[CustomEvent].filter(event =>
>          eventNames.contains(event.event_type.get)).count
>        val time = System.currentTimeMillis - start
>
> count: Long = 5545
>
> time: Long = 11262
>
> // Using DataFrames
>
> scala> val start = System.currentTimeMillis
>        val count = data.filter(col("event_type").isin(eventNames: _*)).count
>        val time = System.currentTimeMillis - start
>
> count: Long = 5545
>
> time: Long = 147
>
> The schema of the events is something like this:
>
> // events schema
>
> root
> |-- event_id: string (nullable = true)
> |-- event_type: string (nullable = true)
> |-- context: struct (nullable = true)
> |    |-- environment_1: struct (nullable = true)
> |    |    |-- filed1: integer (nullable = true)
> |    |    |-- filed2: integer (nullable = true)
> |    |    |-- filed3: integer (nullable = true)
> |    |-- environment_2: struct (nullable = true)
> |    |    |-- filed_1: string (nullable = true)
> ....
> |    |    |-- filed_n: string (nullable = true)
> |-- metadata: struct (nullable = true)
> |    |-- elements: array (nullable = true)
> |    |    |-- element: struct (containsNull = true)
> |    |    |    |-- config: string (nullable = true)
> |    |    |    |-- tree: array (nullable = true)
> |    |    |    |    |-- element: struct (containsNull = true)
> |    |    |    |    |    |-- path: array (nullable = true)
> |    |    |    |    |    |    |-- element: struct (containsNull = true)
> |    |    |    |    |    |    |    |-- key: string (nullable = true)
> |    |    |    |    |    |    |    |-- name: string (nullable = true)
> |    |    |    |    |    |    |    |-- level: integer (nullable = true)
> |-- time: long (nullable = true)
>
> Could you please advise me on the usage of the different abstractions
> and help me understand why using Datasets with a user-defined class is so
> much slower?
>
> Thank you,
> Antoaneta
>
