Re: RDD vs Dataset performance

2016-07-28 Thread Reynold Xin
The performance difference comes from the need to serialize and
deserialize the data to AnnotationText objects. The extra stage is probably
very quick and shouldn't have much impact.

If you try caching the RDD in serialized mode, it will slow down a lot
too.
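
For example (a minimal sketch, reusing the names from your mail), caching the
RDD in serialized form makes every action pay a deserialization cost, just as
the cached Dataset does:

import org.apache.spark.storage.StorageLevel

// Cache the RDD serialized instead of as deserialized Java objects.
// Each action must now deserialize the records before filtering, so the
// timing gap between the RDD and Dataset versions should narrow.
val serRDD = ds.rdd.repartition(4).persist(StorageLevel.MEMORY_ONLY_SER)
serRDD.count  // materialize the cache
serRDD.filter(_.text == "mutational").count

Serialized caching should also shrink the cached size considerably, though
typically not all the way down to the Tungsten-encoded Dataset's footprint.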


On Thu, Jul 28, 2016 at 9:52 AM, Darin McBeath wrote:

> I started playing around with Datasets on Spark 2.0 this morning, and I'm
> surprised by the significant performance difference I'm seeing between an
> RDD and a Dataset for a very basic example.
>
>
> I've defined a simple case class called AnnotationText that has a handful
> of fields.
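>
> Something like the following (the field names here are simplified, not my
> exact class):
>
> // simplified sketch of the case class; the real one has a few more fields
> case class AnnotationText(docId: Long, text: String, startOffset: Int, endOffset: Int)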
>
>
> I create a Dataset[AnnotationText] with my data, repartition(4) it on one
> of the columns, and cache the resulting Dataset as ds (forcing the cache
> by executing a count action).  Everything looks good, and I have more than
> 10M records in my dataset ds.
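>
> Roughly, it's built like this (the path and column name below are
> placeholders):
>
> import spark.implicits._
>
> // load the data, repartition on one of the columns, and cache the result
> val source = spark.read.parquet("/path/to/annotations").as[AnnotationText]
> val ds = source.repartition(4, $"docId").cache()
> ds.count()  // force the cache by executing a count action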
>
> When I execute the following:
>
> ds.filter(textAnnotation => textAnnotation.text == "mutational".toLowerCase).count
>
> It consistently finishes in just under 3 seconds.  One of the things I
> notice is that it has 3 stages.  The first stage is skipped (as it had to
> do with the creation of ds, which was already cached).  The second stage
> appears to do the filtering (requires 4 tasks) but, interestingly, it
> shuffles output.  The third stage (requires only 1 task) appears to count
> the results of the shuffle.
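>
> If I print the plan for an equivalent aggregation, the exchange shows up
> explicitly (assuming I'm reading the plan correctly):
>
> // count on a Dataset runs as a partial aggregation, a shuffle (exchange),
> // and a final aggregation; groupBy().count() makes this visible in the plan
> ds.filter(textAnnotation => textAnnotation.text == "mutational".toLowerCase)
>   .groupBy().count().explain()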
>
> When I look at the cached Dataset (on 4 partitions) it is 82.6 MB.
>
> I then decided to convert the ds Dataset to an RDD, repartition(4) it, and
> cache it, as follows:
>
> val aRDD = ds.rdd.repartition(4).cache
> aRDD.count
>
> So, I now have an RDD[AnnotationText].
>
> When I execute the following:
>
> aRDD.filter(textAnnotation => textAnnotation.text == "mutational".toLowerCase).count
>
> It consistently finishes in just under half a second.  One of the things I
> notice is that it has only 2 stages.  The first stage is skipped (as it
> had to do with the creation of aRDD, which was already cached).  The second
> stage appears to do the filtering and the count (requires 4 tasks).
> Interestingly, there is no shuffle (and no subsequent third stage).
>
> When I look at the cached RDD (on 4 partitions) it is 2.9 GB.
>
>
> I was surprised by how significant the cached storage difference was
> between the Dataset (82.6 MB) and RDD (2.9 GB) versions of the same
> content.  Is this kind of difference to be expected?
>
> While I like the smaller size of the Dataset version, I was confused as to
> why its performance was so much slower (2.5s vs 0.5s).  I suspect it might
> be attributable to the shuffle and third stage required by the Dataset
> version, but I'm not sure.  I was under the impression that Datasets would
> be faster in many use cases (such as the one above).  Am I doing something
> wrong, or is this to be expected?
>
> Thanks.
>
> Darin.
>