Hello,

I built a prototype that uses join and groupBy operations via the Spark RDD
API. Recently I migrated it to the Dataset API, and it now runs much slower
than the original RDD implementation. Did I do something wrong here,
or is this the price I have to pay for the more convenient API?
Is there a known way to deal with this effect (e.g. configuration via
"spark.sql.shuffle.partitions", and if so, how would I determine the correct
value)?
In my prototype I use Java Beans with a lot of attributes. Does this slow
down Spark operations on Datasets?
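
On the "spark.sql.shuffle.partitions" question: the default value is 200,
which can mean many tiny tasks for a small test data set. Below is a minimal
sketch of one possible way to pick a starting value. The helper
estimatePartitions, the 128 MB target partition size and the "at least one
partition per core" floor are assumptions for illustration (a common rule of
thumb), not part of the Spark API and not verified against this prototype.

```java
public class ShufflePartitionEstimate {

    /**
     * Hypothetical helper (not part of Spark): rule-of-thumb estimate for
     * spark.sql.shuffle.partitions.
     *
     * @param shuffleBytes         rough size of the data being shuffled
     * @param targetPartitionBytes assumed target size per partition
     * @param totalCores           cores available to the job
     */
    public static int estimatePartitions(long shuffleBytes,
                                         long targetPartitionBytes,
                                         int totalCores) {
        if (shuffleBytes < 0 || targetPartitionBytes <= 0 || totalCores <= 0) {
            throw new IllegalArgumentException("sizes and core count must be positive");
        }
        // Ceiling division: enough partitions to keep each one near the target size.
        long bySize = (shuffleBytes + targetPartitionBytes - 1) / targetPartitionBytes;
        // Never go below one partition per core, so no core sits idle.
        long result = Math.max((long) totalCores, bySize);
        return (int) Math.min(result, (long) Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Small local test: 50 MB shuffled, 4 cores.
        int small = estimatePartitions(50 * mb, 128 * mb, 4);
        // Larger job: 10 GB shuffled, 4 cores.
        int large = estimatePartitions(10240 * mb, 128 * mb, 4);
        System.out.println("small job: spark.sql.shuffle.partitions=" + small);
        System.out.println("large job: spark.sql.shuffle.partitions=" + large);
    }
}
```

The resulting number is only a starting point for measurement. It could be
passed to spark-submit as --conf spark.sql.shuffle.partitions=<n>, or set on
the SparkSession builder via config("spark.sql.shuffle.partitions", "<n>").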

Here is a simple example that shows the difference (see attached
file: JoinGroupByTest.zip):
- I build 2 RDDs, then join and group them. Afterwards I count and display
the joined RDDs (method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD()).
- When I do the same with Datasets, it takes approximately 40 times
as long (method de.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).

Thank you very much for your help.
Matthias

PS: See the appended screenshots taken from Spark UI (jobs 0/1 belong to
RDD implementation, jobs 2/3 to Dataset):







2D782357.gif (62K) 
<http://apache-spark-user-list.1001560.n3.nabble.com/attachment/27449/0/2D782357.gif>
2D546574.gif (98K) 
<http://apache-spark-user-list.1001560.n3.nabble.com/attachment/27449/1/2D546574.gif>
2D310440.gif (126K) 
<http://apache-spark-user-list.1001560.n3.nabble.com/attachment/27449/2/2D310440.gif>
JoinGroupByTest.zip (5K) 
<http://apache-spark-user-list.1001560.n3.nabble.com/attachment/27449/3/JoinGroupByTest.zip>



