Spark joins differ from traditional database joins because of the lack of index support. Spark has to shuffle data between nodes to perform a join, so joins are bound to be much slower than a count, which is just a parallel scan of the data.
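One common way around the shuffle, when one side of the join is small enough to fit in memory (you mention the bigger table is only a few GBs), is a broadcast (map-side) join. A rough sketch with the RDD API — the RDD names and types here are assumptions mirroring your elabTab1/elabTab2, not your actual code:

```scala
// Assume elabTab1, elabTab2: RDD[(String, Record)] as in your map step.
// Collect the smaller side to the driver and broadcast it to every executor.
val smallSide = elabTab2.collectAsMap()
val smallBc = sc.broadcast(smallSide)

// Map-side left outer join: the big table is never shuffled.
// Option[...] in the value position mimics leftOuterJoin's output shape.
val joined = elabTab1.map { case (k, v) =>
  (k, (v, smallBc.value.get(k)))
}
```

This only applies if the collected side genuinely fits in driver and executor memory; otherwise a regular shuffle join is the right tool and the tuning knobs below matter more.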
Still, to ensure that nothing is wrong with the setup, you may want to look at the Spark Task UI, in particular at the Shuffle Read and Shuffle Write metrics.

On Wed, Aug 26, 2015 at 3:08 PM, lucap <luca-pi...@hotmail.it> wrote:
> Hi,
>
> I'm trying to perform an ETL using Spark, but as soon as I start performing
> joins, performance degrades a lot. Let me explain what I'm doing and what I
> have found out so far.
>
> First of all, I'm reading Avro files that are on a Cloudera cluster, using
> commands like this:
>
> val tab1 = sc.hadoopFile("hdfs:///path/to/file",
>   classOf[org.apache.avro.mapred.AvroInputFormat[GenericRecord]],
>   classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]],
>   classOf[org.apache.hadoop.io.NullWritable], 10)
>
> After this, I'm applying some filter functions to the data (to reproduce the
> "where" clauses of the original query) and then using one map per table to
> translate the RDD elements into (key, record) format. Let's say I'm doing
> this:
>
> val elabTab1 = tab1.filter(...).map(....)
>
> It is important to notice that if I do something like elabTab1.first or
> elabTab1.count, the task completes in a short time, roughly comparable to
> Impala's time. Now I need to do the following:
>
> val joined = elabTab1.leftOuterJoin(elabTab2)
>
> Then I tried something like joined.count to test performance, but it
> degraded really a lot (a count on a single table takes about 4 seconds,
> while a count on the joined table takes 12 minutes). I think there's a
> problem with the configuration, but what might it be?
>
> I'll give you some more information:
> 1] Spark is running on YARN on a Cloudera cluster
> 2] I'm starting spark-shell with a command like "spark-shell
>    --executor-cores 5 --executor-memory 10G", which gives the shell approx
>    10 vcores and 25 GB of memory
> 3] The task seems stalled for a long time after the map tasks, with the
>    following message in the console: "Asked to send map output locations
>    for shuffle ... to ..."
> 4] If I open the stderr of the executors, I can read plenty of messages
>    like the following: "Thread ... spilling in-memory map of ... MB to
>    disk", where the MBs are in the order of 300-400
> 5] I tried to raise the number of executors, but the situation didn't seem
>    to change much. I also tried to change the number of splits of the Avro
>    files (currently set to 10), but that didn't seem to change much either
> 6] The tables aren't particularly big; the bigger one should be a few GBs
>
> Regards,
> Luca
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-with-Spark-join-tp24458.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
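One more note on point 4]: the "spilling in-memory map ... to disk" messages suggest the shuffle is running short of memory, and with only 10 input splits each join task is handling a large slice of the data. Two knobs worth experimenting with — the values below are illustrative guesses, not tuned recommendations:

```scala
// On the Spark 1.x line you can give the shuffle more of the executor heap,
// e.g. when launching the shell (illustrative value):
//   spark-shell --executor-cores 5 --executor-memory 10G \
//     --conf spark.shuffle.memoryFraction=0.4

// Independently, pass an explicit partition count to the join so each task's
// in-memory map stays small enough to avoid spilling (200 is a guess):
val joined = elabTab1.leftOuterJoin(elabTab2, 200)
```

More, smaller partitions usually help more than more executors here, because the spill threshold is per-task, not per-executor.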