Sorry, hit the send key a bit too early. Anyway, this is the code I set:

sqlContext.sql("set spark.sql.shuffle.partitions=10");

From: Darin McBeath <ddmcbe...@yahoo.com>
To: Darin McBeath <ddmcbe...@yahoo.com>; User <user@spark.apache.org>
Sent: Wednesday, October 29, 2014 2:47 PM
Subject: Re: Spark SQL and confused about number of partitions/tasks to do a simple join.

Ok. After reading some documentation, it would appear the issue is the default number of shuffle partitions for a join (200). After doing something like the following, I was able to change the value.
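For context, a minimal sketch of how that setting slots in around the join from the original message below. This is illustrative only: it assumes "baseline" and "daily" are already registered as temp tables on a JavaSQLContext (Spark 1.1 Java API), and the class and method names here are made up for the example.

import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

public class ShufflePartitionsSketch {
  public static JavaSchemaRDD joinWithFewerPartitions(JavaSQLContext sqlContext) {
    // Override the default of 200 shuffle partitions before the join runs.
    sqlContext.sql("set spark.sql.shuffle.partitions=10");

    // Same join as in the original message below; the shuffle stage now uses 10 tasks.
    JavaSchemaRDD results = sqlContext.sql(
        "SELECT id, action, daily.epoch, daily.version " +
        "FROM baseline, daily " +
        "WHERE key=id AND action='u' AND daily.epoch > baseline.epoch").cache();

    // Should now report 10 partitions rather than 200.
    System.out.println("partitions: " + results.partitions().size());
    return results;
  }
}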
From: Darin McBeath <ddmcbe...@yahoo.com.INVALID>
To: User <user@spark.apache.org>
Sent: Wednesday, October 29, 2014 1:55 PM
Subject: Spark SQL and confused about number of partitions/tasks to do a simple join.

I have a SchemaRDD with 100 records in 1 partition. We'll call this baseline. I have a SchemaRDD with 11 records in 1 partition. We'll call this daily. After a fairly basic join of these two tables

JavaSchemaRDD results = sqlContext.sql("SELECT id, action, daily.epoch, daily.version FROM baseline, daily WHERE key=id AND action='u' AND daily.epoch > baseline.epoch").cache();

I get a new SchemaRDD, results, with only 6 records (and the RDD has 200 partitions). When the job runs, I can see that 200 tasks were used to do this join. Does this make sense? I'm currently not doing anything special along the lines of partitioning (such as hash partitioning). Even if 200 tasks were required, since the result is only 6 records, shouldn't some of these empty partitions have been 'deleted'? I'm using Apache Spark 1.1 and I'm running this in local mode (local[1]). Any insight would be appreciated.

Thanks.

Darin.
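A minimal sketch of what the observation above looks like in code. This is illustrative only: it assumes a JavaSQLContext named sqlContext with "baseline" and "daily" registered as temp tables, and that the shuffle-partition setting is still at its default.

// With the default spark.sql.shuffle.partitions of 200, the joined result
// carries 200 partitions even though it only holds 6 rows; the empty
// partitions are not dropped automatically.
JavaSchemaRDD results = sqlContext.sql(
    "SELECT id, action, daily.epoch, daily.version " +
    "FROM baseline, daily " +
    "WHERE key=id AND action='u' AND daily.epoch > baseline.epoch").cache();

System.out.println("records:    " + results.count());             // 6
System.out.println("partitions: " + results.partitions().size()); // 200 by default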