Sorry, hit the send key a bit too early. Anyway, this is the code I set:

sqlContext.sql("set spark.sql.shuffle.partitions=10");

From: Darin McBeath <ddmcbe...@yahoo.com>
To: Darin McBeath <ddmcbe...@yahoo.com>; User <user@spark.apache.org>
Sent: Wednesday, October 29, 2014 2:47 PM
Subject: Re: Spark SQL and confused about number of partitions/tasks to do a simple join.

Ok. After reading some documentation, it would appear the issue is the default number of shuffle partitions for a join (200). After doing something like the following, I was able to change the value.
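For context, a minimal sketch of how that setting slots in around the join from the original message below. This is illustrative only: it assumes "baseline" and "daily" are already registered as temp tables on a JavaSQLContext (Spark 1.1 Java API), and the class and method names here are made up for the example.

import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

public class ShufflePartitionsSketch {
  public static JavaSchemaRDD joinWithFewerPartitions(JavaSQLContext sqlContext) {
    // Override the default of 200 shuffle partitions before the join runs.
    sqlContext.sql("set spark.sql.shuffle.partitions=10");

    // Same join as in the original message below; the shuffle stage now uses 10 tasks.
    JavaSchemaRDD results = sqlContext.sql(
        "SELECT id, action, daily.epoch, daily.version " +
        "FROM baseline, daily " +
        "WHERE key=id AND action='u' AND daily.epoch > baseline.epoch").cache();

    // Should now report 10 partitions rather than 200.
    System.out.println("partitions: " + results.partitions().size());
    return results;
  }
}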
From: Darin McBeath <ddmcbe...@yahoo.com.INVALID>
To: User <user@spark.apache.org>
Sent: Wednesday, October 29, 2014 1:55 PM
Subject: Spark SQL and confused about number of partitions/tasks to do a simple join.

I have a SchemaRDD with 100 records in 1 partition. We'll call this baseline. I have a SchemaRDD with 11 records in 1 partition. We'll call this daily. After a fairly basic join of these two tables

JavaSchemaRDD results = sqlContext.sql("SELECT id, action, daily.epoch, daily.version FROM baseline, daily WHERE key=id AND action='u' AND daily.epoch > baseline.epoch").cache();

I get a new SchemaRDD, results, with only 6 records (and the RDD has 200 partitions). When the job runs, I can see that 200 tasks were used to do this join. Does this make sense? I'm currently not doing anything special along the lines of partitioning (such as hash partitioning). Even if 200 tasks were required, since the result is only 6 records, shouldn't some of these empty partitions have been 'deleted'? I'm using Apache Spark 1.1 and I'm running this in local mode (local[1]). Any insight would be appreciated.

Thanks.

Darin.
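A minimal sketch of what the observation above looks like in code. This is illustrative only: it assumes a JavaSQLContext named sqlContext with "baseline" and "daily" registered as temp tables, and that the shuffle-partition setting is still at its default.

// With the default spark.sql.shuffle.partitions of 200, the joined result
// carries 200 partitions even though it only holds 6 rows; the empty
// partitions are not dropped automatically.
JavaSchemaRDD results = sqlContext.sql(
    "SELECT id, action, daily.epoch, daily.version " +
    "FROM baseline, daily " +
    "WHERE key=id AND action='u' AND daily.epoch > baseline.epoch").cache();

System.out.println("records:    " + results.count());             // 6
System.out.println("partitions: " + results.partitions().size()); // 200 by default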