[jira] [Updated] (SPARK-9872) Allow passing of 'numPartitions' to DataFrame joins

2015-08-12 Thread Al M (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-9872:

Description: 
When I join two normal RDDs, I can set the number of shuffle partitions in the 
'numPartitions' argument.  When I join two DataFrames I do not have this option.

My Spark job loads two large files and two small files.  When I perform a join, 
Spark uses the spark.sql.shuffle.partitions setting to determine the number of 
partitions.  This means that the join of my small files uses exactly the 
same number of partitions as the join of my large files.

I can either use a low number of partitions and run out of memory on my large 
join, or use a high number of partitions and my small join will take far too 
long.
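
To make the trade-off concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark required). The input sizes, the 128 MB target and the helper names are all assumptions for illustration; they are not figures from my actual job and not part of any Spark API.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def partitions_for(input_bytes, target_partition_bytes):
    """Smallest partition count that keeps each partition at or under the target size."""
    return max(1, math.ceil(input_bytes / target_partition_bytes))


def bytes_per_partition(input_bytes, num_partitions):
    """Average shuffle partition size, assuming evenly distributed keys."""
    return input_bytes / num_partitions


# Hypothetical sizes, chosen only to illustrate the problem.
large_join_bytes = 200 * GB
small_join_bytes = 50 * MB
target = 128 * MB  # assumed comfortable per-partition size

# What each join would ideally get if the count could be set per join.
large_parts = partitions_for(large_join_bytes, target)
small_parts = partitions_for(small_join_bytes, target)

# What a single global setting forces instead.
small_with_high_setting = bytes_per_partition(small_join_bytes, large_parts)
large_with_low_setting = bytes_per_partition(large_join_bytes, 16)

print("ideal partitions (large, small):", large_parts, small_parts)
print("small join, high setting: %.0f KB per partition" % (small_with_high_setting / 1024))
print("large join, low setting:  %.1f GB per partition" % (large_with_low_setting / GB))
```

Under these assumed numbers the two joins want partition counts that differ by three orders of magnitude, so any single value is a poor fit for one of them: a high value gives the small join many tiny tasks, and a low value gives the large join partitions far too big to hold in memory.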

If we were able to specify the number of shuffle partitions in a DataFrame join 
like in an RDD join, this would not be an issue.

My long term ideal solution would be dynamic partition determination as 
described in SPARK-4630.  However I appreciate that it is not particularly easy 
to do.

  was:
When I join two normal RDDs, I can set the number of shuffle partitions in the 
'numPartitions' argument.  When I join two DataFrames I do not have this option.

My spark job loads in 2 large files and 2 small files.  When I perform a join, 
this will use the spark.sql.shuffle.partitions to determine the number of 
partitions.  This means that the join with my small files will use exactliy the 
same number of partitions as the join with my large files.

I can either use a low number of partitions and run out of memory on my large 
join, or use a high number of partitions and my small join will take far too 
long.

If we were able to specify the number of shuffle partitions in a DataFrame join 
like in an RDD join, this would not be an issue.

My long term ideal solution would be dynamic partition determination as 
described in SPARK-4630.  However I appreciate that it is not particularly easy 
to do.


 Allow passing of 'numPartitions' to DataFrame joins
 ---

 Key: SPARK-9872
 URL: https://issues.apache.org/jira/browse/SPARK-9872
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.4.1
Reporter: Al M
Priority: Minor




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9872) Allow passing of 'numPartitions' to DataFrame joins

2015-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9872:
-
Component/s: SQL

 Allow passing of 'numPartitions' to DataFrame joins
 ---

 Key: SPARK-9872
 URL: https://issues.apache.org/jira/browse/SPARK-9872
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Al M
Priority: Minor



