[
https://issues.apache.org/jira/browse/SPARK-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Al M updated SPARK-9872:
Description:
When I join two normal RDDs, I can set the number of shuffle partitions via the
'numPartitions' argument. When I join two DataFrames, I do not have this option.
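A minimal sketch of the asymmetry using the Spark 1.4 Scala API (the data and
variable names here are made up for illustration):
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// RDD join: PairRDDFunctions.join takes an explicit numPartitions argument,
// so each join can get its own shuffle width.
val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((1, "x"), (2, "y")))
val rddJoined = left.join(right, 10)  // exactly 10 shuffle partitions for this join

// DataFrame join: no equivalent argument; the shuffle width always comes from
// the session-wide spark.sql.shuffle.partitions setting (default 200).
val leftDF  = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "l")
val rightDF = sqlContext.createDataFrame(Seq((1, "x"), (2, "y"))).toDF("id", "r")
val dfJoined = leftDF.join(rightDF, leftDF("id") === rightDF("id"))
{code}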
My Spark job loads two large files and two small files. When I perform a join,
Spark uses spark.sql.shuffle.partitions to determine the number of partitions.
This means that the join of my small files uses exactly the same number of
partitions as the join of my large files.
I can either use a low number of partitions and run out of memory on my large
join, or use a high number of partitions and have my small join take far too
long.
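A possible stop-gap with the current API is to toggle the session-wide setting
around each join, roughly as below (a sketch; the DataFrame names are
illustrative, and since DataFrames are lazy, the value in effect when the job
actually runs is what counts):
{code:scala}
// Session-wide toggle: this also affects any other query running at the same time.
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
largeDF1.join(largeDF2, largeDF1("id") === largeDF2("id")).count()  // force execution now

sqlContext.setConf("spark.sql.shuffle.partitions", "20")
smallDF1.join(smallDF2, smallDF1("id") === smallDF2("id")).count()
{code}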
If we were able to specify the number of shuffle partitions in a DataFrame
join, as we can in an RDD join, this would not be an issue.
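For example, an overload along these lines (purely hypothetical; no such
overload exists in any Spark release) would mirror the RDD API:
{code:scala}
// Hypothetical -- this overload does not exist today.
// It would mirror PairRDDFunctions.join(other, numPartitions):
//   def join(right: DataFrame, joinExprs: Column, numPartitions: Int): DataFrame

// Intended usage: a small join gets few partitions without touching the
// session-wide spark.sql.shuffle.partitions setting.
val joined = smallDF1.join(smallDF2, smallDF1("id") === smallDF2("id"), numPartitions = 20)
{code}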
My long-term ideal solution would be dynamic partition determination, as
described in SPARK-4630. However, I appreciate that this is not particularly
easy to do.
Allow passing of 'numPartitions' to DataFrame joins
---
Key: SPARK-9872
URL: https://issues.apache.org/jira/browse/SPARK-9872
Project: Spark
Issue Type: Improvement
Affects Versions: 1.4.1
Reporter: Al M
Priority: Minor