[ https://issues.apache.org/jira/browse/SPARK-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li resolved SPARK-22665.
-----------------------------
    Resolution: Fixed
    Fix Version/s: 2.3.0

> Dataset API: .repartition() inconsistency / issue
> -------------------------------------------------
>
>                 Key: SPARK-22665
>                 URL: https://issues.apache.org/jira/browse/SPARK-22665
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Adrian Ionescu
>            Assignee: Marco Gaido
>             Fix For: 2.3.0
>
>
> We currently have two functions for explicitly repartitioning a Dataset:
> {code}
> def repartition(numPartitions: Int)
> {code}
> and
> {code}
> def repartition(numPartitions: Int, partitionExprs: Column*)
> {code}
> The second function's signature allows it to be called with an empty list of
> expressions as well.
> However:
> * {{df.repartition(numPartitions)}} does RoundRobin partitioning
> * {{df.repartition(numPartitions, Seq.empty: _*)}} does HashPartitioning on a
> constant, effectively moving all tuples to a single partition
> Not only is this inconsistent, but the latter behavior is very undesirable:
> it may hide problems in small-scale prototype code, but will inevitably fail
> (or have terrible performance) in production.
> I suggest we should make it:
> - either throw an {{IllegalArgumentException}}
> - or do RoundRobin partitioning, just like {{df.repartition(numPartitions)}}
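
For illustration only, the difference between the two behaviors described in the issue can be sketched in plain Scala without a Spark dependency. This is a simplified model, not Spark's actual partitioner code; the object name RepartitionSketch and the helpers roundRobin and hashPartition are hypothetical and exist only for this example. It assumes that round-robin assignment sends row i to partition i mod n, and that hash partitioning picks the partition from the hash of a key, so a constant key always yields the same partition.

```scala
// Simplified model of the two strategies; NOT Spark's implementation.
object RepartitionSketch {

  // Round-robin: row i is assigned to partition (i mod numPartitions).
  def roundRobin[A](rows: Seq[A], numPartitions: Int): Map[Int, Seq[A]] =
    rows.zipWithIndex
      .groupBy { case (_, i) => i % numPartitions }
      .map { case (p, indexed) => p -> indexed.map(_._1) }

  // Hash partitioning: the partition is the non-negative hash of the key,
  // modulo numPartitions. The key function is a stand-in for partitionExprs.
  def hashPartition[A, K](rows: Seq[A], numPartitions: Int)(key: A => K): Map[Int, Seq[A]] =
    rows.groupBy(r => Math.floorMod(key(r).hashCode, numPartitions))

  def main(args: Array[String]): Unit = {
    val rows = 1 to 8

    // Rows are spread evenly over the 4 partitions.
    println(roundRobin(rows, 4).map { case (p, xs) => p -> xs.size })

    // A constant key (modeling an empty expression list) hashes every row
    // to the same partition, so only one partition receives data.
    println(hashPartition(rows, 4)(_ => 0).map { case (p, xs) => p -> xs.size })
  }
}
```

In this model the constant-key call leaves three of the four partitions empty, which mirrors the skew the reporter describes for {{df.repartition(numPartitions, Seq.empty: _*)}} in 2.2.0.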