sam created SPARK-24425:
---------------------------

             Summary: Regression from 1.6 to 2.x - Spark no longer respects 
input partitions, unnecessary shuffle required
                 Key: SPARK-24425
                 URL: https://issues.apache.org/jira/browse/SPARK-24425
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.2
            Reporter: sam


I think this is a regression. In 1.6 we could easily control the number of output 
files / tasks: the number of read partitions matched the number of input files, and 
`coalesce` could then reduce it cheaply. Now I have to use `repartition` to get the 
desired number of files / partitions, which is unnecessarily expensive.
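
To illustrate, this is roughly the pattern we relied on in 1.6 (the path and the target partition count below are just for illustration):
{code:java}
// 1.6-era pattern: each input file became roughly one read partition,
// so coalesce() alone controlled the number of output files without a shuffle.
val df = sqlContext.read.parquet("/data/input")    // ~1 partition per input file
df.coalesce(100).write.parquet("/data/output")     // ~100 output files, no shuffle
{code}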

I've tried playing with `spark.sql.files.maxPartitionBytes` and 
`spark.sql.files.openCostInBytes` to see if I can force the 1.6 behaviour.
{code:java}
import org.apache.spark.sql.SparkSession

val ss = SparkSession.builder().appName("uber-cp").master(conf.master())
  .config("spark.sql.files.maxPartitionBytes", 1)
  .config("spark.sql.files.openCostInBytes", Long.MaxValue)
  .getOrCreate()
{code}
This didn't work. Spark still squashes all my parquet files into fewer partitions.
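
For example, a quick way to see the behaviour (illustrative path, using the `ss` session configured above):
{code:java}
// Even with maxPartitionBytes = 1 and openCostInBytes = Long.MaxValue,
// the read-side partition count comes out lower than the number of input files.
val df = ss.read.parquet("/data/input")
println(df.rdd.getNumPartitions)   // far fewer than the number of parquet files
{code}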

I suggest a simple `option` on `DataFrameReader` that can disable this coalescing 
(or enable it; the default behaviour should be the same as 1.6).
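
Something along these lines, where the option name below is purely hypothetical:
{code:java}
// Hypothetical API sketch: "respectInputPartitions" is a made-up option name,
// only to illustrate the proposal. Setting it to true would restore the 1.6
// behaviour of one read partition per input file.
val df = ss.read
  .option("respectInputPartitions", "true")
  .parquet("/data/input")
{code}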

 

This relates to https://issues.apache.org/jira/browse/SPARK-5997: if SPARK-5997 
were implemented, this ticket wouldn't really be necessary.

 

 


