philipse created SPARK-31588:
--------------------------------

             Summary: merge small files may need more common setting
                 Key: SPARK-31588
                 URL: https://issues.apache.org/jira/browse/SPARK-31588
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.5
         Environment: spark:2.4.5

hdp:2.7
            Reporter: philipse


Hi,

Spark SQL currently lets us use the REPARTITION or COALESCE hints to manually 
control small output files, for example:

/*+ REPARTITION(1) */

/*+ COALESCE(1) */

But this can only be tuned case by case: for each query we have to decide 
whether to use COALESCE or REPARTITION. Could we offer a more general approach 
that removes this decision by setting a target file size, as Hive does?
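
To make the target-size idea concrete, here is a minimal sketch of how a partition count could be derived from the total output size. The function name and its parameters are hypothetical and are not existing Spark settings; it only illustrates the arithmetic.

```python
def target_partitions(total_bytes: int, target_file_bytes: int) -> int:
    """Hypothetical helper: how many output files are needed so that
    each one is roughly target_file_bytes in size."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division; always write at least one file.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


if __name__ == "__main__":
    # 1 GiB of output with a 128 MiB target file size.
    print(target_partitions(1024**3, 128 * 1024**2))
```

With a count like this available, the choice between COALESCE and REPARTITION would no longer have to be made by hand for each query.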

*Good points:*

1) The new number of partitions is also determined for us.

2) With an on/off parameter provided, users can disable it if needed.

3) The parameter can be set at the cluster level instead of on the user side, 
which makes it easier to control small files.

4) It greatly reduces the pressure on the NameNode.

 

*Not good points:*

1) It will add a new task that calculates the target number of files by 
collecting statistics on the output files.
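
As a rough sketch of what such an extra statistics step might decide, the following takes the sizes of the files a job wrote and returns how many files to merge them into, or None when no merge is worthwhile. All names and default values here are hypothetical and are not Spark configuration; the average-size threshold is loosely inspired by Hive's hive.merge.smallfiles.avgsize setting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MergeConf:
    # All fields are hypothetical, for illustration only.
    enabled: bool = True                      # the on/off switch
    avg_size_threshold: int = 16 * 1024**2    # merge only if the average file is smaller
    target_file_bytes: int = 128 * 1024**2    # desired size of each merged file


def plan_merge(file_sizes: List[int], conf: MergeConf) -> Optional[int]:
    """Return the target number of files, or None when no merge is needed."""
    if not conf.enabled or not file_sizes:
        return None
    total = sum(file_sizes)
    if total / len(file_sizes) >= conf.avg_size_threshold:
        return None  # files are already large enough on average
    # Integer ceiling division; keep at least one file.
    target = max(1, -(-total // conf.target_file_bytes))
    # Merging is only useful if it reduces the file count.
    return target if target < len(file_sizes) else None


if __name__ == "__main__":
    small_files = [1024**2] * 200  # 200 files of 1 MiB each
    print(plan_merge(small_files, MergeConf()))
```

The cost noted above is visible here: the file sizes must be gathered after the write, which means one more pass over the output before any merge can start.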

 

I don't know whether this is already planned for a future release.

 

Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
