[ 
https://issues.apache.org/jira/browse/HIVE-17004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bing Li reassigned HIVE-17004:
------------------------------

    Assignee: Bing Li

> Calculating Number Of Reducers Looks At All Files
> -------------------------------------------------
>
>                 Key: HIVE-17004
>                 URL: https://issues.apache.org/jira/browse/HIVE-17004
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive
>    Affects Versions: 2.1.1
>            Reporter: BELUGA BEHR
>            Assignee: Bing Li
>
> When calculating the number of Mappers and Reducers, the two algorithms 
> look at different data sets.  The number of Mappers is calculated from the 
> number of splits, while the number of Reducers is estimated from the total 
> size of the files under the HDFS directory, scanned recursively.  As a 
> result, if I add files to a sub-directory of the HDFS directory, the number 
> of splits stays the same, since I did not tell Hive to search recursively, 
> but the number of Reducers increases.  Please improve this so that the 
> Reducer estimate considers only the files that are considered for splits, 
> and not files within sub-directories (unless configured to do so).
> {code}
> CREATE EXTERNAL TABLE Complaints (
>   a string,
>   b string,
>   c string,
>   d string,
>   e string,
>   f string,
>   g string
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LOCATION '/user/admin/complaints';
> {code}
> {code}
> [root@host ~]# sudo -u hdfs hdfs dfs -ls -R /user/admin/complaints
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.1.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.2.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.3.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.4.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.5.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.csv
> {code}
> {code}
> INFO  : Compiling 
> command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae): 
> select a, count(1) from complaints group by a limit 10
> INFO  : Semantic Analysis Completed
> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a, 
> type:string, comment:null), FieldSchema(name:_c1, type:bigint, 
> comment:null)], properties:null)
> INFO  : Completed compiling 
> command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae); 
> Time taken: 0.077 seconds
> INFO  : Executing 
> command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae): 
> select a, count(1) from complaints group by a limit 10
> INFO  : Query ID = hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae
> INFO  : Total jobs = 1
> INFO  : Launching Job 1 out of 1
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
> INFO  : Number of reduce tasks not specified. Estimated from input data size: 
> 11
> INFO  : In order to change the average load for a reducer (in bytes):
> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
> INFO  : In order to limit the maximum number of reducers:
> INFO  :   set hive.exec.reducers.max=<number>
> INFO  : In order to set a constant number of reducers:
> INFO  :   set mapreduce.job.reduces=<number>
> INFO  : number of splits:2
> INFO  : Submitting tokens for job: job_1493729203063_0003
> INFO  : The url to track the job: 
> http://host:8088/proxy/application_1493729203063_0003/
> INFO  : Starting Job = job_1493729203063_0003, Tracking URL = 
> http://host:8088/proxy/application_1493729203063_0003/
> INFO  : Kill Command = 
> /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/bin/hadoop job  
> -kill job_1493729203063_0003
> INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of 
> reducers: 11
> INFO  : 2017-05-02 14:20:14,206 Stage-1 map = 0%,  reduce = 0%
> INFO  : 2017-05-02 14:20:22,520 Stage-1 map = 100%,  reduce = 0%, Cumulative 
> CPU 4.48 sec
> INFO  : 2017-05-02 14:20:34,029 Stage-1 map = 100%,  reduce = 27%, Cumulative 
> CPU 15.72 sec
> INFO  : 2017-05-02 14:20:35,069 Stage-1 map = 100%,  reduce = 55%, Cumulative 
> CPU 21.94 sec
> INFO  : 2017-05-02 14:20:36,110 Stage-1 map = 100%,  reduce = 64%, Cumulative 
> CPU 23.97 sec
> INFO  : 2017-05-02 14:20:39,233 Stage-1 map = 100%,  reduce = 73%, Cumulative 
> CPU 25.26 sec
> INFO  : 2017-05-02 14:20:43,392 Stage-1 map = 100%,  reduce = 100%, 
> Cumulative CPU 30.9 sec
> INFO  : MapReduce Total cumulative CPU time: 30 seconds 900 msec
> INFO  : Ended Job = job_1493729203063_0003
> INFO  : MapReduce Jobs Launched: 
> INFO  : Stage-Stage-1: Map: 2  Reduce: 11   Cumulative CPU: 30.9 sec   HDFS 
> Read: 735691149 HDFS Write: 153 SUCCESS
> INFO  : Total MapReduce CPU Time Spent: 30 seconds 900 msec
> INFO  : Completed executing 
> command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae); 
> Time taken: 36.035 seconds
> INFO  : OK
> {code}
> {code}
> [root@host ~]# sudo -u hdfs hdfs dfs -ls -R /user/admin/complaints
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.1.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.2.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.3.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.4.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.5.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 
> /user/admin/complaints/Consumer_Complaints.csv
> drwxr-xr-x   - admin admin          0 2017-05-02 14:16 
> /user/admin/complaints/t
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 
> /user/admin/complaints/t/Consumer_Complaints.1.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 
> /user/admin/complaints/t/Consumer_Complaints.2.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 
> /user/admin/complaints/t/Consumer_Complaints.3.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 
> /user/admin/complaints/t/Consumer_Complaints.4.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 
> /user/admin/complaints/t/Consumer_Complaints.5.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 
> /user/admin/complaints/t/Consumer_Complaints.csv
> {code}
> {code}
> INFO  : Compiling 
> command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e): 
> select a, count(1) from complaints group by a limit 10
> INFO  : Semantic Analysis Completed
> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a, 
> type:string, comment:null), FieldSchema(name:_c1, type:bigint, 
> comment:null)], properties:null)
> INFO  : Completed compiling 
> command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e); 
> Time taken: 0.073 seconds
> INFO  : Executing 
> command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e): 
> select a, count(1) from complaints group by a limit 10
> INFO  : Query ID = hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e
> INFO  : Total jobs = 1
> INFO  : Launching Job 1 out of 1
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
> INFO  : Number of reduce tasks not specified. Estimated from input data size: 
> 22
> INFO  : In order to change the average load for a reducer (in bytes):
> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
> INFO  : In order to limit the maximum number of reducers:
> INFO  :   set hive.exec.reducers.max=<number>
> INFO  : In order to set a constant number of reducers:
> INFO  :   set mapreduce.job.reduces=<number>
> INFO  : number of splits:2
> INFO  : Submitting tokens for job: job_1493729203063_0004
> INFO  : The url to track the job: 
> http://host:8088/proxy/application_1493729203063_0004/
> INFO  : Starting Job = job_1493729203063_0004, Tracking URL = 
> http://host:8088/proxy/application_1493729203063_0004/
> INFO  : Kill Command = 
> /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/bin/hadoop job  
> -kill job_1493729203063_0004
> INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of 
> reducers: 22
> INFO  : 2017-05-02 14:29:27,464 Stage-1 map = 0%,  reduce = 0%
> INFO  : 2017-05-02 14:29:36,829 Stage-1 map = 100%,  reduce = 0%, Cumulative 
> CPU 10.2 sec
> INFO  : 2017-05-02 14:29:47,287 Stage-1 map = 100%,  reduce = 14%, Cumulative 
> CPU 15.36 sec
> INFO  : 2017-05-02 14:29:49,381 Stage-1 map = 100%,  reduce = 27%, Cumulative 
> CPU 20.76 sec
> INFO  : 2017-05-02 14:29:50,433 Stage-1 map = 100%,  reduce = 32%, Cumulative 
> CPU 22.69 sec
> INFO  : 2017-05-02 14:29:56,743 Stage-1 map = 100%,  reduce = 45%, Cumulative 
> CPU 27.73 sec
> INFO  : 2017-05-02 14:30:00,916 Stage-1 map = 100%,  reduce = 64%, Cumulative 
> CPU 34.95 sec
> INFO  : 2017-05-02 14:30:06,142 Stage-1 map = 100%,  reduce = 77%, Cumulative 
> CPU 41.49 sec
> INFO  : 2017-05-02 14:30:10,297 Stage-1 map = 100%,  reduce = 82%, Cumulative 
> CPU 42.92 sec
> INFO  : 2017-05-02 14:30:11,334 Stage-1 map = 100%,  reduce = 86%, Cumulative 
> CPU 45.24 sec
> INFO  : 2017-05-02 14:30:12,365 Stage-1 map = 100%,  reduce = 100%, 
> Cumulative CPU 50.33 sec
> INFO  : MapReduce Total cumulative CPU time: 50 seconds 330 msec
> INFO  : Ended Job = job_1493729203063_0004
> INFO  : MapReduce Jobs Launched: 
> INFO  : Stage-Stage-1: Map: 2  Reduce: 22   Cumulative CPU: 50.33 sec   HDFS 
> Read: 735731640 HDFS Write: 153 SUCCESS
> INFO  : Total MapReduce CPU Time Spent: 50 seconds 330 msec
> INFO  : Completed executing 
> command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e); 
> Time taken: 51.841 seconds
> INFO  : OK
> {code}
> https://github.com/apache/hive/blob/bc510f63de9d6baab3a5ad8a4bf4eed9c6fde8b1/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2959
> The number of splits (Mappers) stays the same between the two runs, while 
> the number of Reducers doubles:
> *INFO  : number of splits:2*
> # Number of reduce tasks not specified. Estimated from input data size: 11
> # Number of reduce tasks not specified. Estimated from input data size: 22
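The two estimates above are consistent with dividing the *recursive* content size of the table location by a 64 MB hive.exec.reducers.bytes.per.reducer setting (the CDH 5.x default; this value is an assumption inferred from the logs, not stated in them). A minimal sketch of that arithmetic, not the actual Utilities.java code:

```java
// Hypothetical sketch, not the actual Hive implementation: it only
// reproduces the arithmetic behind "Estimated from input data size".
public class ReducerEstimateSketch {

    // Assumed hive.exec.reducers.bytes.per.reducer = 64 MB (the CDH 5.x
    // default); chosen because it reproduces the 11 and 22 in the logs.
    static final long BYTES_PER_REDUCER = 67_108_864L;

    // Reducer estimate: ceil(totalInputBytes / bytesPerReducer).
    static int estimateReducers(long totalInputBytes) {
        return (int) Math.ceil((double) totalInputBytes / BYTES_PER_REDUCER);
    }

    public static void main(String[] args) {
        long fileSize = 122_607_137L; // each Consumer_Complaints CSV above

        // Run 1: six top-level files -> 735,642,822 bytes -> 11 reducers.
        System.out.println(estimateReducers(6 * fileSize));

        // Run 2: the same six files plus six copies under sub-directory /t.
        // The splits (2 Mappers) are unchanged, but the recursive input
        // size doubles, so the reducer estimate doubles -> 22 reducers.
        System.out.println(estimateReducers(12 * fileSize));
    }
}
```

Because the reducer estimate is fed by a recursive size summary while split generation is not recursive by default, any bytes hidden in sub-directories inflate the reducer count without contributing any map input.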



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
