[jira] [Updated] (MAPREDUCE-6003) Resource Estimator suggests huge map output in some cases

2015-05-05 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated MAPREDUCE-6003:

Labels: BB2015-05-TBR  (was: )

 Resource Estimator suggests huge map output in some cases
 -

 Key: MAPREDUCE-6003
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6003
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobtracker
Affects Versions: 1.2.1
Reporter: Chengbing Liu
Assignee: Chengbing Liu
  Labels: BB2015-05-TBR
 Attachments: MAPREDUCE-6003-branch-1.2.patch


 In some cases, ResourceEstimator returns a map output estimate that is far 
 too large. This happens when the input size is not calculated correctly.
 A typical case is a join of two Hive tables, one in HDFS and the other in 
 HBase. The maps that process the HBase table finish first, and they report 
 an input length of 0 because of TableInputFormat. For a map that then 
 processes the HDFS table, the estimated output size is very large because of 
 the wrong input size, so the map task cannot be assigned.
 There are two possible solutions to this problem:
 (1) Make the input size correct for each case, e.g. HBase.
 (2) Use another algorithm to estimate the map output, or at least bring the 
 estimate closer to reality.
 I prefer the second, since the first would require every kind of input to 
 be handled, which is not easy for some inputs such as URIs.
 In my opinion, we could make a second estimate that is independent of the 
 input size:
 estimationB = (completedMapOutputSize / completedMaps) * totalMaps * 10
 Here, multiplying by 10 makes the estimate more conservative, so that a task 
 is less likely to be assigned somewhere that is not big enough.
 The existing estimate is:
 estimationA = (inputSize * completedMapOutputSize * 2.0) / completedMapInputSize
 My suggestion is to take the minimum of the two estimates:
 estimation = min(estimationA, estimationB)
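
 A minimal sketch of how the two formulas in the description behave. It is 
 written in Python for illustration only and is not the attached patch. The 
 function names, the handling of a zero denominator, and the sample numbers 
 are all assumptions made for this example.

```python
def estimate_a(input_size, completed_map_output_size, completed_map_input_size):
    """Input-proportional estimate (estimationA in the description)."""
    if completed_map_input_size == 0:
        # Assumption of this sketch: treat a zero denominator as "unbounded".
        return float("inf")
    return (input_size * completed_map_output_size * 2.0) / completed_map_input_size


def estimate_b(completed_map_output_size, completed_maps, total_maps):
    """Input-independent estimate (estimationB in the description)."""
    if completed_maps == 0:
        # Assumption of this sketch: no completed maps means no bound yet.
        return float("inf")
    return (completed_map_output_size / completed_maps) * total_maps * 10


def estimate(input_size, completed_map_output_size, completed_map_input_size,
             completed_maps, total_maps):
    """Proposed combination: the smaller of the two estimates."""
    return min(
        estimate_a(input_size, completed_map_output_size, completed_map_input_size),
        estimate_b(completed_map_output_size, completed_maps, total_maps),
    )


# Hypothetical numbers: 10 finished maps wrote 1 GB in total but reported
# almost no input (10 bytes), while the whole job has 100 GB of input
# spread over 1000 maps.
a = estimate_a(10**11, 10**9, 10)
b = estimate_b(10**9, 10, 1000)
combined = estimate(10**11, 10**9, 10, 10, 1000)
print(a, b, combined)
```

 With these made-up numbers, estimationA is dominated by the tiny reported 
 input size, while estimationB stays at ten times the average output per 
 completed map scaled to all maps, so the minimum picks estimationB.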



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6003) Resource Estimator suggests huge map output in some cases

2014-07-25 Thread Chengbing Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengbing Liu updated MAPREDUCE-6003:
-

Assignee: Chengbing Liu
  Status: Patch Available  (was: Open)

 Resource Estimator suggests huge map output in some cases
 -

 Key: MAPREDUCE-6003
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6003
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobtracker
Affects Versions: 1.2.1
Reporter: Chengbing Liu
Assignee: Chengbing Liu

--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-6003) Resource Estimator suggests huge map output in some cases

2014-07-25 Thread Chengbing Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengbing Liu updated MAPREDUCE-6003:
-

Attachment: MAPREDUCE-6003.patch

 Resource Estimator suggests huge map output in some cases
 -

 Key: MAPREDUCE-6003
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6003
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobtracker
Affects Versions: 1.2.1
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Attachments: MAPREDUCE-6003.patch




[jira] [Updated] (MAPREDUCE-6003) Resource Estimator suggests huge map output in some cases

2014-07-25 Thread Chengbing Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengbing Liu updated MAPREDUCE-6003:
-

Attachment: (was: MAPREDUCE-6003.patch)

 Resource Estimator suggests huge map output in some cases
 -

 Key: MAPREDUCE-6003
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6003
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobtracker
Affects Versions: 1.2.1
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Attachments: MAPREDUCE-6003-branch-1.2.patch




[jira] [Updated] (MAPREDUCE-6003) Resource Estimator suggests huge map output in some cases

2014-07-25 Thread Chengbing Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengbing Liu updated MAPREDUCE-6003:
-

Attachment: MAPREDUCE-6003-branch-1.2.patch

 Resource Estimator suggests huge map output in some cases
 -

 Key: MAPREDUCE-6003
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6003
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobtracker
Affects Versions: 1.2.1
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Attachments: MAPREDUCE-6003-branch-1.2.patch

