[jira] [Moved] (MAPREDUCE-5973) TestAMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi OZAWA moved YARN-2303 to MAPREDUCE-5973:
-------------------------------------------------
Issue Type: Test (was: Bug)
Key: MAPREDUCE-5973 (was: YARN-2303)
Project: Hadoop Map/Reduce (was: Hadoop YARN)

TestAMWebServices* fails intermittently
---------------------------------------
Key: MAPREDUCE-5973
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5973
Project: Hadoop Map/Reduce
Issue Type: Test
Reporter: Tsuyoshi OZAWA
Attachments: test-failure-log.txt

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5973) TestAMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi OZAWA updated MAPREDUCE-5973:
--------------------------------------
Description: The tests can fail because of bind exception.

TestAMWebServices* fails intermittently
---------------------------------------
Key: MAPREDUCE-5973
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5973
Project: Hadoop Map/Reduce
Issue Type: Test
Reporter: Tsuyoshi OZAWA
Attachments: test-failure-log.txt

The tests can fail because of bind exception.
[jira] [Updated] (MAPREDUCE-5973) TestAMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi OZAWA updated MAPREDUCE-5973:
--------------------------------------
Attachment: test-failure-log.txt

TestAMWebServices* fails intermittently
---------------------------------------
Key: MAPREDUCE-5973
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5973
Project: Hadoop Map/Reduce
Issue Type: Test
Reporter: Tsuyoshi OZAWA
Attachments: test-failure-log.txt

The tests can fail because of bind exception.
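Bind-exception flakiness like this usually comes from a test binding a hard-coded port that another test or JVM is still holding. A common remedy (a minimal sketch, not the actual TestAMWebServices fix) is to bind port 0 so the OS hands out a free ephemeral port:

```java
import java.io.IOException;
import java.net.ServerSocket;

public class EphemeralPortExample {
    // Ask the OS for a free port instead of hard-coding one; this avoids
    // java.net.BindException when concurrent tests race for the same port.
    static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort(); // the port the OS actually assigned
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(findFreePort());
    }
}
```

The returned port should then be passed to the server under test rather than a fixed constant.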
[jira] [Commented] (MAPREDUCE-5957) AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used
[ https://issues.apache.org/jira/browse/MAPREDUCE-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064671#comment-14064671 ]

Hadoop QA commented on MAPREDUCE-5957:
--------------------------------------
{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12656184/MAPREDUCE-5957.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4746//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4746//console

This message is automatically generated.
AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used
-------------------------------------------------------------------------------------------------------
Key: MAPREDUCE-5957
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5957
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Sangjin Lee
Attachments: MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch

With the job classloader enabled, the MR AM throws ClassNotFoundException if a custom output format class is specified.

{noformat}
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:473)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:374)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1459)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1456)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1389)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1895)
	at org.apache.hadoop.mapreduce.task.JobContextImpl.getOutputFormatClass(JobContextImpl.java:222)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:469)
	... 8 more
Caused by: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1893)
	... 10 more
{noformat}
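The failure pattern above is the classic symptom of resolving a class through a classloader that cannot see the job's jars. The mechanism can be illustrated in isolation with `Class.forName(name, initialize, loader)`, which resolves against an explicit loader; this is a sketch of the lookup behavior only, not the actual MRAppMaster fix:

```java
public class ClassLoaderLookup {
    public static void main(String[] args) {
        ClassLoader appLoader = ClassLoaderLookup.class.getClassLoader();

        try {
            // A class visible to the chosen loader resolves normally.
            Class<?> ok = Class.forName("java.util.ArrayList", false, appLoader);
            System.out.println(ok.getName());

            // A class the loader cannot see throws ClassNotFoundException --
            // what the AM hits when the job classloader is not consulted.
            Class.forName("com.foo.test.TestOutputFormat", false, appLoader);
            System.out.println("unexpectedly found");
        } catch (ClassNotFoundException e) {
            System.out.println("ClassNotFoundException: " + e.getMessage());
        }
    }
}
```

The direction of the fix is to ensure Configuration consults the loader that actually contains the user's output format class.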
[jira] [Commented] (MAPREDUCE-5910) MRAppMaster should handle Resync from RM instead of shutting down.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064692#comment-14064692 ]

Rohith commented on MAPREDUCE-5910:
-----------------------------------
The test failure is the same as MAPREDUCE-5973.

MRAppMaster should handle Resync from RM instead of shutting down.
------------------------------------------------------------------
Key: MAPREDUCE-5910
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5910
Project: Hadoop Map/Reduce
Issue Type: Task
Components: applicationmaster
Reporter: Rohith
Assignee: Rohith
Attachments: MAPREDUCE-5910.1.patch, MAPREDUCE-5910.2.patch, MAPREDUCE-5910.3.patch

The ApplicationMasterService currently sends a resync response, to which the AM responds by shutting down. The MRAppMaster behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0; the AM should then send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM, things should proceed as normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once.
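The resync contract described above (reset the allocate sequence number to 0 and resend the entire outstanding request) can be sketched as follows; the class and field names are hypothetical, not the actual MRAppMaster code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the AM-side state an allocate loop would keep
// in order to survive a resync instead of shutting down.
public class ResyncSketch {
    int responseId = 0;                                  // allocate RPC sequence number
    final List<String> outstanding = new ArrayList<>();  // all not-yet-satisfied asks
    final List<String> nextAsk = new ArrayList<>();      // asks for the next allocate call

    void ask(String request) {
        outstanding.add(request);
        nextAsk.add(request);
    }

    // Called when the RM answers with a resync command.
    void onResync() {
        responseId = 0;                // restart the sequence from 0
        nextAsk.clear();
        nextAsk.addAll(outstanding);   // resend the entire outstanding request
    }
}
```

The key design point is that the AM must track its outstanding asks itself, since after a resync the RM has forgotten the incremental state.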
[jira] [Commented] (MAPREDUCE-5971) Move the default options for distcp -p to DistCpOptionSwitch
[ https://issues.apache.org/jira/browse/MAPREDUCE-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064793#comment-14064793 ]

Hudson commented on MAPREDUCE-5971:
-----------------------------------
FAILURE: Integrated in Hadoop-Yarn-trunk #615 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/615/])
MAPREDUCE-5971. Move the default options for distcp -p to DistCpOptionSwitch. Contributed by Charles Lamb. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1611217)
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptionSwitch.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java

Move the default options for distcp -p to DistCpOptionSwitch
------------------------------------------------------------
Key: MAPREDUCE-5971
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5971
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: distcp
Affects Versions: trunk
Reporter: Charles Lamb
Assignee: Charles Lamb
Priority: Trivial
Fix For: 2.6.0
Attachments: MAPREDUCE-5971.001.patch, MAPREDUCE-5971.002.patch

The default preserve flags for distcp -p are embedded in the OptionsParser code. Refactor to co-locate them with the actual flag initialization.
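The refactoring described here, moving the default -p preserve flags out of the parser and next to the switch definitions, follows the pattern of making each option carry its own default. A hedged sketch of that pattern, with invented names rather than the real DistCpOptionSwitch members:

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: each switch declares whether it is preserved by
// default, so the parser no longer hard-codes the default set elsewhere.
enum CopyOptionSwitch {
    PRESERVE_BLOCK_SIZE(true),
    PRESERVE_REPLICATION(true),
    PRESERVE_PERMISSIONS(false);

    private final boolean onByDefault;

    CopyOptionSwitch(boolean onByDefault) {
        this.onByDefault = onByDefault;
    }

    // The parser derives the default set from the enum itself.
    static Set<CopyOptionSwitch> defaults() {
        EnumSet<CopyOptionSwitch> set = EnumSet.noneOf(CopyOptionSwitch.class);
        for (CopyOptionSwitch s : values()) {
            if (s.onByDefault) {
                set.add(s);
            }
        }
        return set;
    }
}
```

Co-locating the default with the flag definition means adding a new flag cannot silently desynchronize the parser's default list.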
[jira] [Commented] (MAPREDUCE-5952) LocalContainerLauncher#renameMapOutputForReduce incorrectly assumes a single dir for mapOutIndex
[ https://issues.apache.org/jira/browse/MAPREDUCE-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064792#comment-14064792 ]

Hudson commented on MAPREDUCE-5952:
-----------------------------------
FAILURE: Integrated in Hadoop-Yarn-trunk #615 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/615/])
MAPREDUCE-5952. LocalContainerLauncher#renameMapOutputForReduce incorrectly assumes a single dir for mapOutIndex. (Gera Shegalov via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1611196)
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/LocalContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapred/TestLocalContainerLauncher.java

LocalContainerLauncher#renameMapOutputForReduce incorrectly assumes a single dir for mapOutIndex
------------------------------------------------------------------------------------------------
Key: MAPREDUCE-5952
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5952
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am, mrv2
Affects Versions: 2.3.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
Priority: Blocker
Fix For: 2.5.0
Attachments: MAPREDUCE-5952.v01.patch, MAPREDUCE-5952.v02.patch, MAPREDUCE-5952.v03.patch, MAPREDUCE-5952.v04.patch

The javadoc comment for {{renameMapOutputForReduce}} incorrectly refers to a single map output directory, whereas this depends on LOCAL_DIRS. mapOutIndex should be set to subMapOutputFile.getOutputIndexFile().

{code}
2014-06-30 14:48:35,574 WARN [uber-SubtaskRunner] org.apache.hadoop.mapred.LocalContainerLauncher: Exception running local (uberized) 'child' : java.io.FileNotFoundException: File /Users/gshegalov/workspace/hadoop-common/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/org.apache.hadoop.mapreduce.v2.TestMRJobs/org.apache.hadoop.mapreduce.v2.TestMRJobs-localDir-nm-2_3/usercache/gshegalov/appcache/application_1404164272885_0001/output/file.out.index does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:517)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:726)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:507)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
	at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:334)
	at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:504)
	at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.renameMapOutputForReduce(LocalContainerLauncher.java:471)
	at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:370)
	at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:292)
	at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:178)
	at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:221)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:695)
{code}
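The bug class here is deriving the `.index` path independently of the output file, so the two can land in different LOCAL_DIRS entries. The safe direction (illustrated with plain `java.nio.file` paths; the method name is invented, not the Hadoop API) is to derive the index path from the output path itself:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class MapOutputPaths {
    // Derive the index path from the output path itself, so both always
    // live in the same local dir, whichever of LOCAL_DIRS was chosen.
    static Path indexFor(Path mapOut) {
        return mapOut.resolveSibling(mapOut.getFileName() + ".index");
    }

    public static void main(String[] args) {
        Path out = Paths.get("/local/dir3/appcache/app_1/output/file.out");
        System.out.println(indexFor(out)); // sibling of file.out, same dir
    }
}
```

Any scheme that looks the index up through a separate "pick a local dir" call can disagree with the dir the output actually landed in, which is exactly the FileNotFoundException above.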
[jira] [Updated] (MAPREDUCE-2841) Task level native optimization
[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Zhong updated MAPREDUCE-2841:
----------------------------------
Status: Open (was: Patch Available)

Task level native optimization
------------------------------
Key: MAPREDUCE-2841
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: task
Environment: x86-64 Linux/Unix
Reporter: Binglin Chang
Assignee: Sean Zhong
Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, fb-shuffle.patch

I've recently been working on native optimization for MapTask based on JNI. The basic idea is to add a NativeMapOutputCollector to handle k/v pairs emitted by the mapper, so that sort, spill, and IFile serialization can all be done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed promising results:
1. Sort is about 3x-10x as fast as Java (only binary string comparison is supported).
2. IFile serialization speed is about 3x that of Java, about 500MB/s; if hardware CRC32C is used, things can get much faster (1G/
3. Merge code is not complete yet, so the test used enough io.sort.mb to prevent mid-spill.
This leads to a total speedup of 2x~3x for the whole MapTask if IdentityMapper (a mapper that does nothing) is used.
There are limitations, of course: currently only Text and BytesWritable are supported, and I have not thought through many things yet, such as how to support map-side combine. I had some discussion with somebody familiar with Hive; it seems these limitations won't be much of a problem for Hive to benefit from the optimizations, at least. Advice or discussion about improving compatibility is most welcome :)
Currently NativeMapOutputCollector has a static method called canEnable(), which checks whether the key/value types, comparator type, and combiner are all compatible; MapTask can then choose to enable NativeMapOutputCollector. This is only a preliminary test; more work needs to be done. I expect better final results, and I believe similar optimization can be adopted for the reduce task and shuffle too.
[jira] [Commented] (MAPREDUCE-2841) Task level native optimization
[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064878#comment-14064878 ]

Sean Zhong commented on MAPREDUCE-2841:
---------------------------------------
Hi Todd,

The patch is uploaded to: https://raw.githubusercontent.com/intel-hadoop/nativetask/native_output_collector/patch/hadoop-3.0-mapreduce-2841-2014-7-17.patch
(It is too big to be uploaded here.) It is patched against the hadoop 3.0 trunk.

Task level native optimization
------------------------------
Key: MAPREDUCE-2841
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: task
Environment: x86-64 Linux/Unix
Reporter: Binglin Chang
Assignee: Sean Zhong
Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, fb-shuffle.patch
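The canEnable() gate the description mentions, only handing collection to native code when key/value types, comparator, and combiner are all supported, can be sketched like this (the supported set and method shape are assumptions taken from the description, not the real NativeMapOutputCollector API):

```java
public class NativeCollectorGate {
    // Per the description, only Text and BytesWritable are supported, only
    // the default binary comparator, and map-side combine is not yet handled.
    static boolean canEnable(String keyClass, String valueClass,
                             boolean defaultComparator, boolean hasCombiner) {
        return isSupportedType(keyClass)
            && isSupportedType(valueClass)
            && defaultComparator
            && !hasCombiner;
    }

    private static boolean isSupportedType(String className) {
        return className.equals("org.apache.hadoop.io.Text")
            || className.equals("org.apache.hadoop.io.BytesWritable");
    }
}
```

The point of the static gate is that MapTask can fall back to the Java collector silently when any precondition fails, rather than erroring at runtime inside native code.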
[jira] [Commented] (MAPREDUCE-5952) LocalContainerLauncher#renameMapOutputForReduce incorrectly assumes a single dir for mapOutIndex
[ https://issues.apache.org/jira/browse/MAPREDUCE-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064900#comment-14064900 ]

Hudson commented on MAPREDUCE-5952:
-----------------------------------
FAILURE: Integrated in Hadoop-Mapreduce-trunk #1834 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1834/])
MAPREDUCE-5952. LocalContainerLauncher#renameMapOutputForReduce incorrectly assumes a single dir for mapOutIndex. (Gera Shegalov via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1611196)
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/LocalContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapred/TestLocalContainerLauncher.java

LocalContainerLauncher#renameMapOutputForReduce incorrectly assumes a single dir for mapOutIndex
------------------------------------------------------------------------------------------------
Key: MAPREDUCE-5952
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5952
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am, mrv2
Affects Versions: 2.3.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
Priority: Blocker
Fix For: 2.5.0
Attachments: MAPREDUCE-5952.v01.patch, MAPREDUCE-5952.v02.patch, MAPREDUCE-5952.v03.patch, MAPREDUCE-5952.v04.patch
[jira] [Commented] (MAPREDUCE-5971) Move the default options for distcp -p to DistCpOptionSwitch
[ https://issues.apache.org/jira/browse/MAPREDUCE-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064922#comment-14064922 ]

Hudson commented on MAPREDUCE-5971:
-----------------------------------
FAILURE: Integrated in Hadoop-Hdfs-trunk #1807 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1807/])
MAPREDUCE-5971. Move the default options for distcp -p to DistCpOptionSwitch. Contributed by Charles Lamb. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1611217)
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptionSwitch.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java

Move the default options for distcp -p to DistCpOptionSwitch
------------------------------------------------------------
Key: MAPREDUCE-5971
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5971
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: distcp
Affects Versions: trunk
Reporter: Charles Lamb
Assignee: Charles Lamb
Priority: Trivial
Fix For: 2.6.0
Attachments: MAPREDUCE-5971.001.patch, MAPREDUCE-5971.002.patch
[jira] [Commented] (MAPREDUCE-5952) LocalContainerLauncher#renameMapOutputForReduce incorrectly assumes a single dir for mapOutIndex
[ https://issues.apache.org/jira/browse/MAPREDUCE-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064921#comment-14064921 ]

Hudson commented on MAPREDUCE-5952:
-----------------------------------
FAILURE: Integrated in Hadoop-Hdfs-trunk #1807 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1807/])
MAPREDUCE-5952. LocalContainerLauncher#renameMapOutputForReduce incorrectly assumes a single dir for mapOutIndex. (Gera Shegalov via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1611196)
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/LocalContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapred/TestLocalContainerLauncher.java

LocalContainerLauncher#renameMapOutputForReduce incorrectly assumes a single dir for mapOutIndex
------------------------------------------------------------------------------------------------
Key: MAPREDUCE-5952
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5952
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am, mrv2
Affects Versions: 2.3.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
Priority: Blocker
Fix For: 2.5.0
Attachments: MAPREDUCE-5952.v01.patch, MAPREDUCE-5952.v02.patch, MAPREDUCE-5952.v03.patch, MAPREDUCE-5952.v04.patch
[jira] [Updated] (MAPREDUCE-5957) AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used
[ https://issues.apache.org/jira/browse/MAPREDUCE-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sangjin Lee updated MAPREDUCE-5957:
-----------------------------------
Status: Open (was: Patch Available)

AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used
-------------------------------------------------------------------------------------------------------
Key: MAPREDUCE-5957
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5957
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Sangjin Lee
Attachments: MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch
[jira] [Updated] (MAPREDUCE-5957) AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used
[ https://issues.apache.org/jira/browse/MAPREDUCE-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sangjin Lee updated MAPREDUCE-5957:
-----------------------------------
Attachment: MAPREDUCE-5957.patch

Fixed the javadoc.

AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used
-------------------------------------------------------------------------------------------------------
Key: MAPREDUCE-5957
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5957
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Sangjin Lee
Attachments: MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch
[jira] [Updated] (MAPREDUCE-5957) AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used
[ https://issues.apache.org/jira/browse/MAPREDUCE-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated MAPREDUCE-5957: --- Status: Patch Available (was: Open) AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used --- Key: MAPREDUCE-5957 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5957 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.4.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch With the job classloader enabled, the MR AM throws ClassNotFoundException if a custom output format class is specified. {noformat} org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:473) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:374) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1459) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1456) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1389) Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1895) at org.apache.hadoop.mapreduce.task.JobContextImpl.getOutputFormatClass(JobContextImpl.java:222) at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:469) ... 8 more Caused by: java.lang.ClassNotFoundException: Class com.foo.test.TestOutputFormat not found at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1893) ... 10 more {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
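The stack trace shows Configuration.getClass failing before the user jar's classes are reachable. The failure mode can be illustrated with a plain-JDK sketch (hypothetical stand-alone code, not the AM or the patch itself): a class is only resolvable if the loader consulted actually holds it, so unless the job classloader is installed before getOutputFormatClass runs, the user's class cannot be found.

```java
// Minimal sketch (not Hadoop code) of the failure mode: class resolution
// goes through a specific loader, and com.foo.test.TestOutputFormat is only
// visible if the job classloader holding the user jar was installed first.
public class ClassLoaderSketch {
    static boolean resolvable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Framework and JDK classes resolve through the default loader...
        assert resolvable("java.lang.String");
        // ...but the user's output format is not on this classpath, which
        // mirrors the CNFE thrown from createOutputCommitter in the AM.
        assert !resolvable("com.foo.test.TestOutputFormat");
        System.out.println("ok");
    }
}
```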
[jira] [Resolved] (MAPREDUCE-4085) Kill task attempts longer than a configured queue max time
[ https://issues.apache.org/jira/browse/MAPREDUCE-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-4085. - Resolution: Won't Fix Kill task attempts longer than a configured queue max time -- Key: MAPREDUCE-4085 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4085 Project: Hadoop Map/Reduce Issue Type: New Feature Components: task Reporter: Allen Wittenauer Attachments: MAPREDUCE-4085-branch-1.0.4.txt, MAPREDUCE-4085-branch-1.0.txt For some environments, it is desirable to have certain queues have an SLA with regards to task turnover. (i.e., a slot will be free in X minutes and scheduled to the appropriate job) Queues should have a 'task time limit' that would cause task attempts over this time to be killed. This leaves open the possibility that if the task was on a bad node, it could still be rescheduled up to max.task.attempt times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-4058) adjustable task priority
[ https://issues.apache.org/jira/browse/MAPREDUCE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-4058. - Resolution: Won't Fix Y! was able to get a different patch into 0.23 and 2.x that provides similar (if limited) capability. adjustable task priority Key: MAPREDUCE-4058 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4058 Project: Hadoop Map/Reduce Issue Type: New Feature Components: task-controller Affects Versions: 1.0.0 Reporter: Allen Wittenauer Assignee: Mark Wagner Attachments: MAPREDUCE-4058-branch-1.0.patch For those of us that completely destroy our CPUs, it is beneficial to be able to run user tasks at a different priority than the tasktracker. This would allow for TTs (and by extension, DNs) to get more CPU clock cycles so that things like heartbeats don't disappear. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-2378) Reduce fails when running on 1 small file.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-2378. - Resolution: Cannot Reproduce Closing this as 'cannot reproduce' as log4j has since been upgraded. A few times, actually. Reduce fails when running on 1 small file. --- Key: MAPREDUCE-2378 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2378 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.21.0 Environment: java version 1.6.0_07 Diablo Java(TM) SE Runtime Environment (build 1.6.0_07-b02) Diablo Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) Reporter: Simon Dircks Labels: 1, failed, file, log4j, reduce, single, small, tiny Attachments: failed reduce task log.html If i run the wordcount example on 1 small (less than 2MB) file i get the following error: log4j:ERROR Failed to flush writer, java.io.InterruptedIOException at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:260) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212) at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:58) at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:316) at org.apache.log4j.WriterAppender.append(WriterAppender.java:160) at org.apache.hadoop.mapred.TaskLogAppender.append(TaskLogAppender.java:58) at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) at org.apache.log4j.Category.callAppenders(Category.java:206) at org.apache.log4j.Category.forcedLog(Category.java:391) at org.apache.log4j.Category.log(Category.java:856) at 
org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:199) at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.freeHost(ShuffleScheduler.java:345) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:152) If I run the wordcount test with 2 files, it works fine. I have actually reproduced this with my own code: I am working on something that requires me to map/reduce a small file, and I had to work around the problem by splitting the file into two 1MB pieces for my job to run. All our jobs that run on a single larger file (over 1GB) work flawlessly. I am not exactly sure of the threshold; from the testing I have done it seems to be any file smaller than the default HDFS block size (64MB). Sometimes it seems random in the 5-64MB range, but it fails 100% of the time for files of 5MB and smaller. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-2444) error connect tasktracker for jobtracker
[ https://issues.apache.org/jira/browse/MAPREDUCE-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-2444. - Resolution: Duplicate This was essentially resolved with the move to Protobuf for the RPC layer. error connect tasktracker for jobtracker Key: MAPREDUCE-2444 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2444 Project: Hadoop Map/Reduce Issue Type: Bug Components: tasktracker Affects Versions: 0.20.1 Reporter: Alexey Diomin Labels: hadoop In TaskTracker.java, when creating the connection to the JobTracker, we compare build versions: if (!VersionInfo.getBuildVersion().equals(jobTrackerBV)) but public static String getBuildVersion() { return VersionInfo.getVersion() + " from " + VersionInfo.getRevision() + " by " + VersionInfo.getUser() + " source checksum " + VersionInfo.getSrcChecksum(); } As a result, builds with identical version/revision/srcChecksum but compiled by different users are treated as incompatible. -- This message was sent by Atlassian JIRA (v6.2#6252)
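The effect described in MAPREDUCE-2444 can be shown with a small stand-alone sketch; the method below is a hypothetical mirror of the build-version string construction, not the actual VersionInfo code. Because the builder's user name is baked into the compared string, two builds that agree on version, revision, and source checksum still fail the equality check.

```java
public class BuildVersionCheck {
    // Hypothetical mirror of getBuildVersion(): the user name is part of
    // the compared string, so builds by different users never match even
    // when version, revision, and source checksum all agree.
    static String buildVersion(String ver, String rev, String user, String checksum) {
        return ver + " from " + rev + " by " + user + " source checksum " + checksum;
    }

    public static void main(String[] args) {
        String a = buildVersion("0.20.1", "r123", "alice", "abc123");
        String b = buildVersion("0.20.1", "r123", "bob", "abc123");
        // Same code, different build user: the TaskTracker-style equality
        // check rejects the pair as incompatible.
        System.out.println(a.equals(b)); // prints false
    }
}
```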
[jira] [Commented] (MAPREDUCE-2841) Task level native optimization
[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065042#comment-14065042 ] Todd Lipcon commented on MAPREDUCE-2841: Hey Sean. Something seems to be wrong with that patchfile -- some of the files seem to be present 11 times in it: {code} todd@todd-ThinkPad-T540p:~$ grep '+++.*TextSerializer' hadoop-3.0-mapreduce-2841-2014-7-17.patch | less -S | wc -l 11 {code} That might also explain why the file is too large to upload here as an attachment. Could you try to regenerate the patch? Task level native optimization -- Key: MAPREDUCE-2841 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841 Project: Hadoop Map/Reduce Issue Type: Improvement Components: task Environment: x86-64 Linux/Unix Reporter: Binglin Chang Assignee: Sean Zhong Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, fb-shuffle.patch I'm recently working on native optimization for MapTask based on JNI. The basic idea is that, add a NativeMapOutputCollector to handle k/v pairs emitted by mapper, therefore sort, spill, IFile serialization can all be done in native code, preliminary test(on Xeon E5410, jdk6u24) showed promising results: 1. Sort is about 3x-10x as fast as java(only binary string compare is supported) 2. IFile serialization speed is about 3x of java, about 500MB/s, if hardware CRC32C is used, things can get much faster(1G/ 3. Merge code is not completed yet, so the test use enough io.sort.mb to prevent mid-spill This leads to a total speed up of 2x~3x for the whole MapTask, if IdentityMapper(mapper does nothing) is used There are limitations of course, currently only Text and BytesWritable is supported, and I have not think through many things right now, such as how to support map side combine. 
I had some discussion with people familiar with Hive; it seems these limitations won't be much of a problem for Hive to benefit from those optimizations, at least. Advice or discussion about improving compatibility is most welcome. :) Currently NativeMapOutputCollector has a static method called canEnable(), which checks whether the key/value types, comparator type, and combiner are all compatible; MapTask can then choose to enable NativeMapOutputCollector. This is only a preliminary test; more work needs to be done. I expect better final results, and I believe similar optimizations can be adopted for the reduce task and shuffle too. -- This message was sent by Atlassian JIRA (v6.2#6252)
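The canEnable() gating described in the comment above can be sketched roughly as follows. This is an illustrative stand-alone version, not the actual Hadoop code: String and byte[] stand in for Text and BytesWritable, and the method signature is invented for the example.

```java
// Hypothetical sketch of canEnable()-style gating: the task checks type
// compatibility before swapping in the native collector, since only
// binary-comparable key/value types are supported and map-side combine
// is not yet handled natively.
public class NativeCollectorGate {
    static boolean canEnable(Class<?> keyClass, Class<?> valueClass,
                             boolean hasCombiner) {
        boolean keyOk = keyClass == String.class || keyClass == byte[].class;
        boolean valueOk = valueClass == String.class || valueClass == byte[].class;
        return keyOk && valueOk && !hasCombiner;
    }

    public static void main(String[] args) {
        // Supported types, no combiner: the native path can be enabled.
        assert canEnable(String.class, byte[].class, false);
        // Unsupported key type: fall back to the Java collector.
        assert !canEnable(Integer.class, String.class, false);
        System.out.println("ok");
    }
}
```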
[jira] [Updated] (MAPREDUCE-2841) Task level native optimization
[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated MAPREDUCE-2841: -- Attachment: hadoop-3.0-mapreduce-2841-2014-7-17.patch Task level native optimization -- Key: MAPREDUCE-2841 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841 Project: Hadoop Map/Reduce Issue Type: Improvement Components: task Environment: x86-64 Linux/Unix Reporter: Binglin Chang Assignee: Sean Zhong Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, fb-shuffle.patch, hadoop-3.0-mapreduce-2841-2014-7-17.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAPREDUCE-5974) Allow map output collector fallback
Todd Lipcon created MAPREDUCE-5974: -- Summary: Allow map output collector fallback Key: MAPREDUCE-5974 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5974 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: task Affects Versions: 2.6.0 Reporter: Todd Lipcon Currently we only allow specifying a single MapOutputCollector implementation class in a job. It would be nice to allow a comma-separated list of classes: we should try each collector implementation in the user-specified order until we find one that can be successfully instantiated and initted. This is useful for cases where a particular optimized collector implementation cannot operate on all key/value types, or requires native code. The cluster administrator can configure the cluster to try to use the optimized collector and fall back to the default collector. -- This message was sent by Atlassian JIRA (v6.2#6252)
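The try-in-order behavior proposed in MAPREDUCE-5974 can be sketched as a simple reflective loop. Class names below are illustrative placeholders, not the actual Hadoop collector classes, and the return type is simplified for the example.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of collector fallback: try each configured class in order and
// keep the first one that can be instantiated; e.g. an optimized native
// collector missing its JNI library on a node falls through to the next
// candidate rather than failing the task.
public class CollectorFallback {
    static Object tryCollectors(List<String> classNames) {
        for (String name : classNames) {
            try {
                Class<?> clazz = Class.forName(name);
                return clazz.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | LinkageError e) {
                // Instantiation or init failed; try the next class.
            }
        }
        throw new RuntimeException("no usable collector in " + classNames);
    }

    public static void main(String[] args) {
        // "com.example.NativeCollector" does not exist here, so we fall
        // back to a class that is always available on the classpath.
        Object c = tryCollectors(Arrays.asList(
                "com.example.NativeCollector", "java.util.ArrayList"));
        System.out.println(c.getClass().getName()); // prints java.util.ArrayList
    }
}
```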
[jira] [Commented] (MAPREDUCE-2841) Task level native optimization
[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065064#comment-14065064 ] Sean Zhong commented on MAPREDUCE-2841: --- Ah, thanks for pointing this out. I am not sure why this happened. I just uploaded the patch to this jira: https://issues.apache.org/jira/secure/attachment/12656288/hadoop-3.0-mapreduce-2841-2014-7-17.patch Updates: 1. Removed HBase/Hive/Mahout/Pig related code; that code will be posted in another jira or hosted on GitHub. 2. Use ServiceLoader to discover custom platforms (to support custom key types). Task level native optimization -- Key: MAPREDUCE-2841 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841 Project: Hadoop Map/Reduce Issue Type: Improvement Components: task Environment: x86-64 Linux/Unix Reporter: Binglin Chang Assignee: Sean Zhong Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, fb-shuffle.patch, hadoop-3.0-mapreduce-2841-2014-7-17.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
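The ServiceLoader mechanism mentioned in the update above works roughly as follows. This is a generic JDK sketch (using Runnable as a stand-in service interface, not the patch's actual platform interface): providers are discovered via META-INF/services entries on the classpath, so a platform jar can register support for custom key types without any code change in the framework.

```java
import java.util.ServiceLoader;

// Generic sketch of ServiceLoader-based plugin discovery: the framework
// iterates whatever providers are registered on the classpath via
// META-INF/services/<interface-name> files.
public class PlatformDiscovery {
    static <T> int countProviders(Class<T> service) {
        int n = 0;
        for (T ignored : ServiceLoader.load(service)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // With no provider-configuration file on the classpath, discovery
        // simply yields nothing: unsupported key types are skipped rather
        // than causing an error.
        System.out.println(countProviders(Runnable.class)); // prints 0
    }
}
```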
[jira] [Commented] (MAPREDUCE-5957) AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used
[ https://issues.apache.org/jira/browse/MAPREDUCE-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065068#comment-14065068 ] Hadoop QA commented on MAPREDUCE-5957: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656274/MAPREDUCE-5957.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4747//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4747//console This message is automatically generated. 
AM throws ClassNotFoundException with job classloader enabled if custom output format/committer is used --- Key: MAPREDUCE-5957 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5957 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.4.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch, MAPREDUCE-5957.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-296) job statistics should be displayed in the web/ui
[ https://issues.apache.org/jira/browse/MAPREDUCE-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-296. Resolution: Duplicate This was done back in 0.20 as part of the y! security merge. job statistics should be displayed in the web/ui Key: MAPREDUCE-296 URL: https://issues.apache.org/jira/browse/MAPREDUCE-296 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Owen O'Malley Labels: newbie It would be really nice, if the job page in the web/ui showed the time that: 1. first map started 2. last map finished 3. last reduce finished shuffle 4. last reduce finished sort 5. last reduce finished -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-603) Fix unchecked warnings in contrib code
[ https://issues.apache.org/jira/browse/MAPREDUCE-603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-603. Resolution: Unresolved Stale. Closing this out. Fix unchecked warnings in contrib code -- Key: MAPREDUCE-603 URL: https://issues.apache.org/jira/browse/MAPREDUCE-603 Project: Hadoop Map/Reduce Issue Type: Bug Components: contrib/streaming Reporter: Tom White There are unchecked warnings in abacus, data_join and streaming. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-135) speculative task failure can kill jobs
[ https://issues.apache.org/jira/browse/MAPREDUCE-135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-135. Resolution: Incomplete I'm going to close this out as stale. I suspect this is no longer an issue. speculative task failure can kill jobs -- Key: MAPREDUCE-135 URL: https://issues.apache.org/jira/browse/MAPREDUCE-135 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Owen O'Malley We had a case where the random writer example was killed by speculative execution. It happened like: task_0001_m_000123_0 - starts task_0001_m_000123_1 - starts and fails because attempt 0 is creating the file task_0001_m_000123_2 - starts and fails because attempt 0 is creating the file task_0001_m_000123_3 - starts and fails because attempt 0 is creating the file task_0001_m_000123_4 - starts and fails because attempt 0 is creating the file job_0001 is killed because map_000123 failed 4 times. From this experience, I think we should change the scheduling so that: 1. Tasks are only allowed 1 speculative attempt. 2. TIPs don't kill jobs until they have 4 failures AND the last task under that tip fails. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-585) A corrupt text file causes the maps to hang
[ https://issues.apache.org/jira/browse/MAPREDUCE-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-585. Resolution: Incomplete I'm going to close this out as stale. I suspect this is no longer an issue. A corrupt text file causes the maps to hang --- Key: MAPREDUCE-585 URL: https://issues.apache.org/jira/browse/MAPREDUCE-585 Project: Hadoop Map/Reduce Issue Type: Bug Components: contrib/streaming Reporter: Mahadev konar Priority: Minor A corrupt file hangs a map. The map keeps reading the same record again and again and never finishes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-167) SAXParseException causes test to run forever
[ https://issues.apache.org/jira/browse/MAPREDUCE-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-167. Resolution: Incomplete I'm going to close this out as stale. I suspect this is no longer an issue. SAXParseException causes test to run forever Key: MAPREDUCE-167 URL: https://issues.apache.org/jira/browse/MAPREDUCE-167 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Nigel Daley Attachments: thread.dump.txt Occassionally, while running TestMiniMRClasspath, I get a SAXParseException that causes the test to run forever. Two questions I have: 1) what is the underlying cause of the SAXParseException? 2) does the JobTracker realize that a task got lost? Here's the pertinent test trace: [junit] 2007-02-13 19:26:56,058 INFO mapred.JobClient (JobClient.java:runJob(400)) - Running job: job_0001 [junit] 2007-02-13 19:26:57,062 INFO mapred.JobClient (JobClient.java:runJob(417)) - map 0% reduce 0% [junit] 2007-02-13 19:27:05,258 INFO mapred.JobInProgress (JobInProgress.java:findNewTask(421)) - Choosing cached task tip_0001_m_00 [junit] 2007-02-13 19:27:05,259 INFO mapred.JobTracker (JobTracker.java:createTaskEntry(690)) - Adding task 'task_0001_m_00_0' to tip tip_0001_m_00, for tracker 'tracker_ucdev15.yst.corp.yahoo.com:50067' [junit] 2007-02-13 19:27:05,260 INFO mapred.JobInProgress (JobInProgress.java:findNewTask(421)) - Choosing cached task tip_0001_m_01 [junit] 2007-02-13 19:27:05,261 INFO mapred.JobTracker (JobTracker.java:createTaskEntry(690)) - Adding task 'task_0001_m_01_0' to tip tip_0001_m_01, for tracker 'tracker_ucdev15.yst.corp.yahoo.com:50063' [junit] 2007-02-13 19:27:05,262 INFO mapred.TaskTracker (TaskTracker.java:startNewTask(822)) - LaunchTaskAction: task_0001_m_00_0 [junit] 2007-02-13 19:27:05,262 INFO mapred.JobInProgress (JobInProgress.java:findNewTask(421)) - Choosing cached task tip_0001_m_02 [junit] 2007-02-13 19:27:05,263 INFO mapred.JobTracker (JobTracker.java:createTaskEntry(690)) - 
Adding task 'task_0001_m_02_0' to tip tip_0001_m_02, for tracker 'tracker_ucdev15.yst.corp.yahoo.com:50066' [junit] 2007-02-13 19:27:05,263 INFO mapred.TaskTracker (TaskTracker.java:startNewTask(822)) - LaunchTaskAction: task_0001_m_01_0 [junit] 2007-02-13 19:27:05,267 INFO mapred.TaskTracker (TaskTracker.java:startNewTask(822)) - LaunchTaskAction: task_0001_m_02_0 [junit] 2007-02-13 19:27:05,270 INFO mapred.JobInProgress (JobInProgress.java:findNewTask(453)) - Choosing normal task tip_0001_r_00 [junit] 2007-02-13 19:27:05,270 INFO mapred.JobTracker (JobTracker.java:createTaskEntry(690)) - Adding task 'task_0001_r_00_0' to tip tip_0001_r_00, for tracker 'tracker_ucdev15.yst.corp.yahoo.com:50062' [junit] 2007-02-13 19:27:05,271 INFO mapred.TaskTracker (TaskTracker.java:startNewTask(822)) - LaunchTaskAction: task_0001_r_00_0 [junit] 2007-02-13 19:27:05,285 INFO dfs.DataNode (DataNode.java:readBlock(719)) - Served block blk_-4805938806139473507 to /66.228.166.95 [junit] 2007-02-13 19:27:05,289 INFO dfs.DataNode (DataNode.java:readBlock(719)) - Served block blk_-4805938806139473507 to /66.228.166.95 [junit] 2007-02-13 19:27:05,292 INFO dfs.DataNode (DataNode.java:readBlock(719)) - Served block blk_-4805938806139473507 to /66.228.166.95 [junit] 2007-02-13 19:27:05,295 INFO dfs.DataNode (DataNode.java:readBlock(719)) - Served block blk_-4805938806139473507 to /66.228.166.95 [junit] 2007-02-13 19:27:05,312 INFO dfs.DataNode (DataNode.java:readBlock(719)) - Served block blk_3019208026182045172 to /66.228.166.95 [junit] 2007-02-13 19:27:05,312 INFO dfs.DataNode (DataNode.java:readBlock(719)) - Served block blk_3019208026182045172 to /66.228.166.95 [junit] 2007-02-13 19:27:05,352 INFO dfs.DataNode (DataNode.java:readBlock(719)) - Served block blk_-1390246588917827761 to /66.228.166.95 [junit] 2007-02-13 19:27:05,355 INFO dfs.DataNode (DataNode.java:readBlock(719)) - Served block blk_-1390246588917827761 to /66.228.166.95 [junit] 2007-02-13 19:27:05,367 INFO dfs.DataNode 
(DataNode.java:readBlock(719)) - Served block blk_4739954315939188869 to /66.228.166.95 [junit] 2007-02-13 19:27:05,368 INFO dfs.DataNode (DataNode.java:readBlock(719)) - Served block blk_3019208026182045172 to /66.228.166.95 [junit] 2007-02-13 19:27:05,367 INFO dfs.DataNode (DataNode.java:readBlock(719)) - Served block blk_4739954315939188869 to /66.228.166.95 [junit] 2007-02-13 19:27:05,416 FATAL conf.Configuration (Configuration.java:loadResource(552)) - error parsing conf file:
[jira] [Resolved] (MAPREDUCE-584) In Streaming, crashes after all the input is consumed, are not detected
[ https://issues.apache.org/jira/browse/MAPREDUCE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-584. Resolution: Fixed I'm going to close this out as stale. I suspect this is no longer an issue. In Streaming, crashes after all the input is consumed, are not detected --- Key: MAPREDUCE-584 URL: https://issues.apache.org/jira/browse/MAPREDUCE-584 Project: Hadoop Map/Reduce Issue Type: Bug Components: contrib/streaming Reporter: arkady borkovsky Priority: Minor In a Hadoop Streaming, if the user code crashes after all the input has been consumed, the framework considers the process to be successful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-306) job submission protocol should have a method for getting the task capacity of the cluster
[ https://issues.apache.org/jira/browse/MAPREDUCE-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-306. Resolution: Fixed Now that we have more experience, it is generally recognized that using all of the map slots for a job is a terrible idea. Closing. job submission protocol should have a method for getting the task capacity of the cluster --- Key: MAPREDUCE-306 URL: https://issues.apache.org/jira/browse/MAPREDUCE-306 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Owen O'Malley It would help the InputFormats make informed decisions if the JobSubmissionProtocol had a method the returned the number of tasks that the cluster can run at once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-191) Mapper fail rate increases significantly as the number of reduces increase
[ https://issues.apache.org/jira/browse/MAPREDUCE-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-191. Resolution: Cannot Reproduce I'm going to close this out as stale. I suspect this is no longer an issue. Mapper fail rate increases significantly as the number of reduces increase -- Key: MAPREDUCE-191 URL: https://issues.apache.org/jira/browse/MAPREDUCE-191 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Runping Qi I ran a large sort job, with about 8400 mappers. In the first run, I used 301 reducers. About 600 mapper tasks failed. In another run, I used 607 reducers. More than 3800 mapper tasks failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-172) Reducers stuck in 'sort'
[ https://issues.apache.org/jira/browse/MAPREDUCE-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-172. Resolution: Fixed I'm going to close this out as stale. I suspect this is no longer an issue. Reducers stuck in 'sort' Key: MAPREDUCE-172 URL: https://issues.apache.org/jira/browse/MAPREDUCE-172 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Arun C Murthy A couple of reduces seem stuck on a small 20-node cluster in the 'sort' phase for almost an hour: TaskTracker logs: 2007-03-28 14:13:46,471 INFO org.apache.hadoop.mapred.TaskTracker: task_0002_r_05_0 0.3334% reduce sort 2007-03-28 14:13:46,478 INFO org.apache.hadoop.mapred.TaskTracker: task_0002_r_09_0 0.3334% reduce sort 2007-03-28 14:13:47,476 INFO org.apache.hadoop.mapred.TaskTracker: task_0002_r_05_0 0.3334% reduce sort 2007-03-28 14:13:47,483 INFO org.apache.hadoop.mapred.TaskTracker: task_0002_r_09_0 0.3334% reduce sort ... ... ... 2007-03-28 15:06:04,376 INFO org.apache.hadoop.mapred.TaskTracker: task_0002_r_05_0 0.3334% reduce sort 2007-03-28 15:06:04,411 INFO org.apache.hadoop.mapred.TaskTracker: task_0002_r_09_0 0.3334% reduce sort 2007-03-28 15:06:05,379 INFO org.apache.hadoop.mapred.TaskTracker: task_0002_r_05_0 0.3334% reduce sort 2007-03-28 15:06:05,414 INFO org.apache.hadoop.mapred.TaskTracker: task_0002_r_09_0 0.3334% reduce sort Eventually the JobTracker declared the same TT 'lost' (presumably for no heartbeats): 2007-03-28 15:18:20,341 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker 'tracker_XXX:9020' -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-602) The streaming code should be moved from contrib to Hadoop main framework
[ https://issues.apache.org/jira/browse/MAPREDUCE-602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-602. Resolution: Fixed This happened forever ago. The streaming code should be moved from contrib to Hadoop main framework Key: MAPREDUCE-602 URL: https://issues.apache.org/jira/browse/MAPREDUCE-602 Project: Hadoop Map/Reduce Issue Type: Bug Components: contrib/streaming Reporter: Runping Qi Before the actual move, the code needs a bit of further clean up in the following areas: 1. coding style/convention, and code quality 2. XMLRecordReader: the current implementation is too hacky. 3. Better javadoc -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-417) Logging could hang/fail when drive is filled by mapred outputs.
[ https://issues.apache.org/jira/browse/MAPREDUCE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-417. Resolution: Fixed I'm going to mark this as fixed due to how user logging is now handled. Logging could hang/fail when drive is filled by mapred outputs. --- Key: MAPREDUCE-417 URL: https://issues.apache.org/jira/browse/MAPREDUCE-417 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Koji Noguchi Priority: Minor HADOOP-1252 addresses the mapred disk problems. In addition to those problems, if mapred fills up the drive used for logging, it might affect TaskTracker/DataNodes. Simple solution for now is not to use the logging drive in MapReduce. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-452) tasktracker checkpointing capability
[ https://issues.apache.org/jira/browse/MAPREDUCE-452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-452. Resolution: Fixed Marking this as fixed since YARN provides this capability. tasktracker checkpointing capability Key: MAPREDUCE-452 URL: https://issues.apache.org/jira/browse/MAPREDUCE-452 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Pete Wyckoff Priority: Minor This relates to allowing a resource manager (e.g., hadoop on demand) to grow and (rarely) shrink jobs on the fly. Growing is already supported. Shrinking could be done in 2 ways - (1) consider the machine dead and allow speculative execution to take care of it or (2) moving the existing map outputs from that machine somewhere else (another machine, dfs) - task tracker checkpointing In the case of IO only intensive jobs, checkpointing the tasktracker doesn't do much for you. But, in the case of CPU or other scarce resource (e.g., a DB or Webpage cache...), the checkpointing could be very useful. The question is how often is this the case and how useful? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-107) Tasks fail due to lost mapout
[ https://issues.apache.org/jira/browse/MAPREDUCE-107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-107. Resolution: Fixed I'm going to close this out as stale. I suspect this is no longer an issue. Tasks fail due to lost mapout - Key: MAPREDUCE-107 URL: https://issues.apache.org/jira/browse/MAPREDUCE-107 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Hairong Kuang When I ran a job in which each map task generates gigabytes of map output, I saw many tasks fail with the following errors: Map output lost, rescheduling: getMapOutput(task_0993_m_13_0,140) failed : java.io.FileNotFoundException: /hadoop/mapred/local/task_0993_m_13_0/file.out at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:332) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245) at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1657) at javax.servlet.http.HttpServlet.service(HttpServlet.java:689) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427) at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567) at org.mortbay.http.HttpContext.handle(HttpContext.java:1565) at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635) at org.mortbay.http.HttpContext.handle(HttpContext.java:1517) at org.mortbay.http.HttpServer.service(HttpServer.java:954) at org.mortbay.http.HttpConnection.service(HttpConnection.java:814) at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981) at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831) at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244) at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357) at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534) -- This 
message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-319) Use Grizzly for Fetching Map Output in Shuffle
[ https://issues.apache.org/jira/browse/MAPREDUCE-319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-319. Resolution: Fixed Yes. Closing. Use Grizzly for Fetching Map Output in Shuffle -- Key: MAPREDUCE-319 URL: https://issues.apache.org/jira/browse/MAPREDUCE-319 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Tahir Hashmi Assignee: Devaraj Das Attachments: 1432.patch, grizzly.tgz As mentioned in HADOOP-1273 and references therefrom, Jetty 6 still doesn't seem to be stable enough for use in Hadoop. Instead, we've decided to consider the usage of Grizzly Framework [https://grizzly.dev.java.net/] for NIO based communication. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-316) Splittability of input should be controllable by application
[ https://issues.apache.org/jira/browse/MAPREDUCE-316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-316. Resolution: Won't Fix Closing this as won't fix. As was pointed out, there are other ways. Splittability of input should be controllable by application Key: MAPREDUCE-316 URL: https://issues.apache.org/jira/browse/MAPREDUCE-316 Project: Hadoop Map/Reduce Issue Type: Improvement Environment: ALL Reporter: Milind Bhandarkar Assignee: Senthil Subramanian Attachments: HADOOP-1441_1.patch Currently, isSplittable method of FileInputFormat always returns true. For some applications, it becomes necessary that the map task process entire file, rather than a block. Therefore, splittability of input (i.e. block-level split vs file-level-split) should be controllable by user via a configuration variable. The default could be block-level split, as is. -- This message was sent by Atlassian JIRA (v6.2#6252)
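The block-level vs file-level choice discussed in this issue can be sketched without Hadoop at all. The real control point in Hadoop is FileInputFormat's isSplitable() method; the SplitPlanner class and planSplits helper below are hypothetical names used purely for illustration, not the Hadoop implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPlanner {
    // Compute (offset, length) splits for a file: block-sized chunks when
    // the input is splittable, or a single whole-file split when it is not.
    public static List<long[]> planSplits(long fileLen, long blockSize, boolean splittable) {
        List<long[]> splits = new ArrayList<>();
        if (!splittable || fileLen <= blockSize) {
            // File-level split: one map task sees the entire file.
            splits.add(new long[] { 0, fileLen });
            return splits;
        }
        // Block-level split: one map task per block-sized chunk.
        for (long off = 0; off < fileLen; off += blockSize) {
            splits.add(new long[] { off, Math.min(blockSize, fileLen - off) });
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(planSplits(250, 100, true).size());  // 3 block-level splits
        System.out.println(planSplits(250, 100, false).size()); // 1 file-level split
    }
}
```

As the resolution notes, applications today can get this behavior through other means, such as overriding isSplitable() in a custom input format.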
[jira] [Commented] (MAPREDUCE-320) Map/Reduce should use IP addresses to identify nodes rather than hostnames
[ https://issues.apache.org/jira/browse/MAPREDUCE-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065159#comment-14065159 ] Allen Wittenauer commented on MAPREDUCE-320: I'm tempted to close this as won't fix. The fundamental problem is that the MR framework (and all other Hadoop systems) need a way to distinguish a process as the same host on machines with multiple interfaces. This is certainly fixable... and the various token systems we have floating around may actually do that. But it's a tremendous amount of work (especially from a security perspective) and I don't see much interest in doing that work. Map/Reduce should use IP addresses to identify nodes rather than hostnames -- Key: MAPREDUCE-320 URL: https://issues.apache.org/jira/browse/MAPREDUCE-320 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Owen O'Malley Assignee: Owen O'Malley We should move the Map/Reduce framework to identify hosts as IP addresses rather than hostnames to prevent problems with DNS. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MAPREDUCE-5974) Allow map output collector fallback
[ https://issues.apache.org/jira/browse/MAPREDUCE-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned MAPREDUCE-5974: -- Assignee: Todd Lipcon Allow map output collector fallback --- Key: MAPREDUCE-5974 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5974 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: task Affects Versions: 2.6.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: mapreduce-5974.txt Currently we only allow specifying a single MapOutputCollector implementation class in a job. It would be nice to allow a comma-separated list of classes: we should try each collector implementation in the user-specified order until we find one that can be successfully instantiated and initted. This is useful for cases where a particular optimized collector implementation cannot operate on all key/value types, or requires native code. The cluster administrator can configure the cluster to try to use the optimized collector and fall back to the default collector. -- This message was sent by Atlassian JIRA (v6.2#6252)
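The fallback described in this issue amounts to iterating over candidate class names and keeping the first one that loads and instantiates cleanly. A minimal stand-alone sketch of that pattern follows; the class names are illustrative placeholders, not the real Hadoop configuration values, and the attached patch may differ in detail.

```java
import java.util.Arrays;
import java.util.List;

public class CollectorFallback {
    // Try each candidate class name in user-specified order; return the
    // first one that can be loaded and instantiated. Failures (missing
    // class, missing native code, etc.) fall through to the next candidate.
    public static Object createFirstUsable(List<String> classNames) {
        for (String name : classNames) {
            try {
                Class<?> clazz = Class.forName(name);
                return clazz.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                System.err.println("Collector " + name + " unavailable: " + e);
            }
        }
        throw new RuntimeException("No usable map output collector found");
    }

    public static void main(String[] args) {
        // "com.example.NativeCollector" does not exist, so we fall back to
        // java.util.ArrayList, standing in here for the default collector.
        Object c = createFirstUsable(Arrays.asList(
                "com.example.NativeCollector", "java.util.ArrayList"));
        System.out.println("Using: " + c.getClass().getName());
    }
}
```

In the real patch the candidate list would come from the job configuration as a comma-separated property, and a failure during init() (not just construction) would also trigger fallback.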
[jira] [Updated] (MAPREDUCE-5974) Allow map output collector fallback
[ https://issues.apache.org/jira/browse/MAPREDUCE-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated MAPREDUCE-5974: --- Attachment: mapreduce-5974.txt Attached patch implements the improvement as described. I did not include any new unit test since this code path is exercised by existing paths, and mocking out this section of the code is really quite difficult. If folks would like to see a unit test, I can add a full functional test which specifies some bogus collector class ahead of the real implementation, but figured that it's a trivial enough change we could get by without. Allow map output collector fallback --- Key: MAPREDUCE-5974 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5974 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: task Affects Versions: 2.6.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: mapreduce-5974.txt Currently we only allow specifying a single MapOutputCollector implementation class in a job. It would be nice to allow a comma-separated list of classes: we should try each collector implementation in the user-specified order until we find one that can be successfully instantiated and initted. This is useful for cases where a particular optimized collector implementation cannot operate on all key/value types, or requires native code. The cluster administrator can configure the cluster to try to use the optimized collector and fall back to the default collector. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-1766) Hadoop Streaming should not use TextInputFormat class as the default input format class.
[ https://issues.apache.org/jira/browse/MAPREDUCE-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-1766. - Resolution: Fixed Closing as stale, especially given that one can set the inputformat for streaming jobs now. Hadoop Streaming should not use TextInputFormat class as the default input format class. Key: MAPREDUCE-1766 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1766 Project: Hadoop Map/Reduce Issue Type: Bug Components: contrib/streaming Reporter: Runping Qi The TextInputFormat class does not work with IdentityMapper class. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5910) MRAppMaster should handle Resync from RM instead of shutting down.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated MAPREDUCE-5910: --- Status: Open (was: Patch Available) MRAppMaster should handle Resync from RM instead of shutting down. -- Key: MAPREDUCE-5910 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5910 Project: Hadoop Map/Reduce Issue Type: Task Components: applicationmaster Reporter: Rohith Assignee: Rohith Attachments: MAPREDUCE-5910.1.patch, MAPREDUCE-5910.2.patch, MAPREDUCE-5910.3.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The MRAppMaster behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5910) MRAppMaster should handle Resync from RM instead of shutting down.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated MAPREDUCE-5910: --- Attachment: MAPREDUCE-5910.4.patch I see, thanks for investigating. added one code comment myself, re-submit the patch. MRAppMaster should handle Resync from RM instead of shutting down. -- Key: MAPREDUCE-5910 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5910 Project: Hadoop Map/Reduce Issue Type: Task Components: applicationmaster Reporter: Rohith Assignee: Rohith Attachments: MAPREDUCE-5910.1.patch, MAPREDUCE-5910.2.patch, MAPREDUCE-5910.3.patch, MAPREDUCE-5910.4.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The MRAppMaster behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5910) MRAppMaster should handle Resync from RM instead of shutting down.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated MAPREDUCE-5910: --- Status: Patch Available (was: Open) MRAppMaster should handle Resync from RM instead of shutting down. -- Key: MAPREDUCE-5910 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5910 Project: Hadoop Map/Reduce Issue Type: Task Components: applicationmaster Reporter: Rohith Assignee: Rohith Attachments: MAPREDUCE-5910.1.patch, MAPREDUCE-5910.2.patch, MAPREDUCE-5910.3.patch, MAPREDUCE-5910.4.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The MRAppMaster behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-259) Rack-aware Shuffle
[ https://issues.apache.org/jira/browse/MAPREDUCE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-259. Resolution: Duplicate Rack-aware Shuffle -- Key: MAPREDUCE-259 URL: https://issues.apache.org/jira/browse/MAPREDUCE-259 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Arun C Murthy Assignee: Arun C Murthy We could try and experiment with *rack-aware* scheduling of fetches per-reducer. Given the disparities between in-rack and off-rack bandwidth it could be an improvement to do something along these lines: {noformat} if (no. of known map-output locations > no. of copier threads) { try to schedule 75% of copies off-rack try to schedule 25% of copies in-rack } {noformat} This could lead to better utilization of both in-rack and off-rack switch b/w... Clearly we want to schedule more cross-switch than in-rack since off-rack copies will take significantly more time; hence the 75-25 split. -- This message was sent by Atlassian JIRA (v6.2#6252)
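The proposed 75-25 split can be sketched as simple arithmetic, assuming the pending fetch count is the only input; a real scheduler would also consult rack topology and copier thread availability. RackAwareSplit is a hypothetical name for illustration only.

```java
public class RackAwareSplit {
    // Given a number of pending fetches, compute how many to schedule
    // off-rack vs in-rack using the 75-25 split suggested in the issue.
    public static int[] split(int pendingCopies) {
        int offRack = (int) Math.round(pendingCopies * 0.75);
        int inRack = pendingCopies - offRack;
        return new int[] { offRack, inRack };
    }

    public static void main(String[] args) {
        int[] s = split(20);
        System.out.println("off-rack=" + s[0] + " in-rack=" + s[1]); // off-rack=15 in-rack=5
    }
}
```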
[jira] [Resolved] (MAPREDUCE-426) Race condition in LaunchTaskAction and KillJobAction
[ https://issues.apache.org/jira/browse/MAPREDUCE-426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-426. Resolution: Fixed I'm going to close this as a stale issue. There have been a lot of race conditions fixed in this area and I suspect this is one of them. Race condition in LaunchTaskAction and KillJobAction Key: MAPREDUCE-426 URL: https://issues.apache.org/jira/browse/MAPREDUCE-426 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Koji Noguchi Priority: Minor One task wasn't killed when its job was killed. In the TaskTracker log, it showed: 2007-08-21 17:02:29,219 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: task_0133_r_80_2** 2007-08-21 17:02:29,232 INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_0131 ** 2007-08-21 17:02:29,233 INFO org.apache.hadoop.mapred.TaskRunner: task_0131_m_77_0 done; removing files. 2007-08-21 17:02:29,376 INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_0133 2007-08-21 17:02:29,376 INFO org.apache.hadoop.mapred.TaskRunner: task_0133_r_60_0 done; removing files. 2007-08-21 17:02:29,378 INFO org.apache.hadoop.mapred.TaskRunner: task_0133_r_71_2 done; removing files. 2007-08-21 17:02:29,381 INFO org.apache.hadoop.mapred.TaskRunner: task_0133_r_66_1 done; removing files. 2007-08-21 17:02:31,272 INFO org.apache.hadoop.mapred.TaskTracker: task_0133_r_80_2 0.0% reduce copy 2007-08-21 17:02:32,275 INFO org.apache.hadoop.mapred.TaskTracker: task_0133_r_80_2 0.0% reduce copy 2007-08-21 17:02:33,277 INFO org.apache.hadoop.mapred.TaskTracker: task_0133_r_80_2 0.0% reduce copy ... [task_0133_r_80_2 continued to run] Of course the JobTracker kept on complaining: 2007-08-22 19:06:37,880 INFO org.apache.hadoop.mapred.JobTracker: Serious problem. While updating status, cannot find taskid task_0133_r_80_2 2007-08-22 19:06:38,124 INFO org.apache.hadoop.mapred.JobTracker: Serious problem. 
While updating status, cannot find taskid task_0133_r_80_2 2007-08-22 19:06:47,885 INFO org.apache.hadoop.mapred.JobTracker: Serious problem. While updating status, cannot find taskid task_0133_r_80_2 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-338) Need more complete API of JobClient class
[ https://issues.apache.org/jira/browse/MAPREDUCE-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-338. Resolution: Fixed I'm going to close this as stale. We're now 7 years on and many API changes later, including getting the information being asked for in this JIRA. Need more complete API of JobClient class - Key: MAPREDUCE-338 URL: https://issues.apache.org/jira/browse/MAPREDUCE-338 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Runping Qi We need a programmatic way to find out the information about a map/reduce cluster and the jobs on the cluster. The current API is not complete. In particular, the following API functions are needed: 1. jobs() currently, there is an API function JobsToComplete, which returns running/waiting jobs only. jobs() should return the complete list. 2. TaskReport[] getMap/ReduceTaskReports(String jobid) 3. getStartTime() 4. getJobStatus(String jobid); 5. getJobProfile(String jobid); -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-2841) Task level native optimization
[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065207#comment-14065207 ] Todd Lipcon commented on MAPREDUCE-2841: Thanks Sean. This patch looks better. I committed it as the initial import onto the new feature branch (MR-2841). I had some issues building on my Ubuntu 13.10 system, but one of the purposes of the feature branch is to be able to iterate on it more collaboratively. I'll file a couple of subtasks for the issues I'm running into on my box. Task level native optimization -- Key: MAPREDUCE-2841 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841 Project: Hadoop Map/Reduce Issue Type: Improvement Components: task Environment: x86-64 Linux/Unix Reporter: Binglin Chang Assignee: Sean Zhong Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, fb-shuffle.patch, hadoop-3.0-mapreduce-2841-2014-7-17.patch I've recently been working on native optimization for MapTask based on JNI. The basic idea is to add a NativeMapOutputCollector to handle k/v pairs emitted by the mapper, so that sort, spill, and IFile serialization can all be done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed promising results: 1. Sort is about 3x-10x as fast as Java (only binary string comparison is supported) 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if hardware CRC32C is used, things can get much faster(1G/ 3. Merge code is not completed yet, so the test uses enough io.sort.mb to prevent mid-spill This leads to a total speedup of 2x~3x for the whole MapTask if IdentityMapper (a mapper that does nothing) is used. There are limitations of course; currently only Text and BytesWritable are supported, and I have not thought through many things right now, such as how to support map-side combine. 
I had some discussion with somebody familiar with Hive, and it seems that these limitations won't be much of a problem for Hive to benefit from these optimizations, at least. Advice or discussion about improving compatibility is most welcome :) Currently NativeMapOutputCollector has a static method called canEnable(), which checks whether the key/value types, comparator type, and combiner are all compatible; MapTask can then choose to enable NativeMapOutputCollector. This is only a preliminary test; more work needs to be done. I expect better final results, and I believe similar optimizations can be applied to the reduce task and shuffle too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-210) want InputFormat for zip files
[ https://issues.apache.org/jira/browse/MAPREDUCE-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065216#comment-14065216 ] Allen Wittenauer commented on MAPREDUCE-210: Ping! want InputFormat for zip files -- Key: MAPREDUCE-210 URL: https://issues.apache.org/jira/browse/MAPREDUCE-210 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Doug Cutting Assignee: indrajit Attachments: ZipInputFormat_fixed.patch HDFS is inefficient with large numbers of small files. Thus one might pack many small files into large, compressed archives. But, for efficient map-reduce operation, it is desirable to be able to split inputs into smaller chunks, with one or more small original files per split. The zip format, unlike tar, permits enumeration of files in the archive without scanning the entire archive. Thus a zip InputFormat could efficiently permit splitting large archives into splits that contain one or more archived files. -- This message was sent by Atlassian JIRA (v6.2#6252)
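The key property in the description is that a zip's table of contents can be read without scanning the whole archive, thanks to the zip central directory. The following self-contained sketch (not part of the attached patch) builds a small archive in memory and enumerates its entries with java.util.zip. Note that the ZipInputStream used here walks entries sequentially; on a real file, java.util.zip.ZipFile reads only the central directory, which is what would make per-entry split computation cheap.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEnumeration {
    // Build a small in-memory zip containing several "small files".
    public static byte[] buildArchive(String... names) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(("contents of " + name).getBytes("UTF-8"));
                zos.closeEntry();
            }
        }
        return buf.toByteArray();
    }

    // Enumerate entry names; each name could become (part of) an input split.
    public static List<String> listEntries(byte[] archive) throws Exception {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry e;
            while ((e = zis.getNextEntry()) != null) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        byte[] zip = buildArchive("a.txt", "b.txt", "c.txt");
        System.out.println(listEntries(zip)); // [a.txt, b.txt, c.txt]
    }
}
```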
[jira] [Created] (MAPREDUCE-5975) Fix native-task build on Ubuntu 13.10
Todd Lipcon created MAPREDUCE-5975: -- Summary: Fix native-task build on Ubuntu 13.10 Key: MAPREDUCE-5975 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5975 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: task Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker I'm having some issues building the native-task branch on my Ubuntu 13.10 box. This JIRA is to figure out and fix whatever's going on. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAPREDUCE-5976) native-task should not fail to build if snappy is missing
Todd Lipcon created MAPREDUCE-5976: -- Summary: native-task should not fail to build if snappy is missing Key: MAPREDUCE-5976 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5976 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: task Reporter: Todd Lipcon Assignee: Sean Zhong Other native parts of Hadoop will automatically disable snappy support if snappy is not present and -Drequire.snappy is not passed. native-task should do the same. (right now, it fails to build if snappy is missing) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-428) Reducers reported completion % is generally incorrect when consuming compressed map outputs
[ https://issues.apache.org/jira/browse/MAPREDUCE-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-428. Resolution: Duplicate Reducers reported completion % is generally incorrect when consuming compressed map outputs --- Key: MAPREDUCE-428 URL: https://issues.apache.org/jira/browse/MAPREDUCE-428 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Riccardo Boscolo Priority: Minor When processing compressed map outputs, reducers often report over 100% completion (up to 220%). This is regardless of the compression codec and of whether native compression is used or not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-209) Support for metrics aggregation module in JobTracker
[ https://issues.apache.org/jira/browse/MAPREDUCE-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-209. Resolution: Fixed Closing this as stale. Support for metrics aggregation module in JobTracker Key: MAPREDUCE-209 URL: https://issues.apache.org/jira/browse/MAPREDUCE-209 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Senthil Subramanian JobTracker should support starting up and shutting down a generic metrics aggregation module. We are currently thinking about plugging in a module that gets time series data from the task trackers, aggregates it, and logs this data into a global DFS so that it can be analysed later (even after the map reduce cluster is shut down). Some of this data can also be plotted on the JobTracker UI in realtime. This is particularly useful for analyzing data from dynamic mapreduce clusters like the ones deployed using HOD. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAPREDUCE-5977) Fix or suppress native-task gcc warnings
Todd Lipcon created MAPREDUCE-5977: -- Summary: Fix or suppress native-task gcc warnings Key: MAPREDUCE-5977 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5977 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: task Reporter: Todd Lipcon Assignee: Todd Lipcon Currently, building the native task code on gcc 4.8 has a fair number of warnings. We should fix or suppress them so that new warnings are easier to see. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5975) Fix native-task build on Ubuntu 13.10
[ https://issues.apache.org/jira/browse/MAPREDUCE-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated MAPREDUCE-5975: --- Attachment: mr-5975.txt Trivial patch to add some missing unistd.h includes which were necessary to build on my box. Fix native-task build on Ubuntu 13.10 - Key: MAPREDUCE-5975 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5975 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: task Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: mr-5975.txt I'm having some issues building the native-task branch on my Ubuntu 13.10 box. This JIRA is to figure out and fix whatever's going on. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5975) Fix native-task build on Ubuntu 13.10
[ https://issues.apache.org/jira/browse/MAPREDUCE-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065248#comment-14065248 ] Todd Lipcon commented on MAPREDUCE-5975: [~clockfly] and [~decster] -- since you guys are branch committers, you can review (and +1) patches targeted towards the native task feature branch. Mind taking a look at this one (and the others that I am filing?) Fix native-task build on Ubuntu 13.10 - Key: MAPREDUCE-5975 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5975 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: task Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: mr-5975.txt I'm having some issues building the native-task branch on my Ubuntu 13.10 box. This JIRA is to figure out and fix whatever's going on. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-2833) Job Tracker needs to collect more job/task execution stats and save them to DFS file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-2833. - Resolution: Fixed I'm closing this as fixed since the history files pretty much cover this request. Job Tracker needs to collect more job/task execution stats and save them to DFS file Key: MAPREDUCE-2833 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2833 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Runping Qi Labels: newbie In order to facilitate offline analysis on the dynamic behaviors and performance characteristics of map/reduce jobs, we need the job tracker to collect some data about jobs and save them to DFS files. Some data are in time series form, and some are not. Below is a preliminary list of desired data. Some of them are already available in the current job trackers. Some are new. For each map/reduce job, we need the following non time series data: 1. jobid, jobname, number of mappers, number of reducers, start time, end time, end of mapper phase 2. Average (median, min, max) of successful mapper execution time, input/output records/bytes 3. Average (median, min, max) of unsuccessful mapper execution time, input/output records/bytes 4. Total mapper retries, max, average number of re-tries per mapper 5. The reasons for mapper task failures. 6. Average (median, min, max) of successful reducer execution time, input/output records/bytes Execution time is the difference between the sort end time and the task end time 7. Average (median, min, max) of successful copy time (from the mapper phase end time to the sort start time). 8. Average (median, min, max) of successful sorting time for successful reducers 9. Average (median, min, max) of unsuccessful reducer execution time (from the end of mapper phase or the start of the task, whichever is later, to the end of the task) 10. Total reducer retries, max, average number of per reducer retries 11. 
The reasons for reducer task failures (user code error, lost tracker, failed to write to DFS, etc.) For each map/reduce job, we collect the following time series data (with one minute interval): 1. Numbers of pending mappers, reducers 2. Number of running mappers, reducers For the job tracker, we need the following data: 1. Number of trackers 2. Start time 3. End time 4. The list of map reduce jobs (their ids, start time/end time) The following time series data (with one minute interval): 1. The number of running jobs 2. The numbers of running mappers/reducers 3. The number of pending mappers/reducers The data collection should be optional. That is, a job tracker can turn off such data collection, and in that case, it should not pay the cost. The job tracker should organize the in-memory version of the collected data in such a way that: 1. it does not consume an excessive amount of memory 2. the data may be suitable for presenting through the Web status pages. The data saved in DFS files should be in hadoop record format. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-230) Need to document the controls for sorting and grouping into the reduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-230. Resolution: Won't Fix This ship sailed a looong time ago. Need to document the controls for sorting and grouping into the reduce -- Key: MAPREDUCE-230 URL: https://issues.apache.org/jira/browse/MAPREDUCE-230 Project: Hadoop Map/Reduce Issue Type: Task Reporter: Owen O'Malley Assignee: Arun C Murthy The JavaDoc for the Reducer should document how to control the sort order of keys and values via the JobConf methods: {code} setOutputKeyComparatorClass setOutputValueGroupingComparator {code} Both methods desperately need better names. (I'd vote for setKeySortingComparator and setKeyGroupingComparator.) -- This message was sent by Atlassian JIRA (v6.2#6252)
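The distinction the JavaDoc should document can be shown without Hadoop: one comparator defines the total sort order of keys, while a coarser comparator decides which adjacent keys are fed to the same reduce() call. This stand-alone sketch mimics that with composite "natural#secondary" string keys; SortVsGroup and reduceGroups are hypothetical names for illustration, not the Hadoop API, whose real control points are the JobConf methods named above.

```java
import java.util.*;

public class SortVsGroup {
    // SORT plays the role of setOutputKeyComparatorClass: total order
    // over the full composite key (natural part, then secondary part).
    static final Comparator<String> SORT = Comparator.naturalOrder();

    // GROUP plays the role of setOutputValueGroupingComparator: it only
    // looks at the natural part, so keys differing in the secondary part
    // still land in the same reduce group.
    static final Comparator<String> GROUP =
            Comparator.comparing(k -> k.split("#")[0]);

    public static Map<String, List<String>> reduceGroups(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        // Walk the sorted keys, starting a new group whenever GROUP says
        // the key differs from the previous one, as the shuffle does.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String prev = null;
        for (String k : sorted) {
            String natural = k.split("#")[0];
            if (prev == null || GROUP.compare(prev, k) != 0) {
                groups.put(natural, new ArrayList<>());
            }
            groups.get(natural).add(k);
            prev = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(reduceGroups(Arrays.asList("b#2", "a#2", "a#1")));
        // {a=[a#1, a#2], b=[b#2]}
    }
}
```

The payoff of the split is the classic secondary-sort pattern: each group's values arrive in the order imposed by the sort comparator, here a#1 before a#2.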
[jira] [Commented] (MAPREDUCE-272) Job tracker should report the number of splits that are local to some task trackers
[ https://issues.apache.org/jira/browse/MAPREDUCE-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065261#comment-14065261 ] Allen Wittenauer commented on MAPREDUCE-272: Don't we essentially have this information logged now? Or am I missing something here? Job tracker should report the number of splits that are local to some task trackers --- Key: MAPREDUCE-272 URL: https://issues.apache.org/jira/browse/MAPREDUCE-272 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Runping Qi Assignee: Runping Qi Attachments: hadoop-2015.txt Right now, the job tracker keeps track of the number of launched mappers with local data. However, it is not clear how many mappers could potentially be launched with data locality. This information is readily available in the Job Tracker. It is just a matter of creating a separate global counter and setting it at Job Tracker initialization time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-129) job_null_0001 in jobid
[ https://issues.apache.org/jira/browse/MAPREDUCE-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-129. Resolution: Fixed job_null_0001 in jobid -- Key: MAPREDUCE-129 URL: https://issues.apache.org/jira/browse/MAPREDUCE-129 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Koji Noguchi Assignee: Owen O'Malley When I submit a job before jobtracker is fully up, I occasionally get jobid of job_null_0001. [knoguchi ]$ hadoop -jar ... 07/10/12 00:15:07 INFO mapred.FileInputFormat: Total input paths to process : 4 07/10/12 00:15:08 INFO mapred.JobClient: Running job: *job_null_0001* 07/10/12 00:15:09 INFO mapred.JobClient: map 0% reduce 0% -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5974) Allow map output collector fallback
[ https://issues.apache.org/jira/browse/MAPREDUCE-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065268#comment-14065268 ] Hadoop QA commented on MAPREDUCE-5974: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656299/mapreduce-5974.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4749//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4749//console This message is automatically generated. 
Allow map output collector fallback --- Key: MAPREDUCE-5974 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5974 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: task Affects Versions: 2.6.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: mapreduce-5974.txt Currently we only allow specifying a single MapOutputCollector implementation class in a job. It would be nice to allow a comma-separated list of classes: we should try each collector implementation in the user-specified order until we find one that can be successfully instantiated and initted. This is useful for cases where a particular optimized collector implementation cannot operate on all key/value types, or requires native code. The cluster administrator can configure the cluster to try to use the optimized collector and fall back to the default collector. -- This message was sent by Atlassian JIRA (v6.2#6252)
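The try-in-order behavior the issue asks for can be sketched with plain reflection: walk the comma-separated class list and keep the first class that instantiates. This is illustrative only, not the attached patch (the class and method names are invented):

```java
/**
 * Sketch of the fallback idea: try each configured collector class in the
 * user-specified order and return the first one that can be instantiated.
 */
public class CollectorFallback {
    static Object firstUsable(String commaSeparatedClasses) {
        for (String name : commaSeparatedClasses.split(",")) {
            try {
                return Class.forName(name.trim()).getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                // Instantiation failed (class missing, native code unavailable, ...):
                // fall through and try the next candidate in the list.
            }
        }
        throw new IllegalStateException("no usable collector in: " + commaSeparatedClasses);
    }
}
```

In the real feature the init() step would also have to succeed (e.g. the optimized collector may reject unsupported key/value types), so the loop would catch init failures too, not just construction failures.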
[jira] [Resolved] (MAPREDUCE-492) Pending, running, completed tasks should also be shown as percentage
[ https://issues.apache.org/jira/browse/MAPREDUCE-492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-492. Resolution: Fixed A different percentage was committed eons ago. Pending, running, completed tasks should also be shown as percentage Key: MAPREDUCE-492 URL: https://issues.apache.org/jira/browse/MAPREDUCE-492 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Amar Kamat Assignee: Amar Kamat Priority: Minor Attachments: HADOOP-2099.patch, percent.png -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MAPREDUCE-5962) Support CRC32C in IFile
[ https://issues.apache.org/jira/browse/MAPREDUCE-5962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned MAPREDUCE-5962: -- Assignee: Todd Lipcon (was: James Thomas) Support CRC32C in IFile --- Key: MAPREDUCE-5962 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5962 Project: Hadoop Map/Reduce Issue Type: Improvement Components: performance, task Affects Versions: 2.5.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Currently, the IFile format used by the MR shuffle checksums all data using the zlib CRC32 polynomial. If we allow use of CRC32C instead, we can get a large reduction in CPU usage by leveraging the native hardware CRC32C implementation (approx half a second of CPU time savings per GB checksummed). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-2830) Document config parameters for each Map-Reduce class/interface
[ https://issues.apache.org/jira/browse/MAPREDUCE-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065280#comment-14065280 ] Allen Wittenauer commented on MAPREDUCE-2830: - Ping! Document config parameters for each Map-Reduce class/interface -- Key: MAPREDUCE-2830 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2830 Project: Hadoop Map/Reduce Issue Type: Improvement Components: documentation Reporter: Arun C Murthy Labels: newbie I propose we add a table in the javadoc for each user-facing Map-Reduce interface/class which lists, and provides details, of each and every config parameter which has any bearing on that interface/class. Clearly some parameters affect more than one place and they should be put in more than one table. For e.g. Mapper - io.sort.mb, io.sort.factor Reducer - fs.inmemory.size.mb ... etc. It would very nice to explain how it interacts with the framework and rest of config params etc. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5962) Support CRC32C in IFile
[ https://issues.apache.org/jira/browse/MAPREDUCE-5962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated MAPREDUCE-5962: --- Attachment: mapreduce-5962.txt Attached patch adds a new configuration to set the IFile checksum type. I changed the default to CRC32C since it's much faster if you have the native libraries available. I don't believe this is an incompatible change, since IFiles are only used internal to a single job (written by map, read by reduce). So, one would never have a different version reader compared to writer. That said, if anyone has any issues with this, they can configure the default back to CRC32 cluster-wide. Support CRC32C in IFile --- Key: MAPREDUCE-5962 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5962 Project: Hadoop Map/Reduce Issue Type: Improvement Components: performance, task Affects Versions: 2.5.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: mapreduce-5962.txt Currently, the IFile format used by the MR shuffle checksums all data using the zlib CRC32 polynomial. If we allow use of CRC32C instead, we can get a large reduction in CPU usage by leveraging the native hardware CRC32C implementation (approx half a second of CPU time savings per GB checksummed). -- This message was sent by Atlassian JIRA (v6.2#6252)
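The CPU argument rests on CRC32 and CRC32C being different polynomials, with the Castagnoli polynomial having a hardware instruction on modern x86. On JDK 9+ both live in java.util.zip, which makes the difference easy to see against the standard "123456789" check values (at the time of this issue Hadoop shipped its own pure-Java and native CRC32C implementations instead; this sketch only illustrates that the two checksums differ):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;

public class ChecksumDemo {
    static long crc32(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    static long crc32c(byte[] data) {
        CRC32C c = new CRC32C();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] check = "123456789".getBytes(StandardCharsets.US_ASCII);
        // Standard catalog check values: CRC-32 -> CBF43926, CRC-32C -> E3069283
        System.out.printf("%08X %08X%n", crc32(check), crc32c(check));
    }
}
```

Because the two values are incompatible, an IFile written with one polynomial cannot be verified with the other — which is why the patch discussion stresses that IFiles never outlive a single job.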
[jira] [Commented] (MAPREDUCE-5910) MRAppMaster should handle Resync from RM instead of shutting down.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065293#comment-14065293 ] Hadoop QA commented on MAPREDUCE-5910: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656302/MAPREDUCE-5910.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4748//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4748//console This message is automatically generated. MRAppMaster should handle Resync from RM instead of shutting down. -- Key: MAPREDUCE-5910 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5910 Project: Hadoop Map/Reduce Issue Type: Task Components: applicationmaster Reporter: Rohith Assignee: Rohith Attachments: MAPREDUCE-5910.1.patch, MAPREDUCE-5910.2.patch, MAPREDUCE-5910.3.patch, MAPREDUCE-5910.4.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. 
The MRAppMaster behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
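The resync contract spelled out above — reset the allocate sequence number to 0, re-send the full outstanding request, tolerate duplicate completion reports — can be condensed into a small sketch. All names here are invented for illustration; this is not the MRAppMaster code:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative model of the AM side of the resync protocol described above. */
class ResyncingAllocator {
    int responseId = 0;                               // allocate RPC sequence number
    final Set<String> outstandingAsks = new LinkedHashSet<>();
    final Set<String> seenCompletions = new LinkedHashSet<>();

    /** On resync: restart the sequence and re-send every outstanding ask. */
    List<String> onResync() {
        responseId = 0;
        return List.copyOf(outstandingAsks);
    }

    /**
     * The RM may report a completion more than once after a resync;
     * returns true only the first time a container id is seen.
     */
    boolean onContainerCompleted(String containerId) {
        return seenCompletions.add(containerId);
    }
}
```

The de-duplication step is what lets the AM treat repeated completion reports as idempotent, which the description calls out explicitly.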
[jira] [Updated] (MAPREDUCE-5962) Support CRC32C in IFile
[ https://issues.apache.org/jira/browse/MAPREDUCE-5962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated MAPREDUCE-5962: --- Status: Patch Available (was: Open) Support CRC32C in IFile --- Key: MAPREDUCE-5962 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5962 Project: Hadoop Map/Reduce Issue Type: Improvement Components: performance, task Affects Versions: 2.5.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: mapreduce-5962.txt Currently, the IFile format used by the MR shuffle checksums all data using the zlib CRC32 polynomial. If we allow use of CRC32C instead, we can get a large reduction in CPU usage by leveraging the native hardware CRC32C implementation (approx half a second of CPU time savings per GB checksummed). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5962) Support CRC32C in IFile
[ https://issues.apache.org/jira/browse/MAPREDUCE-5962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065291#comment-14065291 ] Todd Lipcon commented on MAPREDUCE-5962: (fwiw this depends on [~james.thomas]'s work to enable native checksumming on byte arrays. So we won't see an immediate benefit, but will once that patch is done) Support CRC32C in IFile --- Key: MAPREDUCE-5962 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5962 Project: Hadoop Map/Reduce Issue Type: Improvement Components: performance, task Affects Versions: 2.5.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: mapreduce-5962.txt Currently, the IFile format used by the MR shuffle checksums all data using the zlib CRC32 polynomial. If we allow use of CRC32C instead, we can get a large reduction in CPU usage by leveraging the native hardware CRC32C implementation (approx half a second of CPU time savings per GB checksummed). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-108) Blacklisted hosts may not be able to serve map outputs
[ https://issues.apache.org/jira/browse/MAPREDUCE-108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-108. Resolution: Incomplete I'm going to close this out as stale. Blacklisting has undergone quite a few changes in the past 6-7 years. Blacklisted hosts may not be able to serve map outputs -- Key: MAPREDUCE-108 URL: https://issues.apache.org/jira/browse/MAPREDUCE-108 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Runping Qi Assignee: Amar Kamat Attachments: HADOOP-2175-v1.1.patch, HADOOP-2175-v1.patch, HADOOP-2175-v2.patch, HADOOP-2175-v2.patch After a node fails 4 mappers (tasks), it is added to the blacklist and thus will no longer accept tasks. But it will continue to serve the map outputs of any mappers that ran successfully there. However, the node may not be able to serve the map outputs either. This will cause the reducers to mark the corresponding map outputs as from slow hosts, but continue to try to get the map outputs from that node. This may lead to waiting forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-155) Each task tracker should not execute more than one speculative task
[ https://issues.apache.org/jira/browse/MAPREDUCE-155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-155. Resolution: Incomplete Closing this as stale since spec exec has undergone quite a bit of overhauling which I'm confident included fixing this... at least, I've never seen a job get multiple versions of the same task assigned to the same node in practice. Each task tracker should not execute more than one speculative task --- Key: MAPREDUCE-155 URL: https://issues.apache.org/jira/browse/MAPREDUCE-155 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Runping Qi I noticed that sometimes a tasktracker started two or three speculative mapper tasks. That seems counterproductive. You want speculative execution to complete as soon as possible. Thus, it is better to spread speculative execution over multiple trackers. A simple way to achieve that is to limit the number of speculative executions running concurrently on each tracker. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-636) Inconsistency in config parameter for RandomWriter
[ https://issues.apache.org/jira/browse/MAPREDUCE-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated MAPREDUCE-636: --- Labels: newbie (was: ) Inconsistency in config parameter for RandomWriter -- Key: MAPREDUCE-636 URL: https://issues.apache.org/jira/browse/MAPREDUCE-636 Project: Hadoop Map/Reduce Issue Type: Bug Components: examples Reporter: Arun C Murthy Assignee: Arun C Murthy Priority: Minor Labels: newbie All configuration parameters for RandomWriter start with the prefix {{test.randomwrite}}, except for {{test.randomwriter.maps_per_host}}. Minor inconsistency to fix. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-1769) Streaming command should be able to take its output to a file, rather then to stdout
[ https://issues.apache.org/jira/browse/MAPREDUCE-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-1769. - Resolution: Won't Fix Same reasoning as its buddy jira. Streaming command should be able to take its output to a file, rather then to stdout -- Key: MAPREDUCE-1769 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1769 Project: Hadoop Map/Reduce Issue Type: Improvement Components: contrib/streaming Reporter: arkady borkovsky In some cases, especially when a streaming command is a 3rd party or legacy application, it is impossible or inconvenient to make it write its output to stdout. The command may require that the output file name is specified as a command line option, or the output file name is hard-coded. Related to https://issues.apache.org/jira/browse/HADOOP-2235 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5910) MRAppMaster should handle Resync from RM instead of shutting down.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065309#comment-14065309 ] Jian He commented on MAPREDUCE-5910: committing this. MRAppMaster should handle Resync from RM instead of shutting down. -- Key: MAPREDUCE-5910 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5910 Project: Hadoop Map/Reduce Issue Type: Task Components: applicationmaster Reporter: Rohith Assignee: Rohith Attachments: MAPREDUCE-5910.1.patch, MAPREDUCE-5910.2.patch, MAPREDUCE-5910.3.patch, MAPREDUCE-5910.4.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The MRAppMaster behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-1767) Streaming infrastructure should provide statistics about job
[ https://issues.apache.org/jira/browse/MAPREDUCE-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-1767. - Resolution: Fixed All of this information is available via other means. Streaming infrastructure should provide statistics about job --- Key: MAPREDUCE-1767 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1767 Project: Hadoop Map/Reduce Issue Type: Improvement Components: contrib/streaming Reporter: arkady borkovsky This should include -- the commands (mapper and reducer commands) executed -- time information (e.g. min, max, and avg start time, end time, elapsed time for tasks, total elapsed time) -- sizes -- bytes and records, min, max, avg per task and total, input and output -- information about input and output data sets (all output data sets, if there are several) -- all user counters (when they are implemented for streaming) The information should be stored in a file -- e.g. in the working directory from where the job was launched, with a name derived from the job name -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-611) Streaming infrastructure should report information about runtime errors
[ https://issues.apache.org/jira/browse/MAPREDUCE-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-611. Resolution: Won't Fix The stderr log was built to handle this type of thing. Closing as won't fix. Streaming infrastructure should report information about runtime errors Key: MAPREDUCE-611 URL: https://issues.apache.org/jira/browse/MAPREDUCE-611 Project: Hadoop Map/Reduce Issue Type: Bug Components: contrib/streaming Reporter: arkady borkovsky For example, if the streaming command is a Perl script and a syntax error or a runtime error occurs during script execution, the error message (the stack trace) should be reported to the user, separate from and in addition to the rest of the logs and the stderr output. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-2829) TestMiniMRMapRedDebugScript times out
[ https://issues.apache.org/jira/browse/MAPREDUCE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-2829. - Resolution: Fixed Yup. Definitely. Closing. TestMiniMRMapRedDebugScript times out - Key: MAPREDUCE-2829 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2829 Project: Hadoop Map/Reduce Issue Type: Bug Environment: Linux Reporter: Konstantin Shvachko Attachments: Hadoop-2260.log, testrun-2260.log I am running TestMiniMRMapRedDebugScript from trunc. This is what I see in the stdout: {code} 2007-11-22 02:21:23,494 WARN conf.Configuration (Configuration.java:loadResource(808)) - hadoop/build/test/mapred/local/1_0/taskTracker/jobcache/job_200711220217_0001/task_200711220217_0001_m_00_0/job.xml:a attempt to override final parameter: hadoop.tmp.dir; Ignoring. 2007-11-22 02:21:28,940 INFO jvm.JvmMetrics (JvmMetrics.java:init(56)) - Initializing JVM Metrics with processName=MAP, sessionId= 2007-11-22 02:22:09,504 INFO mapred.MapTask (MapTask.java:run(127)) - numReduceTasks: 0 2007-11-22 02:22:42,434 WARN mapred.TaskTracker (TaskTracker.java:main(1982)) - Error running child java.io.IOException at org.apache.hadoop.mapred.TestMiniMRMapRedDebugScript$MapClass.map(TestMiniMRMapRedDebugScript.java:41) at org.apache.hadoop.mapred.TestMiniMRMapRedDebugScript$MapClass.map(TestMiniMRMapRedDebugScript.java:35) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1977) {code} Stderr and debugout both say: Bailing out. BTW on Windows everything works just fine. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-229) Provide a command line option to check if a Hadoop jobtracker is idle
[ https://issues.apache.org/jira/browse/MAPREDUCE-229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-229. Resolution: Fixed I'm going to close this as fixed. Clients can ask the JT if they are currently busy. By doing this periodically, they can build an idle time. Provide a command line option to check if a Hadoop jobtracker is idle - Key: MAPREDUCE-229 URL: https://issues.apache.org/jira/browse/MAPREDUCE-229 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Hemanth Yamijala This is an RFE for providing a way to determine from the hadoop command line whether a jobtracker is idle. One possibility is to have something like hadoop jobtracker -idle time. Hadoop would return true (maybe via some stdout output) if the jobtracker had no work to do (jobs running / prepared) since time seconds, false otherwise. This would be useful for management / provisioning systems like Hadoop-On-Demand [HADOOP-1301], which can then deallocate the idle, provisioned clusters automatically, and release resources. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-620) Streaming: support local execution
[ https://issues.apache.org/jira/browse/MAPREDUCE-620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-620. Resolution: Won't Fix Unix already provides this functionality. Streaming: support local execution -- Key: MAPREDUCE-620 URL: https://issues.apache.org/jira/browse/MAPREDUCE-620 Project: Hadoop Map/Reduce Issue Type: Improvement Components: contrib/streaming Reporter: arkady borkovsky For streaming, local execution does not involve hadoop. It is just hdfs -cat input | mapper-command | sort | reducer command. While a user can do this herself, having an option to do this through the infrastructure would greatly simplify user scripts and make it easier to ensure that the process will run on the cluster as expected. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-598) Streaming: better control over input splits
[ https://issues.apache.org/jira/browse/MAPREDUCE-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-598. Resolution: Fixed This has been fixed in a variety of use cases. Closing. Streaming: better control over input splits -- Key: MAPREDUCE-598 URL: https://issues.apache.org/jira/browse/MAPREDUCE-598 Project: Hadoop Map/Reduce Issue Type: Improvement Components: contrib/streaming Reporter: arkady borkovsky In streaming, the map command usually expects to receive its input uninterpreted -- just as it is stored in DFS. However, the split (the beginning and the end of the portion of data that goes to a single map task) is often important and is not just any line break. Often the input consists of multi-line documents -- e.g. in XML. There should be a way to specify a pattern that separates logical records. The existing Streaming XML record reader kind of provides this functionality. However, it is accepted that Streaming XML is a hack and needs to be replaced -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5910) MRAppMaster should handle Resync from RM instead of shutting down.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065361#comment-14065361 ] Hudson commented on MAPREDUCE-5910: --- FAILURE: Integrated in Hadoop-trunk-Commit #5902 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5902/]) MAPREDUCE-5910. Make MR AM resync with RM in case of work-preserving RM-restart. Contributed by Rohith (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611434) * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/local/LocalContainerAllocator.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/rm/TestRMContainerAllocator.java MRAppMaster should handle Resync from RM instead of shutting down. -- Key: MAPREDUCE-5910 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5910 Project: Hadoop Map/Reduce Issue Type: Task Components: applicationmaster Reporter: Rohith Assignee: Rohith Attachments: MAPREDUCE-5910.1.patch, MAPREDUCE-5910.2.patch, MAPREDUCE-5910.3.patch, MAPREDUCE-5910.4.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. 
The MRAppMaster behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-618) Need to be able to re-run specific map tasks (when -reducer NONE)
[ https://issues.apache.org/jira/browse/MAPREDUCE-618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065373#comment-14065373 ] Allen Wittenauer commented on MAPREDUCE-618: I'm tempted to close this because retry logic is way better now. However... isn't this essentially the same request as the various (real) preemption jiras? Need to be able to re-run specific map tasks (when -reducer NONE) - Key: MAPREDUCE-618 URL: https://issues.apache.org/jira/browse/MAPREDUCE-618 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: arkady borkovsky Sometimes a few map tasks fail when running with -reducer NONE. It should be possible to rerun just the failed map tasks. There are several failure modes: * a task is hanging, so the job is killed * from the infrastructure perspective, the task has completed successfully, but it failed to produce a correct result * the task failed in the proper Hadoop sense It is often too expensive to rerun the whole job. And for larger jobs, chances are each run will have a few failed tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-386) The combiner in pipes is closed before the last values are passed in.
[ https://issues.apache.org/jira/browse/MAPREDUCE-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-386. Resolution: Fixed Stale issue. I'm sure this has been fixed by now! The combiner in pipes is closed before the last values are passed in. - Key: MAPREDUCE-386 URL: https://issues.apache.org/jira/browse/MAPREDUCE-386 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Currently the last spill is sent to the combiner after the close method is called. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-594) Streaming: org.apache.hadoop.mapred.lib.IdentityMapper should not insert unnecessary keys
[ https://issues.apache.org/jira/browse/MAPREDUCE-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-594. Resolution: Fixed Closing this as fixed for a variety of reasons (alternative provided, you can now provide your own input format, etc, etc) Streaming: org.apache.hadoop.mapred.lib.IdentityMapper should not insert unnecessary keys --- Key: MAPREDUCE-594 URL: https://issues.apache.org/jira/browse/MAPREDUCE-594 Project: Hadoop Map/Reduce Issue Type: Bug Components: contrib/streaming Reporter: arkady borkovsky When a streaming command specifies -mapper org.apache.hadoop.mapred.lib.IdentityMapper the reducer should receive exactly the same text lines as were present in the input. The only modification is the reordering of the input. Currently, org.apache.hadoop.mapred.lib.IdentityMapper inserts offsets into the input as keys, which renders it useless. Moreover, in the latest release org.apache.hadoop.mapred.lib.IdentityMapper just crashes: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:331) at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:40) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760) (I open only one bug, as it is broken anyway, the new behavior does not actually make it any worse than before) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-455) Hadoop needs a better XML Input
[ https://issues.apache.org/jira/browse/MAPREDUCE-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065392#comment-14065392 ] Allen Wittenauer commented on MAPREDUCE-455: Is it time to revisit this? Hadoop needs a better XML Input --- Key: MAPREDUCE-455 URL: https://issues.apache.org/jira/browse/MAPREDUCE-455 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Alan Ho Priority: Minor Attachments: HADOOP-2439Patch.patch Hadoop does not have a good XML parser for XML input. The XML parser in the streaming class is fairly difficult to work with and doesn't have proper test cases around it. -- This message was sent by Atlassian JIRA (v6.2#6252)
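The core of a streaming XML reader like the one requested above is just scanning for matching start/end tags across chunk boundaries. A naive sketch (the function name is hypothetical; it assumes records do not nest):

```python
def xml_records(chunks, start_tag, end_tag):
    """Yield each complete start_tag...end_tag record from an iterable
    of text chunks. A toy sketch of the XmlInputFormat idea; it assumes
    records do not nest and that start_tag appears literally."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            s = buf.find(start_tag)
            if s < 0:
                # Keep a tail in case a start tag is split across chunks.
                buf = buf[-(len(start_tag) - 1):] if len(start_tag) > 1 else ""
                break
            e = buf.find(end_tag, s)
            if e < 0:
                buf = buf[s:]  # incomplete record; wait for more input
                break
            yield buf[s:e + len(end_tag)]
            buf = buf[e + len(end_tag):]
```

A real record reader would additionally respect HDFS split boundaries; this only shows the tag-scanning part.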
[jira] [Resolved] (MAPREDUCE-215) Improve facilities for job-control, job-queues etc.
[ https://issues.apache.org/jira/browse/MAPREDUCE-215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-215. Resolution: Fixed I'm going to close as fixed for a variety of reasons: - some of these features are now native to Hadoop - some of these features are now part of Oozie, Azkaban, etc - some of these features are part of Tez Improve facilities for job-control, job-queues etc. --- Key: MAPREDUCE-215 URL: https://issues.apache.org/jira/browse/MAPREDUCE-215 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Arun C Murthy Today, Map-Reduce has _some_ support for job-control - basically JobClient provides a facility to monitor jobs, one can setup a job-ending notification and there is {{JobControl}}. Links: http://lucene.apache.org/hadoop/docs/r0.15.1/mapred_tutorial.html#Job+Control http://lucene.apache.org/hadoop/docs/r0.15.1/mapred_tutorial.html#JobControl Looks like users could do more with better facilities for job-control and maybe more advanced features like job-queues etc. Lets discuss... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-265) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/MAPREDUCE-265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-265. Resolution: Fixed I believe this has been fixed. check permissions for job inputs and outputs Key: MAPREDUCE-265 URL: https://issues.apache.org/jira/browse/MAPREDUCE-265 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Doug Cutting Attachments: HADOOP-2528-0.patch, HADOOP-2528-1.patch On job submission, filesystem permissions should be checked to ensure that the input directory is readable and that the output directory is writable. -- This message was sent by Atlassian JIRA (v6.2#6252)
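The shape of the submission-time check described above is simple: fail fast if the input is unreadable or the output location unwritable. A local-filesystem sketch (the helper name is illustrative; the real check belongs in the job client against HDFS permissions):

```python
import os

def check_job_paths(input_dir, output_parent):
    """Sketch of a pre-submission permission check: verify the input
    directory is readable and the output location writable before any
    tasks run. Hypothetical helper, not the Hadoop JobClient code."""
    if not os.access(input_dir, os.R_OK):
        raise PermissionError("input directory not readable: %s" % input_dir)
    if not os.access(output_parent, os.W_OK):
        raise PermissionError("output location not writable: %s" % output_parent)
```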
[jira] [Resolved] (MAPREDUCE-186) TaskLogServlet returns 410 when trying to access log early in task life
[ https://issues.apache.org/jira/browse/MAPREDUCE-186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-186. Resolution: Incomplete I'm going to close this out as stale. TaskLogServlet returns 410 when trying to access log early in task life --- Key: MAPREDUCE-186 URL: https://issues.apache.org/jira/browse/MAPREDUCE-186 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Michael Bieniosek Attachments: hadoop-2572.patch Early in a map task life, or for tasks that died quickly, the file $task/syslog might not exist. In this case, the TaskLogServlet gives a status 410. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-224) limit running tasks per job
[ https://issues.apache.org/jira/browse/MAPREDUCE-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-224. Resolution: Duplicate I'm going to close this out as a duplicate of MAPREDUCE-5583. limit running tasks per job --- Key: MAPREDUCE-224 URL: https://issues.apache.org/jira/browse/MAPREDUCE-224 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Doug Cutting It should be possible to specify a limit to the number of tasks per job permitted to run simultaneously. If, for example, you have a cluster of 50 nodes, with 100 map task slots and 100 reduce task slots, and the configured limit is 25 simultaneous tasks/job, then four or more jobs will be able to run at a time. This will permit short jobs to pass longer-running jobs. This also avoids some problems we've seen with HOD, where nodes are underutilized in their tail, and it should permit improved input locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
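The example's arithmetic (100 slots, a 25-task/job cap, hence four or more concurrent jobs) is just integer division; a one-line sketch:

```python
def min_concurrent_jobs(task_slots, per_job_limit):
    """Minimum number of jobs that can hold slots at once when each job
    is capped at per_job_limit simultaneous tasks -- the arithmetic
    behind the 100-slot / 25-task example above. Illustrative only."""
    return task_slots // per_job_limit
```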
[jira] [Resolved] (MAPREDUCE-574) Fix -file option in Streaming to use Distributed Cache
[ https://issues.apache.org/jira/browse/MAPREDUCE-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-574. Resolution: Fixed Stale. Fix -file option in Streaming to use Distributed Cache -- Key: MAPREDUCE-574 URL: https://issues.apache.org/jira/browse/MAPREDUCE-574 Project: Hadoop Map/Reduce Issue Type: Bug Components: distributed-cache Reporter: Amareshwari Sriramadasu Attachments: patch-2622.txt The -file option works by putting the script into the job's jar file by unjar-ing, copying and then jar-ing it again. We should rework the -file option to use the DistributedCache and the symlink option it provides. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-165) the map task output servlet doesn't protect against .. attacks
[ https://issues.apache.org/jira/browse/MAPREDUCE-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065441#comment-14065441 ] Allen Wittenauer commented on MAPREDUCE-165: I suspect this issue can be closed, but it would be good to have some verification. the map task output servlet doesn't protect against .. attacks Key: MAPREDUCE-165 URL: https://issues.apache.org/jira/browse/MAPREDUCE-165 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Owen O'Malley Labels: security The servlet we use to export the map outputs doesn't protect itself against .. attacks. However, because the code adds a /file.out.index and /file.out to it, it can only be used to read files with those names. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-165) the map task output servlet doesn't protect against .. attacks
[ https://issues.apache.org/jira/browse/MAPREDUCE-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated MAPREDUCE-165: --- Labels: security (was: ) the map task output servlet doesn't protect against .. attacks Key: MAPREDUCE-165 URL: https://issues.apache.org/jira/browse/MAPREDUCE-165 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Owen O'Malley Labels: security The servlet we use to export the map outputs doesn't protect itself against .. attacks. However, because the code adds a /file.out.index and /file.out to it, it can only be used to read files with those names. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-302) Maintaining cluster information across multiple job submissions
[ https://issues.apache.org/jira/browse/MAPREDUCE-302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065445#comment-14065445 ] Allen Wittenauer commented on MAPREDUCE-302: I think YARN sort of makes this JIRA obsolete, but I'd like some verification before closing it. Maintaining cluster information across multiple job submissions --- Key: MAPREDUCE-302 URL: https://issues.apache.org/jira/browse/MAPREDUCE-302 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Lohit Vijayarenu Assignee: dhruba borthakur Could we have a way to maintain cluster state across multiple job submissions? Consider a scenario where we run multiple jobs in iteration on a cluster back to back. The nature of the job is the same, but input/output might differ. Now, if a node is blacklisted in one iteration of a job run, it would be useful to maintain this information and blacklist this node for the next iteration as well. Another situation we saw: if there are fewer failures than mapred.map.max.attempts in each iteration, some nodes are never marked for blacklisting. But if we consider two or three iterations, these nodes fail all jobs and should be taken out of the cluster. This hampers the overall performance of the job. Could we have config variables, something which matches a job type (provided by the user) and maintains the cluster status for that job type alone? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-165) the map task output servlet doesn't protect against .. attacks
[ https://issues.apache.org/jira/browse/MAPREDUCE-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated MAPREDUCE-165: --- Labels: newbie security (was: security) the map task output servlet doesn't protect against .. attacks Key: MAPREDUCE-165 URL: https://issues.apache.org/jira/browse/MAPREDUCE-165 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Owen O'Malley Labels: newbie, security The servlet we use to export the map outputs doesn't protect itself against .. attacks. However, because the code adds a /file.out.index and /file.out to it, it can only be used to read files with those names. -- This message was sent by Atlassian JIRA (v6.2#6252)
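The usual defense against the `..` attack discussed above is to resolve the requested path and reject anything that escapes the servlet's base directory. A sketch of that check (not the actual TaskTracker code; names are illustrative):

```python
import os

def safe_resolve(base_dir, requested):
    """Resolve `requested` relative to base_dir and refuse any path
    that escapes it via '..' or symlinks. Sketch of the generic
    defense, not Hadoop's map-output servlet implementation."""
    base = os.path.realpath(base_dir)
    full = os.path.realpath(os.path.join(base, requested))
    if full != base and not full.startswith(base + os.sep):
        raise ValueError("path escapes base directory: %r" % requested)
    return full
```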
[jira] [Updated] (MAPREDUCE-489) Force the task tracker to exit when the task is complete, prevents nodes from dying due to resource starvation from improperly written map/reduce tasks
[ https://issues.apache.org/jira/browse/MAPREDUCE-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated MAPREDUCE-489: --- Labels: newbie (was: ) Force the task tracker to exit when the task is complete, prevents nodes from dying due to resource starvation from improperly written map/reduce tasks Key: MAPREDUCE-489 URL: https://issues.apache.org/jira/browse/MAPREDUCE-489 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Jason Priority: Minor Labels: newbie Original Estimate: 1h Remaining Estimate: 1h We have map/reduce jobs that sometimes run additional threads that are not at daemon priority, and these threads prevent the Task from properly exiting. When enough of these accumulate, the node falls over. The included patch forces the Tasks to exit when completed.
Index: src/java/org/apache/hadoop/mapred/TaskTracker.java
===
--- src/java/org/apache/hadoop/mapred/TaskTracker.java (revision 608611)
+++ src/java/org/apache/hadoop/mapred/TaskTracker.java (working copy)
@@ -1801,6 +1801,8 @@
       // This assumes that on return from Task.run()
       // there is no more logging done.
       LogManager.shutdown();
+
+      System.exit(0);
     }
   }
 }
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-489) Force the task tracker to exit when the task is complete, prevents nodes from dying due to resource starvation from improperly written map/reduce tasks
[ https://issues.apache.org/jira/browse/MAPREDUCE-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065462#comment-14065462 ] Allen Wittenauer commented on MAPREDUCE-489: Doing a quick pass through the source seems to indicate that we don't always call System.exit on the way out. We probably should. Force the task tracker to exit when the task is complete, prevents nodes from dying due to resource starvation from improperly written map/reduce tasks Key: MAPREDUCE-489 URL: https://issues.apache.org/jira/browse/MAPREDUCE-489 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Jason Priority: Minor Labels: newbie Original Estimate: 1h Remaining Estimate: 1h We have map/reduce jobs that sometimes run additional threads that are not at daemon priority, and these threads prevent the Task from properly exiting. When enough of these accumulate, the node falls over. The included patch forces the Tasks to exit when completed.
Index: src/java/org/apache/hadoop/mapred/TaskTracker.java
===
--- src/java/org/apache/hadoop/mapred/TaskTracker.java (revision 608611)
+++ src/java/org/apache/hadoop/mapred/TaskTracker.java (working copy)
@@ -1801,6 +1801,8 @@
       // This assumes that on return from Task.run()
       // there is no more logging done.
       LogManager.shutdown();
+
+      System.exit(0);
     }
   }
 }
-- This message was sent by Atlassian JIRA (v6.2#6252)
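The gist of the patch is a hard exit after the task body returns, so stray non-daemon threads cannot keep the process alive. A Python sketch of the same idea (`os._exit` playing the role of `System.exit`; the helper name is illustrative, not Hadoop code):

```python
import os
import sys

def run_task_then_exit(task):
    """After the task body returns, flush output (standing in for
    LogManager.shutdown()) and hard-exit so leftover non-daemon
    threads cannot keep the process -- and its slot -- occupied.
    Sketch of the MAPREDUCE-489 idea, not the Hadoop implementation."""
    try:
        task()
    finally:
        sys.stdout.flush()  # analogous to LogManager.shutdown()
        os._exit(0)         # unconditional exit, like System.exit(0)
```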
[jira] [Resolved] (MAPREDUCE-344) Remove dead code block in JobInProgress.completedTask
[ https://issues.apache.org/jira/browse/MAPREDUCE-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-344. Resolution: Fixed Stale. Remove dead code block in JobInProgress.completedTask -- Key: MAPREDUCE-344 URL: https://issues.apache.org/jira/browse/MAPREDUCE-344 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Arun C Murthy Since the taskCommitThread ensures that one and only one task of a given TIP is marked as SUCCEEDED, we don't need the code block in JobInProgress.completedTask which checks if the TIP is complete and then just marks the task as complete:
{noformat}
// Sanity check: is the TIP already complete?
if (tip.isComplete()) {
  // Mark this task as KILLED
  tip.alreadyCompletedTask(taskid);
  // Let the JobTracker cleanup this taskid if the job isn't running
  if (this.status.getRunState() != JobStatus.RUNNING) {
    jobtracker.markCompletedTaskAttempt(status.getTaskTracker(), taskid);
  }
  return false;
}
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAPREDUCE-390) Corner case exists in detecting Java process deaths that might lead to orphan pipes processes lying around in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved MAPREDUCE-390. Resolution: Fixed I believe this one has actually been resolved with the rewrite of the pipes interface. Corner case exists in detecting Java process deaths that might lead to orphan pipes processes lying around in memory Key: MAPREDUCE-390 URL: https://issues.apache.org/jira/browse/MAPREDUCE-390 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Devaraj Das Priority: Minor In HADOOP-2092, the child pipes process periodically pings the parent Java process to find out whether it is alive. The ping cycle is 5 seconds. Consider the following scenario: 1) The Java task dies at the beginning of the ping cycle 2) A new Java task starts and binds to the same port as the earlier Java task's port 3) The pipes process wakes up and does a ping - it will still be successful since the port number hasn't changed This will lead to orphan processes lying around in memory. The detection of parent process deaths can be made more reliable at least on Unix'ish platforms by checking whether the parent process ID is 1, and if so exit. This will take care of the most common platform that hadoop is run on. For non-unix platforms, the existing ping mechanism can be retained. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
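The parent-death check proposed above relies on a Unix property: an orphaned child is reparented to init, so a parent PID of 1 means the original parent is gone. A minimal sketch of that check (Unix-only, as the issue itself notes):

```python
import os

def parent_died():
    """On Unix, a child whose parent has exited is reparented to init
    (PID 1), so getppid() == 1 signals the parent is gone -- the more
    reliable alternative to the port-ping described above. Sketch only;
    some init systems and containers complicate this assumption."""
    return os.getppid() == 1
```

A pipes child would poll this instead of (or in addition to) pinging the parent's port, avoiding the port-reuse race in the scenario above.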
[jira] [Commented] (MAPREDUCE-2593) Random read benchmark for DFS
[ https://issues.apache.org/jira/browse/MAPREDUCE-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065470#comment-14065470 ] Allen Wittenauer commented on MAPREDUCE-2593: - Ping! Random read benchmark for DFS - Key: MAPREDUCE-2593 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2593 Project: Hadoop Map/Reduce Issue Type: New Feature Components: mrv2 Reporter: Raghu Angadi Assignee: Dave Thompson Attachments: HDFS-236.patch, RndRead-TestDFSIO-061011.patch, RndRead-TestDFSIO-MR2593-trunk121211.patch, RndRead-TestDFSIO.patch We should have at least one random read benchmark that can be run with rest of Hadoop benchmarks regularly. Please provide benchmark ideas or requirements. -- This message was sent by Atlassian JIRA (v6.2#6252)
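The shape of the benchmark being requested is: seek to random offsets, read a fixed chunk, and time the loop. A toy local-file sketch (illustrative only; the real TestDFSIO variant runs as a MapReduce job over HDFS):

```python
import os
import random
import time

def random_read_seconds(path, reads=64, chunk=4096, seed=0):
    """Time `reads` random seek+read pairs over one local file.
    A toy sketch of the random-read benchmark idea, not TestDFSIO."""
    size = os.path.getsize(path)
    rng = random.Random(seed)  # fixed seed so offsets are repeatable
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(reads):
            f.seek(rng.randrange(max(size - chunk, 1)))
            f.read(chunk)
    return time.perf_counter() - start
```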