[jira] [Commented] (MAPREDUCE-6555) TestMRAppMaster fails on trunk

2015-11-25 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15026767#comment-15026767
 ] 

Junping Du commented on MAPREDUCE-6555:
---

bq. I'm checking by running the test case. Please wait a moment.
The Jenkins test report above already shows it, doesn't it?

> TestMRAppMaster fails on trunk
> --
>
> Key: MAPREDUCE-6555
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6555
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: Junping Du
> Attachments: MAPREDUCE-6555.patch
>
>
> Observed in QA report of YARN-3840 
> {noformat}
> Running org.apache.hadoop.mapreduce.v2.app.TestMRAppMaster
> Tests run: 9, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 20.699 sec <<< FAILURE! - in org.apache.hadoop.mapreduce.v2.app.TestMRAppMaster
> testMRAppMasterMidLock(org.apache.hadoop.mapreduce.v2.app.TestMRAppMaster)  Time elapsed: 0.474 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at org.apache.hadoop.mapreduce.v2.app.TestMRAppMaster.testMRAppMasterMidLock(TestMRAppMaster.java:174)
> testMRAppMasterSuccessLock(org.apache.hadoop.mapreduce.v2.app.TestMRAppMaster)  Time elapsed: 0.175 sec  <<< ERROR!
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.FileNotFoundException: File file:/home/varun/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/staging/history/done_intermediate/TestAppMasterUser/job_1317529182569_0004-1448100479292-TestAppMasterUser-%3Cmissing+job+name%3E-1448100479413-0-0-SUCCEEDED-default-1448100479292.jhist_tmp does not exist
>   at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:640)
>   at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:866)
>   at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:630)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
>   at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:372)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:513)
>   at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.moveTmpToDone(JobHistoryEventHandler.java:1346)
>   at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processDoneFiles(JobHistoryEventHandler.java:1154)
>   at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>   at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>   at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
>   at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
>   at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStop(MRAppMaster.java:1751)
>   at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>   at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.stop(MRAppMaster.java:1247)
>   at org.apache.hadoop.mapreduce.v2.app.TestMRAppMaster.testMRAppMasterSuccessLock(TestMRAppMaster.java:254)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6557) Some tests in mapreduce-client-app are writing outside of target

2015-11-25 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15026743#comment-15026743
 ] 

Junping Du commented on MAPREDUCE-6557:
---

Thanks for the patch, [~ajisakaa]! As I just mentioned in MAPREDUCE-6555, maybe 
we also want to fix the issue that the directory is not cleaned up after the 
tests finish?

> Some tests in mapreduce-client-app are writing outside of target
> 
>
> Key: MAPREDUCE-6557
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6557
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: build
>Affects Versions: 3.0.0
>Reporter: Allen Wittenauer
>Assignee: Akira AJISAKA
>Priority: Blocker
> Attachments: MAPREDUCE-6557.00.patch
>
>
> There is a staging directory appearing. It should not.





[jira] [Commented] (MAPREDUCE-6555) TestMRAppMaster fails on trunk

2015-11-25 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15026732#comment-15026732
 ] 

Junping Du commented on MAPREDUCE-6555:
---

bq. TestMRAppMaster is using directory other than test.build.data, so the 
intermediate files are checked by Apache Rat. We should fix it in a separate 
jira.
Agreed. What's worse, the test doesn't clean up the directory at the end, 
because cleanup() is called before every test rather than after. Let's file a 
separate JIRA to fix both.
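
The cleanup-ordering problem can be illustrated with a small sketch. This is not the actual TestMRAppMaster code: the class `StagingCleanupSketch`, its helper methods, and the `marker.tmp` file name are all hypothetical, and plain Java is used instead of JUnit so the example stands alone. Cleaning up only before each run (what a JUnit 4 @Before method does) leaves the last run's directory behind, while cleaning up after each run (what @After does) leaves nothing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch; not the actual TestMRAppMaster code.
class StagingCleanupSketch {

  // Recursively delete a directory tree if it exists.
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order so that children are deleted before their parents.
      walk.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }

  // Stand-in for a test body that writes into the staging directory.
  static void fakeTestBody(Path stagingDir) throws IOException {
    Files.createDirectories(stagingDir);
    Files.createFile(stagingDir.resolve("marker.tmp"));
  }

  // Cleanup runs BEFORE each run: the last run leaves files behind.
  // Returns true if the staging directory still exists afterwards.
  static boolean runWithCleanupBefore(Path stagingDir, int runs) throws IOException {
    for (int i = 0; i < runs; i++) {
      deleteRecursively(stagingDir);
      fakeTestBody(stagingDir);
    }
    return Files.exists(stagingDir);
  }

  // Cleanup runs AFTER each run, even if the body throws: nothing is left behind.
  // Returns true if the staging directory still exists afterwards.
  static boolean runWithCleanupAfter(Path stagingDir, int runs) throws IOException {
    for (int i = 0; i < runs; i++) {
      try {
        fakeTestBody(stagingDir);
      } finally {
        deleteRecursively(stagingDir);
      }
    }
    return Files.exists(stagingDir);
  }
}
```

With cleanup before, the directory survives the final run and would be picked up by checks such as Apache Rat; with cleanup after, it does not.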



[jira] [Assigned] (MAPREDUCE-6545) Test committer.commitJob() behavior during committing when MR AM get failed.

2015-11-24 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du reassigned MAPREDUCE-6545:
-

Assignee: Junping Du

> Test committer.commitJob() behavior during committing when MR AM get failed.
> 
>
> Key: MAPREDUCE-6545
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6545
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Junping Du
>Assignee: Junping Du
>
> In MAPREDUCE-5485, we are adding an additional API (isCommitJobRepeatable) to 
> allow job commit to tolerate AM failure in some cases (like 
> FileOutputCommitter with the v2 algorithm). Although we have unit tests 
> covering most of the flows, we may want a complete end-to-end test to verify 
> the whole workflow.
> The scenarios include:
> 1. For FileOutputCommitter (or some subclass), emulate an MR AM failure or 
> restart while commitJob() is in progress.
> 2. Check the different behavior for v1 and v2 (whether isCommitJobRepeatable() 
> is supported or not).
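
The scenario above could be prototyped roughly as follows. This is only a sketch under stated assumptions: none of the types are Hadoop classes, `Committer` is a two-method stand-in for the real OutputCommitter, and the "AM failure" is emulated by throwing from the first commitJob() call. A repeatable committer (v2-like) is retried by the second attempt; a non-repeatable one (v1-like) causes the job to fail.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical end-to-end sketch; none of these types are Hadoop classes.
class CommitFailureScenarioSketch {

  // Stand-in for a committer; only the two methods the scenario needs.
  interface Committer {
    boolean isCommitJobRepeatable();
    void commitJob();
  }

  // Committer whose first commitJob() call dies midway, emulating an AM crash.
  static Committer crashingOnce(boolean repeatable, AtomicInteger commitCalls) {
    return new Committer() {
      @Override
      public boolean isCommitJobRepeatable() {
        return repeatable;
      }

      @Override
      public void commitJob() {
        if (commitCalls.incrementAndGet() == 1) {
          throw new IllegalStateException("simulated AM failure during commitJob()");
        }
      }
    };
  }

  // Runs up to maxAttempts AM attempts; returns true if the job ends up committed.
  static boolean runJob(Committer committer, int maxAttempts) {
    boolean commitStarted = false;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (commitStarted && !committer.isCommitJobRepeatable()) {
        // A previous attempt died mid-commit and the commit cannot be redone.
        return false;
      }
      commitStarted = true;
      try {
        committer.commitJob();
        return true;
      } catch (IllegalStateException simulatedCrash) {
        // Fall through to the next AM attempt.
      }
    }
    return false;
  }
}
```

A real end-to-end test would of course have to kill and restart an actual AM; the sketch only captures the expected outcomes for the two cases.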





[jira] [Updated] (MAPREDUCE-6555) TestMRAppMaster fails on trunk

2015-11-24 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6555:
--
Target Version/s: 3.0.0



[jira] [Updated] (MAPREDUCE-6555) TestMRAppMaster fails on trunk

2015-11-24 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6555:
--
Status: Patch Available  (was: Open)



[jira] [Commented] (MAPREDUCE-6555) TestMRAppMaster fails on trunk

2015-11-24 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024466#comment-15024466
 ] 

Junping Du commented on MAPREDUCE-6555:
---

The previous failure occurs because, since MAPREDUCE-5485, we allow an MR job 
to retry after an AM failure during the commit stage (if the committer is 
repeatable). So MRAppMaster.initAndStartAppMaster() won't throw a fatal 
exception when a commit start file exists (which hints that a previous AM 
failed in the middle of commit) for FileOutputCommitter, which defaults to the 
version 2 algorithm in trunk. I think we don't need this fix in branch-2, as 
the default version in branch-2 is 1.
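
The decision described above can be sketched as follows. This is a simplified illustration, not the real MRAppMaster logic: the class, the `Decision` enum, the `Committer` interface, and the `COMMIT_STARTED` marker name are assumptions made for the example, and any success or failure markers the real code may also check are deliberately left out.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the startup decision; not the real MRAppMaster code.
class CommitRecoverySketch {

  enum Decision { START_NORMALLY, RETRY_COMMIT, FAIL_JOB }

  // Minimal stand-in for a committer that may support repeated commits.
  interface Committer {
    boolean isCommitJobRepeatable();
  }

  // Marker file name is an assumption for this sketch.
  static final String COMMIT_STARTED = "COMMIT_STARTED";

  static Decision decide(Path stagingDir, Committer committer) {
    // A leftover start marker hints that a previous AM died mid-commit.
    boolean commitStarted = Files.exists(stagingDir.resolve(COMMIT_STARTED));
    if (!commitStarted) {
      return Decision.START_NORMALLY;
    }
    // Only a repeatable committer can safely redo the commit.
    return committer.isCommitJobRepeatable() ? Decision.RETRY_COMMIT : Decision.FAIL_JOB;
  }
}
```

Under this reading, a test that expects a fatal error whenever the start marker exists would pass with a non-repeatable committer and fail with a repeatable one, which matches the difference between branch-2 and trunk described above.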



[jira] [Updated] (MAPREDUCE-6555) TestMRAppMaster fails on trunk

2015-11-24 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6555:
--
Attachment: MAPREDUCE-6555.patch

Uploaded a quick fix for the test failure.



[jira] [Assigned] (MAPREDUCE-6555) TestMRAppMaster fails on trunk

2015-11-24 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du reassigned MAPREDUCE-6555:
-

Assignee: Junping Du



[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-24 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024401#comment-15024401
 ] 

Junping Du commented on MAPREDUCE-5485:
---

Sure. Thanks, [~ajisakaa], for the reminder!

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Fix For: 2.7.3
>
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch, MAPREDUCE-5485-v4.1.patch, MAPREDUCE-5485-v4.patch, 
> MAPREDUCE-5485-v5-branch-2.7.patch, MAPREDUCE-5485-v5.patch
>
>
> There is a chance that MRAppMaster crashes during job commit, or that a 
> NodeManager restart causes the committing AM to exit due to container 
> expiry. In these cases, the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-16 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Attachment: MAPREDUCE-5485-v5-branch-2.7.patch

Uploaded a patch for branch-2.7 in case we want to commit it for 2.7.3.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch, MAPREDUCE-5485-v4.1.patch, MAPREDUCE-5485-v4.patch, 
> MAPREDUCE-5485-v5-branch-2.7.patch, MAPREDUCE-5485-v5.patch
>
>
> There is a chance that MRAppMaster crashes during job commit, or that a 
> NodeManager restart causes the committing AM to exit due to container 
> expiry. In these cases, the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-16 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007167#comment-15007167
 ] 

Junping Du commented on MAPREDUCE-5485:
---

bq. Junping Du, there are some findbugs and ut failures, mind checking ?
The unit test failure is unrelated and was just fixed in MAPREDUCE-6533. The 
findbugs warning is a false positive: 
org.apache.hadoop.mapred.FileOutputCommitter.isCommitJobRepeatable(Context) 
overrides the right method in the parent abstract class (findbugs wrongly 
thinks it overrides another abstract method). So I think we should ignore 
these warnings.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch, MAPREDUCE-5485-v4.1.patch, MAPREDUCE-5485-v4.patch, 
> MAPREDUCE-5485-v5.patch
>
>
> There is a chance that MRAppMaster crashes during job commit, or that a 
> NodeManager restart causes the committing AM to exit due to container 
> expiry. In these cases, the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Updated] (MAPREDUCE-6542) HistoryViewer use SimpleDateFormat,But SimpleDateFormat is not threadsafe

2015-11-12 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6542:
--
Hadoop Flags:   (was: Reviewed)

> HistoryViewer use SimpleDateFormat,But SimpleDateFormat is not threadsafe
> -
>
> Key: MAPREDUCE-6542
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6542
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Affects Versions: 2.2.0, 2.7.1
> Environment: CentOS6.5 Hadoop  
>Reporter: zhangyubiao
>Assignee: zhangyubiao
> Attachments: MAPREDUCE-6542-v2.patch, MAPREDUCE-6542.patch
>
>
> Previously I used SimpleDateFormat to parse the JobHistory file:
>
> private static final SimpleDateFormat dateFormat =
>     new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
>
> public static String getJobDetail(JobInfo job) {
>   StringBuffer jobDetails = new StringBuffer("");
>   SummarizedJob ts = new SummarizedJob(job);
>   jobDetails.append(job.getJobId().toString().trim()).append("\t");
>   jobDetails.append(job.getUsername()).append("\t");
>   jobDetails.append(job.getJobname().replaceAll("\\n", "")).append("\t");
>   jobDetails.append(job.getJobQueueName()).append("\t");
>   jobDetails.append(job.getPriority()).append("\t");
>   jobDetails.append(job.getJobConfPath()).append("\t");
>   jobDetails.append(job.getUberized()).append("\t");
>   jobDetails.append(dateFormat.format(job.getSubmitTime())).append("\t");
>   jobDetails.append(dateFormat.format(job.getLaunchTime())).append("\t");
>   jobDetails.append(dateFormat.format(job.getFinishTime())).append("\t");
>   return jobDetails.toString();
> }
>
> But when I queried the submit time and launch time in Hive and compared them 
> with the times in the JobHistory file, I found that submitTime and launchTime 
> were wrong. After I switched to FastDateFormat to format the times, they 
> became correct.
>  
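The symptom described above is consistent with sharing one SimpleDateFormat 
instance across threads, since that class keeps mutable internal state. Below 
is a minimal sketch of a thread-safe alternative. It assumes Java 8 or later 
and swaps in java.time.format.DateTimeFormatter in place of the FastDateFormat 
the reporter mentions; the class name, the method name, and the year part of 
the pattern are illustrative assumptions and are not taken from the patch.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of HistoryViewer.
class ThreadSafeDateFormatSketch {
    // DateTimeFormatter is immutable, so one shared instance is safe to use
    // from many threads at once (unlike a shared SimpleDateFormat).
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Formats a timestamp given in milliseconds since the epoch.
    static String format(long epochMillis, ZoneId zone) {
        return FORMAT.format(Instant.ofEpochMilli(epochMillis).atZone(zone));
    }

    public static void main(String[] args) {
        System.out.println(format(System.currentTimeMillis(), ZoneId.systemDefault()));
    }
}
```

Another common option is one SimpleDateFormat per thread via ThreadLocal, 
which keeps the old class but avoids sharing its state.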





[jira] [Updated] (MAPREDUCE-6542) HistoryViewer use SimpleDateFormat,But SimpleDateFormat is not threadsafe

2015-11-12 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6542:
--
Status: Patch Available  (was: Open)

> HistoryViewer use SimpleDateFormat,But SimpleDateFormat is not threadsafe
> -
>
> Key: MAPREDUCE-6542
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6542
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Affects Versions: 2.7.1, 2.2.0
> Environment: CentOS6.5 Hadoop  
>Reporter: zhangyubiao
>Assignee: zhangyubiao
> Attachments: MAPREDUCE-6542-v2.patch, MAPREDUCE-6542.patch
>
>
> Previously I used SimpleDateFormat to parse the JobHistory file:
>
> private static final SimpleDateFormat dateFormat =
>     new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
>
> public static String getJobDetail(JobInfo job) {
>   StringBuffer jobDetails = new StringBuffer("");
>   SummarizedJob ts = new SummarizedJob(job);
>   jobDetails.append(job.getJobId().toString().trim()).append("\t");
>   jobDetails.append(job.getUsername()).append("\t");
>   jobDetails.append(job.getJobname().replaceAll("\\n", "")).append("\t");
>   jobDetails.append(job.getJobQueueName()).append("\t");
>   jobDetails.append(job.getPriority()).append("\t");
>   jobDetails.append(job.getJobConfPath()).append("\t");
>   jobDetails.append(job.getUberized()).append("\t");
>   jobDetails.append(dateFormat.format(job.getSubmitTime())).append("\t");
>   jobDetails.append(dateFormat.format(job.getLaunchTime())).append("\t");
>   jobDetails.append(dateFormat.format(job.getFinishTime())).append("\t");
>   return jobDetails.toString();
> }
>
> But when I queried the submit time and launch time in Hive and compared them 
> with the times in the JobHistory file, I found that submitTime and launchTime 
> were wrong. After I switched to FastDateFormat to format the times, they 
> became correct.
>  





[jira] [Commented] (MAPREDUCE-6542) HistoryViewer use SimpleDateFormat,But SimpleDateFormat is not threadsafe

2015-11-12 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002265#comment-15002265
 ] 

Junping Du commented on MAPREDUCE-6542:
---

Hi [~piaoyu zhang], thanks for the patch! The "Reviewed" flag is only set after 
a committer gives +1 on your patch. 
In addition, you should click "Submit Patch" when you upload a new patch, so 
that the JIRA status becomes "Patch Available" and Jenkins can automatically 
trigger tests on your patch. I will fix it for you this time.


> HistoryViewer use SimpleDateFormat,But SimpleDateFormat is not threadsafe
> -
>
> Key: MAPREDUCE-6542
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6542
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Affects Versions: 2.2.0, 2.7.1
> Environment: CentOS6.5 Hadoop  
>Reporter: zhangyubiao
>Assignee: zhangyubiao
> Attachments: MAPREDUCE-6542-v2.patch, MAPREDUCE-6542.patch
>
>
> Previously I used SimpleDateFormat to parse the JobHistory file:
>
> private static final SimpleDateFormat dateFormat =
>     new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
>
> public static String getJobDetail(JobInfo job) {
>   StringBuffer jobDetails = new StringBuffer("");
>   SummarizedJob ts = new SummarizedJob(job);
>   jobDetails.append(job.getJobId().toString().trim()).append("\t");
>   jobDetails.append(job.getUsername()).append("\t");
>   jobDetails.append(job.getJobname().replaceAll("\\n", "")).append("\t");
>   jobDetails.append(job.getJobQueueName()).append("\t");
>   jobDetails.append(job.getPriority()).append("\t");
>   jobDetails.append(job.getJobConfPath()).append("\t");
>   jobDetails.append(job.getUberized()).append("\t");
>   jobDetails.append(dateFormat.format(job.getSubmitTime())).append("\t");
>   jobDetails.append(dateFormat.format(job.getLaunchTime())).append("\t");
>   jobDetails.append(dateFormat.format(job.getFinishTime())).append("\t");
>   return jobDetails.toString();
> }
>
> But when I queried the submit time and launch time in Hive and compared them 
> with the times in the JobHistory file, I found that submitTime and launchTime 
> were wrong. After I switched to FastDateFormat to format the times, they 
> became correct.
>  





[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-11 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000916#comment-15000916
 ] 

Junping Du commented on MAPREDUCE-5485:
---

bq. About the overall test. The main overall change is to allow the retry AM to 
continue after seeing an in-progress commit from the previous AM. It seems 
incomplete to not have a test for that. 
I agree that it is better to cover as many cases as possible in unit tests. But 
due to limitations of our current unit test framework, we miss many functional 
tests, especially those related to MR AM failure/restart. For example, in the 
rolling upgrade story, we don't have tests that check AM failover during NM/RM 
restart. Instead, we have to split the whole functionality into pieces and test 
each piece. Sadly, this is sometimes not good enough, which is why we still 
need to verify that the feature works end to end on a real cluster.

bq. However if you think that we don't have existing infra for that code path 
then we should create a follow up jira to add that infra and relevant tests. I 
have not followed the MR AM code changes for a while and so I cannot recall off 
the top of my head any existing test cases. Maybe other committers may 
have some ideas.
Just filed MAPREDUCE-6545 to track the additional test effort.

bq. With that caveat, the latest patch looks good to me. Thanks for your 
patience through the reviews.
Thanks Bikas for your careful review.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch, MAPREDUCE-5485-v4.1.patch, MAPREDUCE-5485-v4.patch, 
> MAPREDUCE-5485-v5.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Moved] (MAPREDUCE-6545) Test committer.commitJob() behavior during committing when MR AM get failed.

2015-11-11 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du moved YARN-4346 to MAPREDUCE-6545:
-

Target Version/s:   (was: 2.8.0)
 Key: MAPREDUCE-6545  (was: YARN-4346)
 Project: Hadoop Map/Reduce  (was: Hadoop YARN)

> Test committer.commitJob() behavior during committing when MR AM get failed.
> 
>
> Key: MAPREDUCE-6545
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6545
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Junping Du
>
> In MAPREDUCE-5485, we are adding a new API (isCommitJobRepeatable) so that 
> job commit can tolerate AM failure in some cases (such as 
> FileOutputCommitter with the v2 algorithm). Although we have unit tests that 
> cover most of the flows, we may want a complete end-to-end test to verify 
> the whole workflow.
> The scenarios include:
> 1. For FileOutputCommitter (or a subclass), emulate an MR AM failure or 
> restart while commitJob() is in progress.
> 2. Check the different behavior for v1 and v2 (whether isCommitJobRepeatable() 
> is supported or not).
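The scenarios above can be sketched as a small decision table. This is only an 
illustration under stated assumptions: the enum names and the onAmRestart 
method are hypothetical and do not come from the MRAppMaster code, and the 
commitRepeatable flag stands in for the result of isCommitJobRepeatable().

```java
// Hypothetical model of what a restarted AM might decide; not Hadoop code.
class CommitRecoverySketch {
    // Commit state left behind by the previous AM attempt (assumed names).
    enum Marker { NONE, COMMIT_STARTED, COMMIT_SUCCEEDED, COMMIT_FAILED }

    // What the new AM attempt does next (assumed names).
    enum Action { RUN_JOB, REDO_COMMIT, REPORT_SUCCESS, FAIL_JOB }

    static Action onAmRestart(Marker marker, boolean commitRepeatable) {
        switch (marker) {
            case COMMIT_SUCCEEDED:
                return Action.REPORT_SUCCESS; // nothing left to commit
            case COMMIT_FAILED:
                return Action.FAIL_JOB;       // nothing to recover
            case COMMIT_STARTED:
                // An interrupted commit is only redone when the committer
                // declares that repeating the commit is safe.
                return commitRepeatable ? Action.REDO_COMMIT : Action.FAIL_JOB;
            default:
                return Action.RUN_JOB;        // commit never started
        }
    }

    public static void main(String[] args) {
        System.out.println(onAmRestart(Marker.COMMIT_STARTED, true));
        System.out.println(onAmRestart(Marker.COMMIT_STARTED, false));
    }
}
```

An end-to-end test along these lines would exercise the COMMIT_STARTED row 
twice, once with a repeatable committer and once without.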





[jira] [Created] (MAPREDUCE-6544) yarn rmadmin -updateNodeResource doesn't work

2015-11-11 Thread Junping Du (JIRA)
Junping Du created MAPREDUCE-6544:
-

 Summary: yarn rmadmin -updateNodeResource doesn't work
 Key: MAPREDUCE-6544
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6544
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.1
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


YARN-313 added a CLI to update node resources. It works fine for batch-mode 
updates. However, for a single-node update, "yarn rmadmin -updateNodeResource" 
fails because the resource is not set properly in the request that is sent.





[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-11 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000727#comment-15000727
 ] 

Junping Du commented on MAPREDUCE-5485:
---

The test failure is unrelated, and I believe nothing can be done about the 
findbugs/checkstyle issues. 

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch, MAPREDUCE-5485-v4.1.patch, MAPREDUCE-5485-v4.patch, 
> MAPREDUCE-5485-v5.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-11 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Attachment: MAPREDUCE-5485-v5.patch

Uploaded the v5 patch to address Bikas's comments above.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch, MAPREDUCE-5485-v4.1.patch, MAPREDUCE-5485-v4.patch, 
> MAPREDUCE-5485-v5.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-11 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000472#comment-15000472
 ] 

Junping Du commented on MAPREDUCE-5485:
---

bq. This introduces duplication of code for checking commit status and can 
cause a bug if the logic changes in either place. And also makes extra RPC 
calls to HDFS for checking file status - which is avoidable. Moving the code to 
the place where earlier we were failing due to in-progress commit, will allow 
this method to do exactly as its name suggests - cleanup in progress commit 
markers. Does that clarify?
Thanks for clarifying. That sounds good. Will update in the v5 patch.

bq. 1) Test MR Appmaster new functionality that allows commit to proceed in a 
retried AM if commit is repeatable. 
In theory, I agree it would be nice to have a fully functional test. However, 
I don't think that is easy for this case. Do we have other tests of job 
commit (not retry) that launch a fully functional AppMaster? If not, I would 
prefer to add it later in another JIRA, once we have more ideas on how to do it.

bq. 2) Test in FileOutputCommitter that for repeatable commit - a 
FileNotFoundException is not counted as an error (new behavior).
Can you check FileOutputCommitter#testCommitterRepeatableV1() and 
FileOutputCommitter#testCommitterRepeatableV2()?
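To illustrate the behavior in item 2, here is a minimal sketch that uses 
java.nio rather than the Hadoop FileSystem API. The class and method names are 
hypothetical. It only shows the idea that a missing path is tolerated when the 
commit is repeatable (an earlier attempt may already have removed it) and is 
still an error otherwise.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical helper, not taken from FileOutputCommitter.
class RepeatableCleanupSketch {
    // Deletes the pending path. Returns true if this call removed it, and
    // false if it was already gone and the commit is repeatable.
    static boolean deletePending(Path pending, boolean commitRepeatable) {
        try {
            Files.delete(pending);
            return true;
        } catch (NoSuchFileException e) {
            if (commitRepeatable) {
                // A previous commit attempt probably cleaned this up already.
                return false;
            }
            throw new UncheckedIOException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```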

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch, MAPREDUCE-5485-v4.1.patch, MAPREDUCE-5485-v4.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Commented] (MAPREDUCE-6533) testDetermineCacheVisibilities of TestClientDistributedCacheManager is broken

2015-11-11 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000263#comment-15000263
 ] 

Junping Du commented on MAPREDUCE-6533:
---

The test failure for testDetermineCache() is quite annoying. Thanks 
[~lichangleo] for working on it and [~jlowe] for reviewing it. 
The v4 patch looks good overall except for one place:
{code}
+  private static final Path TEST_ROOT_DIR =
+  new Path(System.getProperty("test.build.data", "/tmp"));
{code}
We should replace "/tmp" with System.getProperty("java.io.tmpdir") so that 
tests run smoothly on platforms other than Linux.
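A minimal sketch of the suggested fallback, using java.nio.file.Path instead of 
org.apache.hadoop.fs.Path so that it runs without Hadoop on the classpath. The 
class and method names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper showing only the property lookup.
class TestRootDirSketch {
    // Prefer -Dtest.build.data if it is set; otherwise fall back to the
    // JVM's platform-specific temp directory instead of a hard-coded "/tmp".
    static Path testRootDir() {
        String fallback = System.getProperty("java.io.tmpdir");
        return Paths.get(System.getProperty("test.build.data", fallback));
    }

    public static void main(String[] args) {
        System.out.println(testRootDir());
    }
}
```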

> testDetermineCacheVisibilities of TestClientDistributedCacheManager is broken
> -
>
> Key: MAPREDUCE-6533
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6533
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: MAPREDUCE-6533.2.patch, MAPREDUCE-6533.3.patch, 
> MAPREDUCE-6533.4.patch, MAPREDUCE-6533.4.patch, MAPREDUCE-6533.patch
>
>






[jira] [Commented] (MAPREDUCE-6505) Migrate io test cases

2015-11-11 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000240#comment-15000240
 ] 

Junping Du commented on MAPREDUCE-6505:
---

I think this JIRA should be moved to the Hadoop Common project rather than the 
MapReduce project, as all the changes are in hadoop-common.

> Migrate io test cases
> -
>
> Key: MAPREDUCE-6505
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6505
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: test
>Reporter: Dustin Cote
>Assignee: Dustin Cote
>Priority: Trivial
> Attachments: MAPREDUCE-6505-1.patch, MAPREDUCE-6505-2.patch, 
> MAPREDUCE-6505-3.patch
>
>
> Migrating just the io test cases 





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-11 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Attachment: MAPREDUCE-5485-v4.1.patch

Fixed minor issues with whitespace, findbugs, etc. in the v4.1 patch.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch, MAPREDUCE-5485-v4.1.patch, MAPREDUCE-5485-v4.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Attachment: MAPREDUCE-5485-v4.patch

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch, MAPREDUCE-5485-v4.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-10 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998995#comment-14998995
 ] 

Junping Du commented on MAPREDUCE-5485:
---

Thanks [~bikassaha] for the comments!
bq. Is the above check too early in the code. E.g. IIRC, at this point we have 
not checked whether the previous job commit was succeeded or failed - in which 
case we cannot recover and there is nothing to do.
cleanupInterruptedCommit() already checks whether the previous job commit 
succeeded or failed. Am I missing anything here?

bq. Also, we have already changed the startCommit operation to be repeatable 
via the overwrite flag. After that is there a need to delete the files upfront. 
Delete may be an expensive operation on some cloud stores.
I don't see much difference between deleting and then writing a small or empty 
file and overwriting an existing file (updating its timestamp and contents) in 
any cloud store. I would prefer not to add more if-else cases to logic that 
already seems complicated to me. If we do observe performance differences on a 
real cluster, we can optimize then. What do you think?

bq. Mapred javadoc fixes are missing. Also there are some typos in there. E.g.
Nice catch! Will fix it in the v4 patch.

bq. This part of the code change could use some tests.
OK. Will add TestFileOutputCommitter for the class in the mapred package.

bq. Tests for repeatable success marker file and FileExistsException for 
repeatable deletes would be good to have.
The success marker file was already created with overwrite (fs.create() 
overwrites by default), so there is not much difference here. Will add 
additional tests on duplicated job commit in the v4 patch.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-09 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997796#comment-14997796
 ] 

Junping Du commented on MAPREDUCE-5485:
---

Actually, the test failure of 
TestClientDistributedCacheManager.testDetermineCacheVisibilities is already 
tracked by MAPREDUCE-6533.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-09 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Attachment: MAPREDUCE-5485-v3.1.patch

Uploaded the v3.1 patch to fix minor issues (whitespace, checkstyle, javadoc, 
etc.) reported by Jenkins. The unit test failure in 
TestClientDistributedCacheManager.testDetermineCacheVisibilities is unrelated; 
I will file a separate JIRA to fix it.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-09 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997661#comment-14997661
 ] 

Junping Du commented on MAPREDUCE-5485:
---

Uploaded the v3 patch to address the following comment:
bq. I am not disagreeing with the AM retry in an absolute sense. However, it 
does not seem to belong to this jira and is likely better done as a follow up.
Makes sense. We can split this part (AM retry after commit failure) out.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-09 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Attachment: MAPREDUCE-5485-v3.patch

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-09 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996979#comment-14996979
 ] 

Junping Du commented on MAPREDUCE-5485:
---

bq. doing ++retries here can remove code duplication for the < check in the 
while?
Sorry, I missed this comment in the patch I just uploaded. Will update in the 
next patch.
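As an illustration of the ++retries suggestion, here is a hypothetical bounded 
retry loop (not the code from the patch) in which the counter is incremented 
and checked in a single place. The names attemptsUsed and maxAttempts are 
assumptions, and maxAttempts is assumed to be at least 1.

```java
import java.util.function.BooleanSupplier;

// Hypothetical retry helper, not taken from the MAPREDUCE-5485 patch.
class RetryLoopSketch {
    // Runs op until it reports success or maxAttempts attempts have been
    // made. Returns the number of attempts used, or -1 if every attempt failed.
    static int attemptsUsed(BooleanSupplier op, int maxAttempts) {
        int retries = 0;
        while (true) {
            if (op.getAsBoolean()) {
                return retries + 1;          // succeeded on this attempt
            }
            if (++retries >= maxAttempts) {  // count and bound check together
                return -1;
            }
        }
    }
}
```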

bq. Even for a non-repeatable committer, if there is a classpath issue (which 
can get fixed by retrying the AM) then the AM should retry, right?
I agree this could be a separate topic. However, it would take more time and 
effort to make sure that retrying a non-repeatable committer does not risk 
producing a "successful" commit whose result is wrong and which should have 
failed earlier. For a repeatable committer there seems to be no such risk: it 
may pay the price of an unnecessary retry in some cases, but it gains a better 
chance of succeeding in the commit stage in others, especially since you cannot 
tell which of the two cases you are in. Take a failure to delete the temp 
directory, for example: it could be caused by a problem with the AM's 
connection to HDFS (we should retry) or by HDFS being down permanently (we 
shouldn't retry). I prefer the current trade-off: simple, best-effort commit 
success in the repeatable case.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may 
> cause the committing AM to exit because its container expires. In these 
> cases, the job fails.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether redoing the commit is allowed is a 
> better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-09 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Attachment: MAPREDUCE-5485-v2.patch

Updated the patch according to the review comments from Bikas.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch
>
>
> MRAppMaster may crash during job commit, or a NodeManager restart may cause 
> the committing AM to exit because its container expired. In these cases 
> the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better 
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Updated] (MAPREDUCE-6096) SummarizedJob class NPEs with some jhist files

2015-11-06 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6096:
--
Assignee: zhangyubiao

> SummarizedJob class NPEs with some jhist files
> --
>
> Key: MAPREDUCE-6096
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6096
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Reporter: zhangyubiao
>Assignee: zhangyubiao
>  Labels: easyfix, patch
> Attachments: MAPREDUCE-6096-v8.patch, 
> job_1446203652278_66705-1446308686422-dd_edw-insert+overwrite+table+bkactiv...dp%3D%27ACTIVE%27%28Stage-1446308802181-233-0-SUCCEEDED-bdp_jdw_corejob.jhist
>
>
> When I parse the job history in a history file, I use the 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser class and 
> HistoryViewer$SummarizedJob from Hadoop's map-reduce-client-core project to 
> parse the job history file (for example 
> job_1408862281971_489761-1410883171851_XXX.jhist), 
> and it throws an exception like this: 
> Exception in thread "pool-1-thread-1" java.lang.NullPointerException
>   at 
> org.apache.hadoop.mapreduce.jobhistory.HistoryViewer$SummarizedJob.<init>(HistoryViewer.java:626)
>   at 
> com.jd.hadoop.log.parse.ParseLogService.getJobDetail(ParseLogService.java:70)
> After looking at the SummarizedJob class, I found that attempt.getTaskStatus() 
> is null, so I changed 
> attempt.getTaskStatus().equals(TaskStatus.State.FAILED.toString()) to 
> TaskStatus.State.FAILED.toString().equals(attempt.getTaskStatus()) 
> and it works well.
> So I wonder whether we can move every attempt.getTaskStatus() call to after 
> TaskStatus.State.XXX.toString() in these comparisons?
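
The reordering described above can be sketched as follows. This is only an illustration: the `State` enum and the plain `String` argument are hypothetical stand-ins for `TaskStatus.State` and `attempt.getTaskStatus()`, not the actual HistoryViewer code.

```java
final class NullSafeStatusCheck {
    // Hypothetical stand-in for TaskStatus.State.
    enum State { SUCCEEDED, FAILED, KILLED }

    // Constant-first comparison: String.equals(null) returns false instead of throwing.
    static boolean isFailed(String taskStatus) {
        return State.FAILED.toString().equals(taskStatus);
    }

    // Original ordering: throws NullPointerException when taskStatus is null.
    static boolean isFailedUnsafe(String taskStatus) {
        return taskStatus.equals(State.FAILED.toString());
    }

    public static void main(String[] args) {
        System.out.println(isFailed(null));
        System.out.println(isFailed("FAILED"));
    }
}
```

With the constant on the left, a missing task status is simply treated as "not FAILED"; whether that is the right interpretation for a given jhist file is a separate question.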





[jira] [Updated] (MAPREDUCE-6096) SummarizedJob class NPEs with some jhist files

2015-11-06 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6096:
--
Labels: easyfix patch  (was: BB2015-05-TBR easyfix patch)

> SummarizedJob class NPEs with some jhist files
> --
>
> Key: MAPREDUCE-6096
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6096
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Reporter: zhangyubiao
>  Labels: easyfix, patch
> Attachments: MAPREDUCE-6096-v8.patch, 
> job_1446203652278_66705-1446308686422-dd_edw-insert+overwrite+table+bkactiv...dp%3D%27ACTIVE%27%28Stage-1446308802181-233-0-SUCCEEDED-bdp_jdw_corejob.jhist
>
>
> When I parse the job history in a history file, I use the 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser class and 
> HistoryViewer$SummarizedJob from Hadoop's map-reduce-client-core project to 
> parse the job history file (for example 
> job_1408862281971_489761-1410883171851_XXX.jhist), 
> and it throws an exception like this: 
> Exception in thread "pool-1-thread-1" java.lang.NullPointerException
>   at 
> org.apache.hadoop.mapreduce.jobhistory.HistoryViewer$SummarizedJob.<init>(HistoryViewer.java:626)
>   at 
> com.jd.hadoop.log.parse.ParseLogService.getJobDetail(ParseLogService.java:70)
> After looking at the SummarizedJob class, I found that attempt.getTaskStatus() 
> is null, so I changed 
> attempt.getTaskStatus().equals(TaskStatus.State.FAILED.toString()) to 
> TaskStatus.State.FAILED.toString().equals(attempt.getTaskStatus()) 
> and it works well.
> So I wonder whether we can move every attempt.getTaskStatus() call to after 
> TaskStatus.State.XXX.toString() in these comparisons?





[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-05 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992102#comment-14992102
 ] 

Junping Du commented on MAPREDUCE-5485:
---

Thanks Bikas for the review and comments!
bq. If the commit actually failed then there does not seem any reason to assume 
that retrying it will succeed. IMO if the commit reports a failure then AM 
should fail. Similarly, if a commit failure file exists (from a previous AM 
version) then the new version of the AM should respect that and fail since the 
commit has been reported to be failed.
There are still failure causes that are specific to one AM, e.g. the previous 
AM could not connect to the FS (FS or another CloudFS), or the committer 
misbehaved because it was loaded incorrectly (due to a classpath problem or 
another defect). I think it makes sense to make a best effort to retry after a 
commit failure (as with other causes of AM failure), given that the commit is 
repeatable and all tasks completed successfully.

bq. Javadoc could be improved. Inline
Yes. I will.

bq. num-retries instead of retries? Also, if its num-retries then default 
should be 0. If its num-attempts then default should be 1.
OK, I will change it to attempts. The default will be 1, which means no retry 
and keeps the behavior consistent with before.

bq. Retry count checking code in the catch block subsumes the check retry count 
check in the while block?
I don't think so. Can you take a look at it again?

bq. The previous operation could delete the path after the if check has 
succeeded. So we probably also need to catch FileNotFoundException exception 
class here and ignore it if repeatableCommit is true.
That's a good point. Will fix it.

bq. Do testcases need an @Test annotation?
No. The test class extends TestCase, so all methods whose names start with 
"test" are executed automatically.

bq. firstTimeFail is probably a more clear name for what its doing - failing on 
the first attempt. Would be good to have a test that version 2 and retry = 1 
will fail also. Testcases missing for specific changes in FileOutputCommitter 
for create/delete operation changes?
Sounds good. Will fix/add later.
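
A minimal sketch of the attempts-based loop discussed above, where a limit of 1 means no retry. The names `CommitAction` and `commitWithAttempts` are hypothetical and are not taken from the patch; the sketch only shows how a single counter check in the catch block can bound the loop.

```java
import java.io.IOException;

final class CommitRetrySketch {
    // Hypothetical stand-in for the real commit work.
    interface CommitAction {
        void run() throws IOException;
    }

    // Runs the action up to maxAttempts times; maxAttempts = 1 means no retry.
    // Returns the number of attempts used, or rethrows the last failure.
    static int commitWithAttempts(CommitAction action, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int attempt = 0;
        while (true) {
            try {
                attempt++;
                action.run();
                return attempt;
            } catch (IOException e) {
                // The only place the limit is checked.
                if (attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        final int[] calls = {0};
        int used = commitWithAttempts(() -> {
            if (calls[0]++ == 0) {
                throw new IOException("first attempt fails");
            }
        }, 2);
        System.out.println("attempts used: " + used);
    }
}
```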



> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch
>
>
> MRAppMaster may crash during job commit, or a NodeManager restart may cause 
> the committing AM to exit because its container expired. In these cases 
> the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better 
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-04 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Priority: Critical  (was: Major)

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch
>
>
> MRAppMaster may crash during job commit, or a NodeManager restart may cause 
> the committing AM to exit because its container expired. In these cases 
> the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better 
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-04 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Target Version/s: 2.6.3, 2.7.3

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch
>
>
> MRAppMaster may crash during job commit, or a NodeManager restart may cause 
> the committing AM to exit because its container expired. In these cases 
> the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better 
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-11-04 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Attachment: MAPREDUCE-5485-v1.patch

Attached a new patch to address Bikas's comments above. It includes:
1. Move the retry logic into committer.commitJob() rather than MRAppMaster.
2. Fail the AM instead of the job when an exception happens during jobCommit, 
if commitJob() is repeatable.
3. Add related unit tests.
I verified that this feature works on a small cluster: after killing the AM 
during the job commit stage, the job continues and succeeds once the AM is 
restarted.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch
>
>
> MRAppMaster may crash during job commit, or a NodeManager restart may cause 
> the committing AM to exit because its container expired. In these cases 
> the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better 
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Commented] (MAPREDUCE-6528) Memory leak for HistoryFileManager.getJobSummary()

2015-10-30 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982852#comment-14982852
 ] 

Junping Du commented on MAPREDUCE-6528:
---

Thanks Jason for reviewing and committing this!

> Memory leak for HistoryFileManager.getJobSummary()
> --
>
> Key: MAPREDUCE-6528
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6528
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Fix For: 2.7.2, 2.6.3
>
> Attachments: MAPREDUCE-6528.patch
>
>
> We hit memory leak issues in the JHS on a large cluster, caused by the code 
> below, which does not release the FSDataInputStream when an exception is 
> thrown. MAPREDUCE-6273 should fix most cases in which exceptions get thrown. 
> However, we still need to fix the memory leak for the occasional remaining 
> case.
> {code} 
> private String getJobSummary(FileContext fc, Path path) throws IOException {
>   Path qPath = fc.makeQualified(path);
>   FSDataInputStream in = fc.open(qPath);
>   String jobSummaryString = in.readUTF();
>   // Not reached if readUTF() throws, so the stream leaks.
>   in.close();
>   return jobSummaryString;
> }
> {code}





[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-10-30 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982375#comment-14982375
 ] 

Junping Du commented on MAPREDUCE-5485:
---

Thanks [~bikassaha] for the comments! I agree it makes more sense to move the 
retry logic into committer.commitJob() if it supports repeatable commits. My 
original thinking was to combine this retry for committer.commitJob() with 
other AM exceptions in handleJobCommit (outside of the committer), such as 
failing to write endCommitSuccessFile. But now I think we should separate the 
committer retry from AM-specific handling for the reason you mentioned above. 
For this case, I would prefer that we just let the AM exit directly instead of 
failing the job (if job commit is repeatable). This is mostly the same as what 
[~nemon] proposed above, with one slight difference: we should fail the AM (not 
the job) even when committer.commitJob() fails after retries, to handle some 
corner cases, i.e. something goes wrong with the committer in this AM but the 
commit still has a chance to succeed in another AM if job commit is repeatable. 
I will update the patch soon.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch
>
>
> MRAppMaster may crash during job commit, or a NodeManager restart may cause 
> the committing AM to exit because its container expired. In these cases 
> the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better 
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Commented] (MAPREDUCE-6528) Memory leak for HistoryFileManager.getJobSummary()

2015-10-30 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982317#comment-14982317
 ] 

Junping Du commented on MAPREDUCE-6528:
---

Thanks [~brahmareddy]! Can someone commit this patch? It is quite 
straightforward.

> Memory leak for HistoryFileManager.getJobSummary()
> --
>
> Key: MAPREDUCE-6528
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6528
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6528.patch
>
>
> We hit memory leak issues in the JHS on a large cluster, caused by the code 
> below, which does not release the FSDataInputStream when an exception is 
> thrown. MAPREDUCE-6273 should fix most cases in which exceptions get thrown. 
> However, we still need to fix the memory leak for the occasional remaining 
> case.
> {code} 
> private String getJobSummary(FileContext fc, Path path) throws IOException {
>   Path qPath = fc.makeQualified(path);
>   FSDataInputStream in = fc.open(qPath);
>   String jobSummaryString = in.readUTF();
>   // Not reached if readUTF() throws, so the stream leaks.
>   in.close();
>   return jobSummaryString;
> }
> {code}





[jira] [Commented] (MAPREDUCE-6528) Memory leak for HistoryFileManager.getJobSummary()

2015-10-30 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982214#comment-14982214
 ] 

Junping Du commented on MAPREDUCE-6528:
---

Good point, Vinod! Let's keep the patch as it is, since try-with-resources is 
not supported in earlier JDK versions.
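
For reference, a minimal sketch of what the try-with-resources form would look like on Java 7 or later. It uses plain `java.io` streams as an assumed stand-in for `FileContext` and `FSDataInputStream`, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

final class TryWithResourcesSketch {
    // Requires Java 7 or later. The stream is closed automatically when the
    // block exits, whether readUTF() returns normally or throws.
    static String readSummary(InputStream raw) throws IOException {
        try (DataInputStream in = new DataInputStream(raw)) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        // readUTF() input format: 2-byte length (here 2) followed by the bytes of "ok".
        byte[] encoded = {0, 2, 'o', 'k'};
        System.out.println(readSummary(new ByteArrayInputStream(encoded)));
    }
}
```

The behavior matches an explicit try/finally with close(); the difference is mainly brevity, which is why it does not help on branches that must still build with older JDKs.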

> Memory leak for HistoryFileManager.getJobSummary()
> --
>
> Key: MAPREDUCE-6528
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6528
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6528.patch
>
>
> We hit memory leak issues in the JHS on a large cluster, caused by the code 
> below, which does not release the FSDataInputStream when an exception is 
> thrown. MAPREDUCE-6273 should fix most cases in which exceptions get thrown. 
> However, we still need to fix the memory leak for the occasional remaining 
> case.
> {code} 
> private String getJobSummary(FileContext fc, Path path) throws IOException {
>   Path qPath = fc.makeQualified(path);
>   FSDataInputStream in = fc.open(qPath);
>   String jobSummaryString = in.readUTF();
>   // Not reached if readUTF() throws, so the stream leaks.
>   in.close();
>   return jobSummaryString;
> }
> {code}





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-10-29 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Attachment: MAPREDUCE-5485-demo-2.patch

Updated the 2nd demo patch, with the following fixes:
1. Make sure FileOutputCommitter is repeatable only when using algorithm 2 
(algorithm 1 is not supported yet).
2. Make the temporary directory delete operation idempotent by treating the 
delete as successful when the directory does not exist (since it may have been 
deleted by the first AM).
3. Make the SUCCESS file marker creation operation idempotent by allowing for 
the file to exist (since it may have been created by the first AM).
Testing is still ongoing; I will add unit tests in the next patch.
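
The two idempotent operations in items 2 and 3 can be sketched roughly as below. This is an illustration on the local file system with `java.nio.file`, used here as an assumed stand-in for the Hadoop FileSystem calls in FileOutputCommitter; the class and method names are hypothetical, and the directory is empty so a plain delete is enough.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

final class IdempotentCommitStepsSketch {
    // Deleting an already-missing (empty) temporary directory counts as success.
    static void deleteTempDir(Path tempDir) throws IOException {
        try {
            Files.delete(tempDir);
        } catch (NoSuchFileException e) {
            // Already removed, possibly by an earlier attempt.
        }
    }

    // Creating a marker file that already exists counts as success.
    static void createSuccessMarker(Path marker) throws IOException {
        try {
            Files.createFile(marker);
        } catch (FileAlreadyExistsException e) {
            // Already created, possibly by an earlier attempt.
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("commit-sketch");
        Path temp = Files.createDirectory(out.resolve("_temporary"));
        Path marker = out.resolve("_SUCCESS");
        // Running both steps twice simulates a second AM repeating the commit.
        for (int attempt = 1; attempt <= 2; attempt++) {
            deleteTempDir(temp);
            createSuccessMarker(marker);
        }
        System.out.println(Files.exists(marker) && !Files.exists(temp));
    }
}
```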

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
> Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch
>
>
> MRAppMaster may crash during job commit, or a NodeManager restart may cause 
> the committing AM to exit because its container expired. In these cases 
> the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better 
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Commented] (MAPREDUCE-6528) Memory leak for HistoryFileManager.getJobSummary()

2015-10-29 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980892#comment-14980892
 ] 

Junping Du commented on MAPREDUCE-6528:
---

bq. Thanks for reporting this..can you use try-with-resources..?
Given that the code with the finally block is already there, is there any 
advantage to try-with-resources?

> Memory leak for HistoryFileManager.getJobSummary()
> --
>
> Key: MAPREDUCE-6528
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6528
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6528.patch
>
>
> We hit memory leak issues in the JHS on a large cluster, caused by the code 
> below, which does not release the FSDataInputStream when an exception is 
> thrown. MAPREDUCE-6273 should fix most cases in which exceptions get thrown. 
> However, we still need to fix the memory leak for the occasional remaining 
> case.
> {code} 
> private String getJobSummary(FileContext fc, Path path) throws IOException {
>   Path qPath = fc.makeQualified(path);
>   FSDataInputStream in = fc.open(qPath);
>   String jobSummaryString = in.readUTF();
>   // Not reached if readUTF() throws, so the stream leaks.
>   in.close();
>   return jobSummaryString;
> }
> {code}





[jira] [Updated] (MAPREDUCE-6528) Memory leak for HistoryFileManager.getJobSummary()

2015-10-29 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6528:
--
Status: Patch Available  (was: Open)

> Memory leak for HistoryFileManager.getJobSummary()
> --
>
> Key: MAPREDUCE-6528
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6528
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6528.patch
>
>
> We hit memory leak issues in the JHS on a large cluster, caused by the code 
> below, which does not release the FSDataInputStream when an exception is 
> thrown. MAPREDUCE-6273 should fix most cases in which exceptions get thrown. 
> However, we still need to fix the memory leak for the occasional remaining 
> case.
> {code} 
> private String getJobSummary(FileContext fc, Path path) throws IOException {
>   Path qPath = fc.makeQualified(path);
>   FSDataInputStream in = fc.open(qPath);
>   String jobSummaryString = in.readUTF();
>   // Not reached if readUTF() throws, so the stream leaks.
>   in.close();
>   return jobSummaryString;
> }
> {code}





[jira] [Updated] (MAPREDUCE-6528) Memory leak for HistoryFileManager.getJobSummary()

2015-10-29 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6528:
--
Attachment: MAPREDUCE-6528.patch

Attached a simple patch to fix it. The fix is quite straightforward, so there 
is no need for a unit test. 

> Memory leak for HistoryFileManager.getJobSummary()
> --
>
> Key: MAPREDUCE-6528
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6528
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6528.patch
>
>
> We hit memory leak issues in the JHS on a large cluster, caused by the code 
> below, which does not release the FSDataInputStream when an exception is 
> thrown. MAPREDUCE-6273 should fix most cases in which exceptions get thrown. 
> However, we still need to fix the memory leak for the occasional remaining 
> case.
> {code} 
> private String getJobSummary(FileContext fc, Path path) throws IOException {
>   Path qPath = fc.makeQualified(path);
>   FSDataInputStream in = fc.open(qPath);
>   String jobSummaryString = in.readUTF();
>   // Not reached if readUTF() throws, so the stream leaks.
>   in.close();
>   return jobSummaryString;
> }
> {code}





[jira] [Created] (MAPREDUCE-6528) Memory leak for HistoryFileManager.getJobSummary()

2015-10-29 Thread Junping Du (JIRA)
Junping Du created MAPREDUCE-6528:
-

 Summary: Memory leak for HistoryFileManager.getJobSummary()
 Key: MAPREDUCE-6528
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6528
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


We hit memory leak issues in the JHS on a large cluster, caused by the code 
below, which does not release the FSDataInputStream when an exception is 
thrown. MAPREDUCE-6273 should fix most cases in which exceptions get thrown. 
However, we still need to fix the memory leak for the occasional remaining case.

{code} 
private String getJobSummary(FileContext fc, Path path) throws IOException {
  Path qPath = fc.makeQualified(path);
  FSDataInputStream in = fc.open(qPath);
  String jobSummaryString = in.readUTF();
  // Not reached if readUTF() throws, so the stream leaks.
  in.close();
  return jobSummaryString;
}
{code}
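
One way to release the stream on the exception path is a try/finally around the read. The sketch below is not the attached patch: it uses plain `java.io` streams as an assumed stand-in for `FileContext` and `FSDataInputStream`, and `TrackingStream` and `readSummary` are hypothetical names added only so the close can be observed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class JobSummaryReadSketch {
    // Records whether close() was called, so the example can show the stream is released.
    static final class TrackingStream extends FilterInputStream {
        boolean closed = false;

        TrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Reads one UTF string and closes the stream even if readUTF() throws.
    static String readSummary(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        try {
            return in.readUTF();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeUTF("jobSummary");
        TrackingStream ok = new TrackingStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(readSummary(ok) + " closed=" + ok.closed);

        // An empty stream makes readUTF() throw EOFException; the stream is still closed.
        TrackingStream bad = new TrackingStream(new ByteArrayInputStream(new byte[0]));
        try {
            readSummary(bad);
        } catch (IOException e) {
            System.out.println("read failed, closed=" + bad.closed);
        }
    }
}
```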





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-10-26 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Status: Patch Available  (was: Open)

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
> Attachments: MAPREDUCE-5485-demo.patch
>
>
> MRAppMaster may crash during job commit, or a NodeManager restart may cause 
> the committing AM to exit because its container expired. In these cases 
> the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better 
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Resolved] (MAPREDUCE-6201) TestNetworkedJob fails on trunk

2015-10-23 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved MAPREDUCE-6201.
---
Resolution: Cannot Reproduce

> TestNetworkedJob fails on trunk
> ---
>
> Key: MAPREDUCE-6201
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6201
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Robert Kanter
>Assignee: Brahma Reddy Battula
>
> Currently, {{TestNetworkedJob}} is failing on trunk:
> {noformat}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.01 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 67.363 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:195)
> {noformat}





[jira] [Commented] (MAPREDUCE-6201) TestNetworkedJob fails on trunk

2015-10-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971415#comment-14971415
 ] 

Junping Du commented on MAPREDUCE-6201:
---

Oh, my bad. I will roll this back to Cannot Reproduce. Thanks for pointing that 
out, [~brahmareddy]!

> TestNetworkedJob fails on trunk
> ---
>
> Key: MAPREDUCE-6201
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6201
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Robert Kanter
>Assignee: Brahma Reddy Battula
>
> Currently, {{TestNetworkedJob}} is failing on trunk:
> {noformat}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.01 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 67.363 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:195)
> {noformat}





[jira] [Reopened] (MAPREDUCE-6201) TestNetworkedJob fails on trunk

2015-10-23 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du reopened MAPREDUCE-6201:
---

> TestNetworkedJob fails on trunk
> ---
>
> Key: MAPREDUCE-6201
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6201
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Robert Kanter
>Assignee: Brahma Reddy Battula
>
> Currently, {{TestNetworkedJob}} is failing on trunk:
> {noformat}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.01 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 67.363 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:195)
> {noformat}





[jira] [Updated] (MAPREDUCE-6508) TestNetworkedJob fails consistently due to delegation token changes on RM.

2015-10-23 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6508:
--
   Resolution: Fixed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

I have committed the latest patch to trunk and branch-2. Thanks [~ajisakaa] for 
delivering the patch!

> TestNetworkedJob fails consistently due to delegation token changes on RM.
> --
>
> Key: MAPREDUCE-6508
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6508
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-6508.00.patch, MAPREDUCE-6508.01.patch
>
>
> {noformat}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 84.215 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 31.537 sec  <<< ERROR!
> java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.io.IOException: Delegation Token can be issued only with kerberos 
> authentication
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1044)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getDelegationToken(ApplicationClientProtocolPBServiceImpl.java:325)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2236)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2232)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2230)
> Caused by: java.io.IOException: Delegation Token can be issued only with 
> kerberos authentication
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1017)
>   ... 10 more
>   at org.apache.hadoop.ipc.Client.call(Client.java:1448)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1379)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>   at com.sun.proxy.$Proxy84.getDelegationToken(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getDelegationToken(ApplicationClientProtocolPBClientImpl.java:339)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:251)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>   at com.sun.proxy.$Proxy85.getDelegationToken(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getRMDelegationToken(YarnClientImpl.java:541)
>   at 
> org.apache.hadoop.mapred.ResourceMgrDelegate.getDelegationToken(ResourceMgrDelegate.java:177)
>   at 
> org.apache.hadoop.mapred.YARNRunner.getDelegationToken(YARNRunner.java:231)
>   at 
> org.apache.hadoop.mapreduce.Cluster.getDelegationToken(Cluster.java:401)
>   at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1234)
>   at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1231)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at 
> org.apache.hadoop.mapred.JobClient.getDelegationToken(JobClient.java:1230)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:260)
> {noformat}





[jira] [Updated] (MAPREDUCE-6508) TestNetworkedJob fails consistently due to delegation token changes on RM.

2015-10-23 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6508:
--
Summary: TestNetworkedJob fails consistently due to delegation token 
changes on RM.  (was: TestNetworkedJob fails intermittently)

> TestNetworkedJob fails consistently due to delegation token changes on RM.
> --
>
> Key: MAPREDUCE-6508
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6508
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
> Attachments: MAPREDUCE-6508.00.patch, MAPREDUCE-6508.01.patch
>
>
> {noformat}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 84.215 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 31.537 sec  <<< ERROR!
> java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.io.IOException: Delegation Token can be issued only with kerberos 
> authentication
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1044)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getDelegationToken(ApplicationClientProtocolPBServiceImpl.java:325)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2236)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2232)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2230)
> Caused by: java.io.IOException: Delegation Token can be issued only with 
> kerberos authentication
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1017)
>   ... 10 more
>   at org.apache.hadoop.ipc.Client.call(Client.java:1448)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1379)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>   at com.sun.proxy.$Proxy84.getDelegationToken(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getDelegationToken(ApplicationClientProtocolPBClientImpl.java:339)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:251)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>   at com.sun.proxy.$Proxy85.getDelegationToken(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getRMDelegationToken(YarnClientImpl.java:541)
>   at 
> org.apache.hadoop.mapred.ResourceMgrDelegate.getDelegationToken(ResourceMgrDelegate.java:177)
>   at 
> org.apache.hadoop.mapred.YARNRunner.getDelegationToken(YARNRunner.java:231)
>   at 
> org.apache.hadoop.mapreduce.Cluster.getDelegationToken(Cluster.java:401)
>   at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1234)
>   at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1231)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at 
> org.apache.hadoop.mapred.JobClient.getDelegationToken(JobClient.java:1230)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:260)
> {noformat}





[jira] [Resolved] (MAPREDUCE-6201) TestNetworkedJob fails on trunk

2015-10-23 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved MAPREDUCE-6201.
---
Resolution: Duplicate

> TestNetworkedJob fails on trunk
> ---
>
> Key: MAPREDUCE-6201
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6201
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Robert Kanter
>Assignee: Brahma Reddy Battula
>
> Currently, {{TestNetworkedJob}} is failing on trunk:
> {noformat}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.01 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 67.363 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:195)
> {noformat}





[jira] [Reopened] (MAPREDUCE-6201) TestNetworkedJob fails on trunk

2015-10-23 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du reopened MAPREDUCE-6201:
---

I believe TestNetworkedJob still fails consistently on trunk, so we 
shouldn't close this JIRA as "Cannot Reproduce". We should resolve it as a 
duplicate of MAPREDUCE-6508, which already has a patch that should go in soon.

> TestNetworkedJob fails on trunk
> ---
>
> Key: MAPREDUCE-6201
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6201
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Robert Kanter
>Assignee: Brahma Reddy Battula
>
> Currently, {{TestNetworkedJob}} is failing on trunk:
> {noformat}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.01 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 67.363 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:195)
> {noformat}





[jira] [Commented] (MAPREDUCE-6508) TestNetworkedJob fails intermittently

2015-10-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971204#comment-14971204
 ] 

Junping Du commented on MAPREDUCE-6508:
---

Thanks for your reply, [~ajisakaa]! I agree it is right to remove the code in 
the unit test that makes the test fail consistently. The word "intermittently" 
in the JIRA title is a little misleading, and I will correct it soon.
Latest patch (01) LGTM. +1. I will commit it shortly.
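
For readers following the quoted stack trace: the failure comes from asking for a delegation token on a test cluster that is not using Kerberos. The patch under discussion removes the offending test code. The sketch below only illustrates the failure mode and one way a caller could avoid it. All types here (AuthMethod, FakeTokenService, DelegationTokenGuardDemo) are hypothetical stand-ins written for this note, not Hadoop classes.

```java
import java.io.IOException;

// Hypothetical stand-ins for illustration only; these are not Hadoop classes.
enum AuthMethod { SIMPLE, KERBEROS }

class FakeTokenService {
    // Mirrors the behaviour reported in the stack trace: a token is refused
    // unless the caller authenticated with Kerberos.
    static String getDelegationToken(AuthMethod auth, String renewer) throws IOException {
        if (auth != AuthMethod.KERBEROS) {
            throw new IOException(
                "Delegation Token can be issued only with kerberos authentication");
        }
        return "token-for-" + renewer;
    }
}

class DelegationTokenGuardDemo {
    // Returns the token, or null when the cluster is unsecured and the call is skipped.
    static String tokenIfSecure(AuthMethod auth, String renewer) throws IOException {
        if (auth != AuthMethod.KERBEROS) {
            return null; // skip the call instead of letting it fail
        }
        return FakeTokenService.getDelegationToken(auth, renewer);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenIfSecure(AuthMethod.SIMPLE, "renewer"));
        System.out.println(tokenIfSecure(AuthMethod.KERBEROS, "renewer"));
    }
}
```

A test that runs against an unsecured mini cluster would take the first branch and never reach the token call.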


> TestNetworkedJob fails intermittently
> -
>
> Key: MAPREDUCE-6508
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6508
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
> Attachments: MAPREDUCE-6508.00.patch, MAPREDUCE-6508.01.patch
>
>
> {noformat}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 84.215 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 31.537 sec  <<< ERROR!
> java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.io.IOException: Delegation Token can be issued only with kerberos 
> authentication
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1044)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getDelegationToken(ApplicationClientProtocolPBServiceImpl.java:325)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2236)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2232)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2230)
> Caused by: java.io.IOException: Delegation Token can be issued only with 
> kerberos authentication
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1017)
>   ... 10 more
>   at org.apache.hadoop.ipc.Client.call(Client.java:1448)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1379)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>   at com.sun.proxy.$Proxy84.getDelegationToken(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getDelegationToken(ApplicationClientProtocolPBClientImpl.java:339)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:251)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>   at com.sun.proxy.$Proxy85.getDelegationToken(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getRMDelegationToken(YarnClientImpl.java:541)
>   at 
> org.apache.hadoop.mapred.ResourceMgrDelegate.getDelegationToken(ResourceMgrDelegate.java:177)
>   at 
> org.apache.hadoop.mapred.YARNRunner.getDelegationToken(YARNRunner.java:231)
>   at 
> org.apache.hadoop.mapreduce.Cluster.getDelegationToken(Cluster.java:401)
>   at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1234)
>   at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1231)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at 
> org.apache.hadoop.mapred.JobClient.getDelegationToken(JobClient.java:1230)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:260)
> {noformat}





[jira] [Commented] (MAPREDUCE-6449) MR Code should not throw and catch YarnRuntimeException to communicate internal exceptions

2015-10-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971033#comment-14971033
 ] 

Junping Du commented on MAPREDUCE-6449:
---

Does this patch break our MR rolling upgrade story, in which old and new MR 
jobs can coexist in a single cluster during an upgrade? At least the changes 
on the hs part look like they do. If so, I would be very concerned about this.

> MR Code should not throw and catch YarnRuntimeException to communicate 
> internal exceptions
> --
>
> Key: MAPREDUCE-6449
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6449
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Neelesh Srinivas Salian
>  Labels: mapreduce
> Attachments: MAPREDUCE-6449.001.patch, MAPREDUCE-6449.002.patch, 
> MAPREDUCE-6499-prelim.patch
>
>
> In discussion of MAPREDUCE-6439 we discussed how throwing and catching 
> YarnRuntimeException in MR code is incorrect and we should instead use some 
> MR specific exception.
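
As a rough illustration of the idea in the description above, the sketch below wraps a low-level failure in an MR-specific checked exception instead of a generic runtime exception. The names MRInternalException, HistoryEventWriter and MRExceptionDemo are hypothetical and are not taken from the attached patches.

```java
import java.io.IOException;

// Hypothetical MR-specific exception; the name is an assumption for this sketch.
class MRInternalException extends Exception {
    MRInternalException(String message, Throwable cause) {
        super(message, cause);
    }
}

class HistoryEventWriter {
    // Wraps a low-level I/O failure in the MR-specific checked exception so
    // callers must handle it explicitly.
    static void write(boolean failing) throws MRInternalException {
        try {
            if (failing) {
                throw new IOException("simulated write failure");
            }
        } catch (IOException e) {
            throw new MRInternalException("could not write history event", e);
        }
    }
}

class MRExceptionDemo {
    // The caller catches only the MR-specific type; unrelated runtime
    // exceptions are not swallowed by accident.
    static String handle(boolean failing) {
        try {
            HistoryEventWriter.write(failing);
            return "ok";
        } catch (MRInternalException e) {
            return "handled: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(handle(false));
        System.out.println(handle(true));
    }
}
```

A checked, module-specific type makes the internal signalling visible in method signatures, which is one way to read the proposal.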





[jira] [Commented] (MAPREDUCE-6201) TestNetworkedJob fails on trunk

2015-10-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970991#comment-14970991
 ] 

Junping Du commented on MAPREDUCE-6201:
---

MAPREDUCE-6508 is still open for tracking failures of TestNetworkedJob. I think 
the test is still failing intermittently, isn't it?

> TestNetworkedJob fails on trunk
> ---
>
> Key: MAPREDUCE-6201
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6201
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Robert Kanter
>Assignee: Brahma Reddy Battula
>
> Currently, {{TestNetworkedJob}} is failing on trunk:
> {noformat}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.01 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 67.363 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:195)
> {noformat}





[jira] [Commented] (MAPREDUCE-6508) TestNetworkedJob fails intermittently

2015-10-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970973#comment-14970973
 ] 

Junping Du commented on MAPREDUCE-6508:
---

Hi [~ajisakaa], thanks for the patch. However, have we figured out why the 
job fails intermittently? Maybe we should try to understand and fix the test 
problem rather than remove the test code?

> TestNetworkedJob fails intermittently
> -
>
> Key: MAPREDUCE-6508
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6508
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
> Attachments: MAPREDUCE-6508.00.patch, MAPREDUCE-6508.01.patch
>
>
> {noformat}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 84.215 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 31.537 sec  <<< ERROR!
> java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.io.IOException: Delegation Token can be issued only with kerberos 
> authentication
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1044)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getDelegationToken(ApplicationClientProtocolPBServiceImpl.java:325)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2236)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2232)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2230)
> Caused by: java.io.IOException: Delegation Token can be issued only with 
> kerberos authentication
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1017)
>   ... 10 more
>   at org.apache.hadoop.ipc.Client.call(Client.java:1448)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1379)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>   at com.sun.proxy.$Proxy84.getDelegationToken(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getDelegationToken(ApplicationClientProtocolPBClientImpl.java:339)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:251)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>   at com.sun.proxy.$Proxy85.getDelegationToken(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getRMDelegationToken(YarnClientImpl.java:541)
>   at 
> org.apache.hadoop.mapred.ResourceMgrDelegate.getDelegationToken(ResourceMgrDelegate.java:177)
>   at 
> org.apache.hadoop.mapred.YARNRunner.getDelegationToken(YARNRunner.java:231)
>   at 
> org.apache.hadoop.mapreduce.Cluster.getDelegationToken(Cluster.java:401)
>   at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1234)
>   at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1231)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at 
> org.apache.hadoop.mapred.JobClient.getDelegationToken(JobClient.java:1230)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:260)
> {noformat}





[jira] [Updated] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-10-22 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5485:
--
Attachment: MAPREDUCE-5485-demo.patch

Uploaded a demo patch first. More unit tests will be added later. 
BTW, it adopts some code from MAPREDUCE-5718, which has a similar purpose, so 
please share the credit with the contributors of MAPREDUCE-5718 if we commit 
the follow-up patches of this JIRA in the future.

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>Assignee: Junping Du
> Attachments: MAPREDUCE-5485-demo.patch
>
>
> There is a chance that the MRAppMaster crashes during job commit, or that a 
> NodeManager restart causes the committing AM to exit because its container 
> expired. In these cases, the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better 
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 
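
The recovery decision described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the demo patch: SketchCommitter, isCommitRepeatable, JobOutcome and CommitRecovery are hypothetical names, and the two boolean flags stand in for whatever markers a previous AM attempt leaves behind.

```java
// Hypothetical types for illustration; they are not the OutputCommitter API.
abstract class SketchCommitter {
    int commits = 0;

    // Performs the job commit; counts invocations so the demo can observe retries.
    void commitJob() {
        commits++;
    }

    // Committers whose commit is safe to run twice override this to return true.
    boolean isCommitRepeatable() {
        return false;
    }
}

class OneShotCommitter extends SketchCommitter { }

class RepeatableCommitter extends SketchCommitter {
    @Override
    boolean isCommitRepeatable() {
        return true;
    }
}

enum JobOutcome { SUCCEEDED, FAILED }

class CommitRecovery {
    // Called by a restarted AM attempt. commitStarted and commitFinished
    // describe what the previous attempt left behind.
    static JobOutcome recover(SketchCommitter committer,
                              boolean commitStarted,
                              boolean commitFinished) {
        if (commitFinished) {
            return JobOutcome.SUCCEEDED; // previous attempt completed the commit
        }
        if (commitStarted && !committer.isCommitRepeatable()) {
            return JobOutcome.FAILED; // partial commit that cannot be redone
        }
        committer.commitJob(); // first commit, or a safe repeat
        return JobOutcome.SUCCEEDED;
    }

    public static void main(String[] args) {
        System.out.println(recover(new OneShotCommitter(), true, false));
        System.out.println(recover(new RepeatableCommitter(), true, false));
    }
}
```

Defaulting isCommitRepeatable to false keeps the existing behaviour for committers that do not opt in.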





[jira] [Commented] (MAPREDUCE-5485) Allow repeating job commit by extending OutputCommitter API

2015-10-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955012#comment-14955012
 ] 

Junping Du commented on MAPREDUCE-5485:
---

The proposal above sounds good to me. [~nemon], thanks for filing this JIRA 
which is quite useful in some scenarios. If you don't mind, I would like to 
work on it and move it forward. Thanks!

> Allow repeating job commit by extending OutputCommitter API
> ---
>
> Key: MAPREDUCE-5485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.1.0-beta
>Reporter: Nemon Lou
>
> There is a chance that the MRAppMaster crashes during job commit, or that a 
> NodeManager restart causes the committing AM to exit because its container 
> expired. In these cases, the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better 
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 





[jira] [Commented] (MAPREDUCE-6478) Add an option to skip cleanupJob stage or ignore cleanup failure during commitJob().

2015-09-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876054#comment-14876054
 ] 

Junping Du commented on MAPREDUCE-6478:
---

Thanks [~leftnoteasy] for review and commit!

> Add an option to skip cleanupJob stage or ignore cleanup failure during 
> commitJob().
> 
>
> Key: MAPREDUCE-6478
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6478
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Junping Du
>Assignee: Junping Du
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-6478-v1.1.patch, MAPREDUCE-6478-v1.patch
>
>
> In some of our test cases for MR in public cloud scenarios, a very big MR job 
> with hundreds or thousands of reducers cannot finish successfully because of 
> job cleanup failures. These are caused by the different scale/performance 
> characteristics of cloud file systems (like AzureFS), which replace HDFS's 
> deletion of a whole directory with REST API calls that delete each 
> sub-directory recursively. Even when cleanup succeeds, it can take much 
> longer (hours), which is unnecessary and wastes time/resources, especially in 
> public cloud scenarios. 
> In these scenarios, it makes more sense to ignore some cleanupJob failures or 
> to let the user skip cleanupJob() completely. Letting the whole job finish 
> successfully at the cost of wasting some user space is much better, because 
> user jobs usually come and go in the public cloud. Tolerating some leftover 
> temporary files in exchange for avoiding a big job re-run (or for a shorter 
> job running time) is quite effective in time/resource cost. 
> We should give the user this option (ignore failures or skip the job cleanup 
> stage completely), especially when the user knows the cleanup failure is not 
> due to an abnormal HDFS status but to another FS's different performance 
> trade-off.





[jira] [Updated] (MAPREDUCE-6478) Add an option to skip cleanupJob stage or ignore cleanup failure during commitJob().

2015-09-16 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6478:
--
Attachment: MAPREDUCE-6478-v1.1.patch

Fix whitespace issue in the v1.1 patch.

> Add an option to skip cleanupJob stage or ignore cleanup failure during 
> commitJob().
> 
>
> Key: MAPREDUCE-6478
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6478
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: MAPREDUCE-6478-v1.1.patch, MAPREDUCE-6478-v1.patch
>
>
> In some of our test cases for MR in public cloud scenarios, a very big MR job 
> with hundreds or thousands of reducers cannot finish successfully because of 
> job cleanup failures. These are caused by the different scale/performance 
> characteristics of cloud file systems (like AzureFS), which replace HDFS's 
> deletion of a whole directory with REST API calls that delete each 
> sub-directory recursively. Even when cleanup succeeds, it can take much 
> longer (hours), which is unnecessary and wastes time/resources, especially in 
> public cloud scenarios. 
> In these scenarios, it makes more sense to ignore some cleanupJob failures or 
> to let the user skip cleanupJob() completely. Letting the whole job finish 
> successfully at the cost of wasting some user space is much better, because 
> user jobs usually come and go in the public cloud. Tolerating some leftover 
> temporary files in exchange for avoiding a big job re-run (or for a shorter 
> job running time) is quite effective in time/resource cost. 
> We should give the user this option (ignore failures or skip the job cleanup 
> stage completely), especially when the user knows the cleanup failure is not 
> due to an abnormal HDFS status but to another FS's different performance 
> trade-off.





[jira] [Updated] (MAPREDUCE-6478) Add an option to skip cleanupJob stage or ignore cleanup failure during commitJob().

2015-09-15 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6478:
--
Status: Patch Available  (was: Open)

> Add an option to skip cleanupJob stage or ignore cleanup failure during 
> commitJob().
> 
>
> Key: MAPREDUCE-6478
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6478
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: MAPREDUCE-6478-v1.patch
>
>
> In some of our test cases for MR in public cloud scenarios, a very big MR job 
> with hundreds or thousands of reducers cannot finish successfully because of 
> job cleanup failures. These are caused by the different scale/performance 
> characteristics of cloud file systems (like AzureFS), which replace HDFS's 
> deletion of a whole directory with REST API calls that delete each 
> sub-directory recursively. Even when cleanup succeeds, it can take much 
> longer (hours), which is unnecessary and wastes time/resources, especially in 
> public cloud scenarios. 
> In these scenarios, it makes more sense to ignore some cleanupJob failures or 
> to let the user skip cleanupJob() completely. Letting the whole job finish 
> successfully at the cost of wasting some user space is much better, because 
> user jobs usually come and go in the public cloud. Tolerating some leftover 
> temporary files in exchange for avoiding a big job re-run (or for a shorter 
> job running time) is quite effective in time/resource cost. 
> We should give the user this option (ignore failures or skip the job cleanup 
> stage completely), especially when the user knows the cleanup failure is not 
> due to an abnormal HDFS status but to another FS's different performance 
> trade-off.





[jira] [Updated] (MAPREDUCE-6478) Add an option to skip cleanupJob stage or ignore cleanup failure during commitJob().

2015-09-15 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6478:
--
Attachment: MAPREDUCE-6478-v1.patch

Attached a quick patch that adds two configurations: one to skip cleanupJob 
and one to ignore cleanupJob failures. The change is quite straightforward, so 
a unit test is unnecessary here.
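
For illustration only, a minimal sketch of how two such options could gate the 
cleanup step. This is not the attached patch: the class name, the property 
names (example.cleanup.skipped, example.cleanup-failures.ignored) and the 
Cleanup interface are hypothetical, and a plain Map stands in for Hadoop's 
Configuration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the patch. */
final class CleanupPolicySketch {
    // Assumed property names, for illustration only.
    static final String SKIP_CLEANUP = "example.cleanup.skipped";
    static final String IGNORE_CLEANUP_FAILURES = "example.cleanup-failures.ignored";

    /** Stand-in for the real cleanup work, e.g. deleting a temporary directory. */
    interface Cleanup {
        void run() throws IOException;
    }

    enum Outcome { SKIPPED, CLEANED, FAILURE_IGNORED }

    /** Runs cleanup unless it is skipped; swallows failures only when asked to. */
    static Outcome commit(Map<String, String> conf, Cleanup cleanup) {
        if (Boolean.parseBoolean(conf.getOrDefault(SKIP_CLEANUP, "false"))) {
            return Outcome.SKIPPED;
        }
        try {
            cleanup.run();
            return Outcome.CLEANED;
        } catch (IOException e) {
            if (Boolean.parseBoolean(conf.getOrDefault(IGNORE_CLEANUP_FAILURES, "false"))) {
                return Outcome.FAILURE_IGNORED;
            }
            // Default behaviour: a cleanup failure still fails the commit.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Cleanup failing = () -> { throw new IOException("delete timed out"); };
        System.out.println(commit(Collections.singletonMap(SKIP_CLEANUP, "true"), failing));
        System.out.println(commit(Collections.singletonMap(IGNORE_CLEANUP_FAILURES, "true"), failing));
    }
}
```

Both options default to false in this sketch, so existing jobs would keep the 
current behaviour unless the user opts in.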

> Add an option to skip cleanupJob stage or ignore cleanup failure during 
> commitJob().
> 
>
> Key: MAPREDUCE-6478
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6478
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: MAPREDUCE-6478-v1.patch
>
>
> In some of our test cases for MR in public cloud scenarios, a very big MR job 
> with hundreds or thousands of reducers cannot finish successfully because of 
> job cleanup failures. These are caused by the different scale/performance 
> characteristics of cloud file systems (like AzureFS), which replace HDFS's 
> whole-directory deletion with REST API calls that delete each sub-directory 
> recursively. Even when cleanup succeeds, it can take much longer (hours), 
> which is unnecessary and wastes time and resources, especially in a public 
> cloud scenario. 
> In these scenarios, it makes more sense to ignore some cleanupJob failures, 
> or to let the user skip cleanupJob() completely. Letting the whole job finish 
> successfully, at the cost of wasting some user space, is much better because 
> users' jobs usually come and go in a public cloud; tolerating some leftover 
> temporary files in exchange for avoiding a big job re-run (or for shortening 
> the job's running time) is quite effective in time/resource cost. 
> We should allow the user to have this option (ignore failures or skip the job 
> cleanup stage completely), especially when the user knows the cleanup failure 
> is not due to an abnormal HDFS status but to another FS's different 
> performance trade-off.





[jira] [Updated] (MAPREDUCE-6478) Add an option to skip cleanupJob stage or ignore cleanup failure during commitJob().

2015-09-15 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6478:
--
Description: 
In some of our test cases for MR in public cloud scenarios, a very big MR job 
with hundreds or thousands of reducers cannot finish successfully because of 
job cleanup failures. These are caused by the different scale/performance 
characteristics of cloud file systems (like AzureFS), which replace HDFS's 
whole-directory deletion with REST API calls that delete each sub-directory 
recursively. Even when cleanup succeeds, it can take much longer (hours), 
which is unnecessary and wastes time and resources, especially in a public 
cloud scenario. 
In these scenarios, it makes more sense to ignore some cleanupJob failures, or 
to let the user skip cleanupJob() completely. Letting the whole job finish 
successfully, at the cost of wasting some user space, is much better because 
users' jobs usually come and go in a public cloud; tolerating some leftover 
temporary files in exchange for avoiding a big job re-run (or for shortening 
the job's running time) is quite effective in time/resource cost. 
We should allow the user to have this option (ignore failures or skip the job 
cleanup stage completely), especially when the user knows the cleanup failure 
is not due to an abnormal HDFS status but to another FS's different 
performance trade-off.

  was:
In some of our test cases for MR in public cloud scenarios, a very big MR job 
with hundreds or thousands of reducers cannot finish successfully because of 
job cleanup failures. These are caused by the different scale/performance 
characteristics of cloud file systems (like AzureFS), which replace HDFS's 
whole-directory deletion with REST API calls that delete each sub-directory 
recursively. That can take much longer (hours), which is unnecessary in a 
public cloud scenario. 
Cleanup is also more likely to fail in these cases, so some cleanupJob 
failures can be ignored. Letting the whole job finish successfully, at the 
cost of wasting some user space, makes more sense here because users' jobs 
usually come and go in a public cloud; tolerating some leftover temporary 
files in exchange for avoiding a big job re-run is quite cost effective. 
We should allow the user to have this option (ignore failures or skip the job 
cleanup stage completely), especially when the user knows the cleanup failure 
is not due to an abnormal HDFS status but to another FS's different 
performance trade-off.


> Add an option to skip cleanupJob stage or ignore cleanup failure during 
> commitJob().
> 
>
> Key: MAPREDUCE-6478
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6478
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Junping Du
>Assignee: Junping Du
>
> In some of our test cases for MR in public cloud scenarios, a very big MR job 
> with hundreds or thousands of reducers cannot finish successfully because of 
> job cleanup failures. These are caused by the different scale/performance 
> characteristics of cloud file systems (like AzureFS), which replace HDFS's 
> whole-directory deletion with REST API calls that delete each sub-directory 
> recursively. Even when cleanup succeeds, it can take much longer (hours), 
> which is unnecessary and wastes time and resources, especially in a public 
> cloud scenario. 
> In these scenarios, it makes more sense to ignore some cleanupJob failures, 
> or to let the user skip cleanupJob() completely. Letting the whole job finish 
> successfully, at the cost of wasting some user space, is much better because 
> users' jobs usually come and go in a public cloud; tolerating some leftover 
> temporary files in exchange for avoiding a big job re-run (or for shortening 
> the job's running time) is quite effective in time/resource cost. 
> We should allow the user to have this option (ignore failures or skip the job 
> cleanup stage completely), especially when the user knows the cleanup failure 
> is not due to an abnormal HDFS status but to another FS's different 
> performance trade-off.





[jira] [Created] (MAPREDUCE-6478) Add an option to skip cleanupJob stage or ignore cleanup failure during commitJob().

2015-09-15 Thread Junping Du (JIRA)
Junping Du created MAPREDUCE-6478:
-

 Summary: Add an option to skip cleanupJob stage or ignore cleanup 
failure during commitJob().
 Key: MAPREDUCE-6478
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6478
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Junping Du
Assignee: Junping Du


In some of our test cases for MR in public cloud scenarios, a very big MR job 
with hundreds or thousands of reducers cannot finish successfully because of 
job cleanup failures. These are caused by the different scale/performance 
characteristics of cloud file systems (like AzureFS), which replace HDFS's 
whole-directory deletion with REST API calls that delete each sub-directory 
recursively. That can take much longer (hours), which is unnecessary in a 
public cloud scenario. 
Cleanup is also more likely to fail in these cases, so some cleanupJob 
failures can be ignored. Letting the whole job finish successfully, at the 
cost of wasting some user space, makes more sense here because users' jobs 
usually come and go in a public cloud; tolerating some leftover temporary 
files in exchange for avoiding a big job re-run is quite cost effective. 
We should allow the user to have this option (ignore failures or skip the job 
cleanup stage completely), especially when the user knows the cleanup failure 
is not due to an abnormal HDFS status but to another FS's different 
performance trade-off.





[jira] [Commented] (MAPREDUCE-6458) Figure out the way to pass build-in classpath (files in distributed cache, etc.) from parent to spawned shells

2015-08-21 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706488#comment-14706488
 ] 

Junping Du commented on MAPREDUCE-6458:
---

bq. Re-assigning this to me and updating the description to reflect reality, 
since I actually understand how bash works.
Please feel free to take it if you have bandwidth to work on it immediately.

> Figure out the way to pass build-in classpath (files in distributed cache, 
> etc.) from parent to spawned shells
> --
>
> Key: MAPREDUCE-6458
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6458
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Allen Wittenauer
>
> In MAPREDUCE-6454 (targeted for branch-2.x), we provide a way, with 
> constraints, to pass the built-in classpath from the parent to the child 
> shell via HADOOP_CLASSPATH, so that jars in the distributed cache still work 
> in child tasks. In trunk, we may want a different approach, such as 
> introducing an additional env var to pass the built-in classpath safely.





[jira] [Updated] (MAPREDUCE-6458) Figure out the way to pass build-in classpath (files in distributed cache, etc.) from parent to spawned shells

2015-08-21 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6458:
--
Description: In MAPREDUCE-6454 (targeted for branch-2.x), we provide a way, 
with constraints, to pass the built-in classpath from the parent to the child 
shell via HADOOP_CLASSPATH, so that jars in the distributed cache still work 
in child tasks. In trunk, we may want a different approach, such as 
introducing an additional env var to pass the built-in classpath safely.  
(was: In MAPREDUCE-6454 (targeted for branch-2.x), we provide an extremely 
fragile way to pass the built-in classpath from the parent to the child shell 
via HADOOP_CLASSPATH, so that jars in the distributed cache still work in 
child tasks. In trunk, we may want a different approach, such as introducing 
an additional env var to pass the built-in classpath safely.)

> Figure out the way to pass build-in classpath (files in distributed cache, 
> etc.) from parent to spawned shells
> --
>
> Key: MAPREDUCE-6458
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6458
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Allen Wittenauer
>
> In MAPREDUCE-6454 (targeted for branch-2.x), we provide a way, with 
> constraints, to pass the built-in classpath from the parent to the child 
> shell via HADOOP_CLASSPATH, so that jars in the distributed cache still work 
> in child tasks. In trunk, we may want a different approach, such as 
> introducing an additional env var to pass the built-in classpath safely.





[jira] [Commented] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-20 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706044#comment-14706044
 ] 

Junping Du commented on MAPREDUCE-6454:
---

Thanks [~vinodkv] for the review/commit and [~aw] for the comments.
bq. For trunk at least, it would probably be better to have a different var 
that is handled via mapreduce's shellprofile.d bit.
This also sounds like a good approach. I agree that we can discuss it further 
for trunk later (in another JIRA).

bq. We will have to think more about the right-approach for trunk. Will open a 
separate ticket for this.
+1. Just filed MAPREDUCE-6458 to address this.

> MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.
> 
>
> Key: MAPREDUCE-6454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Fix For: 2.7.2, 2.6.2
>
> Attachments: MAPREDUCE-6454-v2.1.patch, MAPREDUCE-6454-v2.patch, 
> MAPREDUCE-6454-v3.1.patch, MAPREDUCE-6454-v3.patch, MAPREDUCE-6454.patch
>
>
> We already add the lib jars in the distributed cache to CLASSPATH. However, 
> in some corner cases (such as MR local mode or a Hive map-side local join), 
> we also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can 
> pick them up when launching the RunJar process.





[jira] [Created] (MAPREDUCE-6458) Figure out the way to pass build-in classpath (files in distributed cache, etc.) from parent to spawned shells

2015-08-20 Thread Junping Du (JIRA)
Junping Du created MAPREDUCE-6458:
-

 Summary: Figure out the way to pass build-in classpath (files in 
distributed cache, etc.) from parent to spawned shells
 Key: MAPREDUCE-6458
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6458
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du


In MAPREDUCE-6454 (targeted for branch-2.x), we provide a way to pass the 
built-in classpath from the parent to the child shell via HADOOP_CLASSPATH, so 
that jars in the distributed cache still work in child tasks. In trunk, we may 
want a different approach, such as introducing an additional env var to pass 
the built-in classpath in a more involved way.





[jira] [Updated] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-20 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6454:
--
Attachment: MAPREDUCE-6454-v3.1.patch

Fixed minor issues such as javadoc and checkstyle warnings.

> MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.
> 
>
> Key: MAPREDUCE-6454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6454-v2.1.patch, MAPREDUCE-6454-v2.patch, 
> MAPREDUCE-6454-v3.1.patch, MAPREDUCE-6454-v3.patch, MAPREDUCE-6454.patch
>
>
> We already add the lib jars in the distributed cache to CLASSPATH. However, 
> in some corner cases (such as MR local mode or a Hive map-side local join), 
> we also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can 
> pick them up when launching the RunJar process.





[jira] [Updated] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-20 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6454:
--
Status: Patch Available  (was: Open)

> MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.
> 
>
> Key: MAPREDUCE-6454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6454-v2.1.patch, MAPREDUCE-6454-v2.patch, 
> MAPREDUCE-6454-v3.patch, MAPREDUCE-6454.patch
>
>
> We already add the lib jars in the distributed cache to CLASSPATH. However, 
> in some corner cases (such as MR local mode or a Hive map-side local join), 
> we also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can 
> pick them up when launching the RunJar process.





[jira] [Updated] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-20 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6454:
--
Attachment: MAPREDUCE-6454-v3.patch

The previous fixes did not correctly resolve the issues we hit. The v3 patch 
is verified to work.

> MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.
> 
>
> Key: MAPREDUCE-6454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6454-v2.1.patch, MAPREDUCE-6454-v2.patch, 
> MAPREDUCE-6454-v3.patch, MAPREDUCE-6454.patch
>
>
> We already add the lib jars in the distributed cache to CLASSPATH. However, 
> in some corner cases (such as MR local mode or a Hive map-side local join), 
> we also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can 
> pick them up when launching the RunJar process.





[jira] [Updated] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-19 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6454:
--
Status: Open  (was: Patch Available)

> MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.
> 
>
> Key: MAPREDUCE-6454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6454-v2.1.patch, MAPREDUCE-6454-v2.patch, 
> MAPREDUCE-6454.patch
>
>
> We already add the lib jars in the distributed cache to CLASSPATH. However, 
> in some corner cases (such as MR local mode or a Hive map-side local join), 
> we also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can 
> pick them up when launching the RunJar process.





[jira] [Updated] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-19 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6454:
--
Attachment: MAPREDUCE-6454-v2.1.patch

Fixed some whitespace issues in the v2.1 patch.

> MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.
> 
>
> Key: MAPREDUCE-6454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6454-v2.1.patch, MAPREDUCE-6454-v2.patch, 
> MAPREDUCE-6454.patch
>
>
> We already add the lib jars in the distributed cache to CLASSPATH. However, 
> in some corner cases (such as MR local mode or a Hive map-side local join), 
> we also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can 
> pick them up when launching the RunJar process.





[jira] [Updated] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-19 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6454:
--
Status: Patch Available  (was: Open)

> MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.
> 
>
> Key: MAPREDUCE-6454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6454-v2.1.patch, MAPREDUCE-6454-v2.patch, 
> MAPREDUCE-6454.patch
>
>
> We already add the lib jars in the distributed cache to CLASSPATH. However, 
> in some corner cases (such as MR local mode or a Hive map-side local join), 
> we also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can 
> pick them up when launching the RunJar process.





[jira] [Updated] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-19 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6454:
--
Status: Open  (was: Patch Available)

> MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.
> 
>
> Key: MAPREDUCE-6454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6454-v2.patch, MAPREDUCE-6454.patch
>
>
> We already add the lib jars in the distributed cache to CLASSPATH. However, 
> in some corner cases (such as MR local mode or a Hive map-side local join), 
> we also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can 
> pick them up when launching the RunJar process.





[jira] [Updated] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-19 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6454:
--
Attachment: MAPREDUCE-6454-v2.patch

The test failure occurs because YarnConfiguration.YARN_APPLICATION_CLASSPATH 
resolves to an empty string when it is not set. In that case, we should use 
YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH as the default value. 
This is fixed in the v2 patch.
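
A rough sketch of that fallback in plain Java, not the patch itself. A Map 
stands in for the Hadoop Configuration, the class and method names are 
hypothetical, and the default entries below are placeholders rather than the 
real contents of DEFAULT_YARN_APPLICATION_CLASSPATH. In Hadoop itself this 
would presumably be a Configuration.getStrings(name, defaultValue...) call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of "use the default when the setting is empty". */
final class ClasspathDefaultSketch {
    static final String APP_CLASSPATH_KEY = "yarn.application.classpath";

    // Placeholder defaults, for illustration only.
    static final List<String> DEFAULT_APP_CLASSPATH = Collections.unmodifiableList(
            Arrays.asList("$HADOOP_CONF_DIR", "$HADOOP_COMMON_HOME/share/hadoop/common/*"));

    /** Returns the configured entries, or the defaults if the key is unset or blank. */
    static List<String> applicationClasspath(Map<String, String> conf) {
        String raw = conf.get(APP_CLASSPATH_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_APP_CLASSPATH;
        }
        List<String> entries = new ArrayList<>();
        for (String entry : raw.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(applicationClasspath(conf)); // falls back to the defaults
        conf.put(APP_CLASSPATH_KEY, "/opt/a/*, /opt/b/*");
        System.out.println(applicationClasspath(conf)); // uses the configured entries
    }
}
```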

> MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.
> 
>
> Key: MAPREDUCE-6454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6454-v2.patch, MAPREDUCE-6454.patch
>
>
> We already add the lib jars in the distributed cache to CLASSPATH. However, 
> in some corner cases (such as MR local mode or a Hive map-side local join), 
> we also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can 
> pick them up when launching the RunJar process.





[jira] [Updated] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-18 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6454:
--
Attachment: MAPREDUCE-6454.patch

MRApps already includes files in the distributed cache when adding entries to 
its classpath env. However, "jar"s are excluded for some reason (which no 
longer seems valid). This patch adds them back.

> MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.
> 
>
> Key: MAPREDUCE-6454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6454.patch
>
>
> We already add the lib jars in the distributed cache to CLASSPATH. However, 
> in some corner cases (such as MR local mode or a Hive map-side local join), 
> we also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can 
> pick them up when launching the RunJar process.





[jira] [Updated] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-18 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6454:
--
Status: Patch Available  (was: Open)

> MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.
> 
>
> Key: MAPREDUCE-6454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6454.patch
>
>
> We already add the lib jars in the distributed cache to CLASSPATH. However, 
> in some corner cases (such as MR local mode or a Hive map-side local join), 
> we also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can 
> pick them up when launching the RunJar process.





[jira] [Created] (MAPREDUCE-6454) MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in distributed cache.

2015-08-18 Thread Junping Du (JIRA)
Junping Du created MAPREDUCE-6454:
-

 Summary: MapReduce doesn't set the HADOOP_CLASSPATH for jar lib in 
distributed cache.
 Key: MAPREDUCE-6454
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6454
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


We already add the lib jars in the distributed cache to CLASSPATH. However, in 
some corner cases (such as MR local mode or a Hive map-side local join), we 
also need these jars on HADOOP_CLASSPATH so that the hadoop scripts can pick 
them up when launching the RunJar process.





[jira] [Updated] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server

2015-08-06 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6443:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

> Add JvmPauseMonitor to Job History Server
> -
>
> Key: MAPREDUCE-6443
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6443
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobhistoryserver
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-6443.001.patch, MAPREDUCE-6443.002.patch
>
>
> We should add the {{JvmPauseMonitor}} from HADOOP-9618 to the Job History 
> Server.





[jira] [Commented] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server

2015-08-06 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660053#comment-14660053
 ] 

Junping Du commented on MAPREDUCE-6443:
---

I have committed the patch to trunk and branch-2. Thanks [~rkanter] for the 
contribution!

> Add JvmPauseMonitor to Job History Server
> -
>
> Key: MAPREDUCE-6443
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6443
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobhistoryserver
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: MAPREDUCE-6443.001.patch, MAPREDUCE-6443.002.patch
>
>
> We should add the {{JvmPauseMonitor}} from HADOOP-9618 to the Job History 
> Server.





[jira] [Commented] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server

2015-08-06 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660017#comment-14660017
 ] 

Junping Du commented on MAPREDUCE-6443:
---

+1. 002 patch LGTM. Will commit it shortly.

> Add JvmPauseMonitor to Job History Server
> -
>
> Key: MAPREDUCE-6443
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6443
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobhistoryserver
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: MAPREDUCE-6443.001.patch, MAPREDUCE-6443.002.patch
>
>
> We should add the {{JvmPauseMonitor}} from HADOOP-9618 to the Job History 
> Server.





[jira] [Commented] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server

2015-08-05 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655241#comment-14655241
 ] 

Junping Du commented on MAPREDUCE-6443:
---

Thanks for the patch, Robert. A similar comment to the one on YARN-4019: can 
we move the initialization of pauseMonitor to serviceInit() and leave only the 
start in serviceStart()?
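
To make the suggestion concrete, here is a small sketch of the intended split. 
It uses hypothetical stand-in classes (PauseMonitorStub, HistoryServerStub) in 
plain Java; it is not the real JvmPauseMonitor or Job History Server code, and 
the configure() step is an assumption about what "initialization" would cover.

```java
/** Hypothetical sketch of splitting monitor setup between init and start. */
final class LifecycleSketch {

    /** Stand-in for a pause monitor; not Hadoop's JvmPauseMonitor. */
    static final class PauseMonitorStub {
        private boolean configured;
        private boolean running;

        void configure() { configured = true; }

        void start() {
            if (!configured) {
                throw new IllegalStateException("start() called before configure()");
            }
            running = true;
        }

        void stop() { running = false; }

        boolean isRunning() { return running; }
    }

    /** Stand-in for the service that owns the monitor. */
    static final class HistoryServerStub {
        private PauseMonitorStub pauseMonitor;

        // Construction and configuration happen at init time.
        void serviceInit() {
            pauseMonitor = new PauseMonitorStub();
            pauseMonitor.configure();
        }

        // Only the start call is left for start time.
        void serviceStart() { pauseMonitor.start(); }

        void serviceStop() {
            if (pauseMonitor != null) {
                pauseMonitor.stop();
            }
        }

        boolean monitorRunning() {
            return pauseMonitor != null && pauseMonitor.isRunning();
        }
    }

    public static void main(String[] args) {
        HistoryServerStub server = new HistoryServerStub();
        server.serviceInit();
        server.serviceStart();
        System.out.println("monitor running: " + server.monitorRunning());
        server.serviceStop();
    }
}
```

With this split, serviceStop() can safely be called on a service that was 
initialized but never started.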

> Add JvmPauseMonitor to Job History Server
> -
>
> Key: MAPREDUCE-6443
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6443
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobhistoryserver
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: MAPREDUCE-6443.001.patch
>
>
> We should add the {{JvmPauseMonitor}} from HADOOP-9618 to the Job History 
> Server.





[jira] [Moved] (MAPREDUCE-6441) LocalDistributedCacheManager for concurrent sqoop processes fails to create unique directories

2015-07-31 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du moved HADOOP-10924 to MAPREDUCE-6441:


Key: MAPREDUCE-6441  (was: HADOOP-10924)
Project: Hadoop Map/Reduce  (was: Hadoop Common)

> LocalDistributedCacheManager for concurrent sqoop processes fails to create 
> unique directories
> --
>
> Key: MAPREDUCE-6441
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6441
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: William Watson
>Assignee: William Watson
> Attachments: HADOOP-10924.02.patch, 
> HADOOP-10924.03.jobid-plus-uuid.patch
>
>
> Kicking off many sqoop processes in different threads results in:
> {code}
> 2014-08-01 13:47:24 -0400:  INFO - 14/08/01 13:47:22 ERROR tool.ImportTool: 
> Encountered IOException running import job: java.io.IOException: 
> java.util.concurrent.ExecutionException: java.io.IOException: Rename cannot 
> overwrite non empty destination directory 
> /tmp/hadoop-hadoop/mapred/local/1406915233073
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:149)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> java.security.AccessController.doPrivileged(Native Method)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> javax.security.auth.Subject.doAs(Subject.java:415)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:186)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:159)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:239)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:645)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:415)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.Sqoop.run(Sqoop.java:145)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
> 2014-08-01 13:47:24 -0400:  INFO -at 
> org.apache.sqoop.Sqoop.main(Sqoop.java:238)
> {code}
> This happens if two are kicked off in the same second. The issue is in the
> following lines of code in the
> org.apache.hadoop.mapred.LocalDistributedCacheManager class:
> {code}
> // Generating unique numbers for FSDownload.
> AtomicLong uniqueNumberGenerator =
>     new AtomicLong(System.currentTimeMillis());
> {code}
> and 
> {code}
> Long.toString(uniqueNumberGenerator.incrementAndGet())),
> {code}
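
To illustrate why a clock-seeded counter is fragile, here is a minimal sketch
(not the Hadoop code). It assumes two local jobs read the same clock value at
startup; overlapping ids also occur whenever the two seeds are closer together
than the number of ids drawn. The class name SeedCollisionDemo and the helper
nextIds are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class; not part of Hadoop.
class SeedCollisionDemo {

    // Mimics the quoted pattern: a counter seeded with a timestamp,
    // then incremented once per requested id.
    static List<String> nextIds(long seedMillis, int count) {
        AtomicLong uniqueNumberGenerator = new AtomicLong(seedMillis);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            ids.add(Long.toString(uniqueNumberGenerator.incrementAndGet()));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Assumption: both jobs observe the same clock value at startup.
        long sameInstant = System.currentTimeMillis();
        List<String> jobA = nextIds(sameInstant, 3);
        List<String> jobB = nextIds(sameInstant, 3);
        System.out.println("job A ids: " + jobA);
        System.out.println("job B ids: " + jobB);
        System.out.println("ids collide: " + jobA.equals(jobB));
    }
}
```

Each generator is unique only within its own process, so names derived from it
can clash when several processes share a directory.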



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAPREDUCE-6424) Store MR counters as timeline metrics instead of event

2015-07-02 Thread Junping Du (JIRA)
Junping Du created MAPREDUCE-6424:
-

 Summary: Store MR counters as timeline metrics instead of event
 Key: MAPREDUCE-6424
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6424
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du


In MAPREDUCE-6327, map/reduce counters from JobFinishedEvent are encoded as
timeline events, with the counter details in JSON format.
We need to store framework-specific counters as metrics in the timeline service
to support queries, aggregation, etc.





[jira] [Updated] (MAPREDUCE-6232) Task state is running when all task attempts fail

2015-05-22 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6232:
--
Labels:   (was: BB2015-05-RFC)

> Task state is running when all task attempts fail
> -
>
> Key: MAPREDUCE-6232
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6232
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 2.6.0
>Reporter: Yang Hao
>Assignee: Yang Hao
> Attachments: MAPREDUCE-6232.patch, MAPREDUCE-6232.v2.patch, 
> TaskImpl.new.png, TaskImpl.normal.png, result.pdf
>
>
> When task attempts fail, the task's state is still running. One way to fix
> this is to check the task attempts' states: if none of the attempts is
> running, then the task state should not be running.





[jira] [Updated] (MAPREDUCE-6232) Task state is running when all task attempts fail

2015-05-22 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6232:
--
Target Version/s:   (was: 2.6.0)

> Task state is running when all task attempts fail
> -
>
> Key: MAPREDUCE-6232
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6232
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 2.6.0
>Reporter: Yang Hao
>Assignee: Yang Hao
> Attachments: MAPREDUCE-6232.patch, MAPREDUCE-6232.v2.patch, 
> TaskImpl.new.png, TaskImpl.normal.png, result.pdf
>
>
> When task attempts fail, the task's state is still running. One way to fix
> this is to check the task attempts' states: if none of the attempts is
> running, then the task state should not be running.





[jira] [Commented] (MAPREDUCE-5965) Hadoop streaming throws error if list of input files is high. Error is: "error=7, Argument list too long at if number of input file is high"

2015-05-22 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555957#comment-14555957
 ] 

Junping Du commented on MAPREDUCE-5965:
---

Thanks, everyone, for the good discussion here. +1 on the overall solution. I
agree that, in line with previous practice, we don't need to add the new
streaming configuration to *-default.xml.

bq. If you really want to make it configurable the easiest way would be to roll 
the two settings in one. We could make the stream.truncate.long.jobconf.values 
an integer: -1 do not truncate otherwise truncate at the length given.
That sounds better. Maybe we should rename
"stream.truncate.long.jobconf.values" to something like
"stream.jobconf.truncate.limit", and document that -1 is the default value,
which disables truncation, and that 20K is a reasonable value for most cases?
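
As a sketch of the semantics proposed above (not the committed patch): a single
integer setting where -1 means "do not truncate" and any other value is the
maximum length to keep. The class JobConfTruncation and the method
truncateValue are hypothetical names, and "stream.jobconf.truncate.limit" is
only the name suggested in this comment.

```java
// Hypothetical helper illustrating the proposed single-integer setting.
class JobConfTruncation {

    // Proposed default from the discussion: -1 disables truncation.
    static final int NO_TRUNCATE = -1;

    // Returns the value unchanged when the limit is negative or the value
    // already fits; otherwise keeps only the first `limit` characters.
    static String truncateValue(String value, int limit) {
        if (limit < 0 || value.length() <= limit) {
            return value;
        }
        return value.substring(0, limit);
    }

    public static void main(String[] args) {
        String inputDirs = "hdfs://nn/a,hdfs://nn/b,hdfs://nn/c";
        System.out.println(truncateValue(inputDirs, NO_TRUNCATE));
        System.out.println(truncateValue(inputDirs, 11));
    }
}
```

Rolling the on/off switch and the length into one value keeps the number of
new settings to one, which is the point of the suggestion quoted above.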

> Hadoop streaming throws error if list of input files is high. Error is: 
> "error=7, Argument list too long at if number of input file is high"
> 
>
> Key: MAPREDUCE-5965
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5965
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Arup Malakar
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-5965.1.patch, MAPREDUCE-5965.2.patch, 
> MAPREDUCE-5965.patch
>
>
> Hadoop streaming exposes all the key/value pairs in the job conf as
> environment variables when it forks a process to run the streaming code.
> Unfortunately, the variable mapreduce_input_fileinputformat_inputdir contains
> the list of input files, and Linux has a limit on the combined size of
> environment variables and arguments.
> Depending on how long the list of files and their full paths is, this
> variable can be very large. Given that these variables may not even be used,
> the limit stops users from running a Hadoop job with a large number of
> files, even though the job could otherwise run.
> Linux returns E2BIG (error code 7) if the size exceeds the limit, and Java
> translates that to "error=7, Argument list too long". More:
> http://man7.org/linux/man-pages/man2/execve.2.html
> I suggest skipping a variable if it is longer than a certain length. That
> way, only user code that requires the skipped environment variable would
> fail. We should also introduce a config variable to skip long variables, set
> to false by default, so that a user has to explicitly set it to true to
> enable this feature.
> Here is the exception:
> {code}
> Error: java.lang.RuntimeException: Error in configuring object
>  at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
>  at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
>  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> Caused by: java.lang.reflect.InvocationTargetException
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
>  at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
>  ... 9 more
> Caused by: java.lang.RuntimeException: Error in configuring object
>  at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
>  at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
>  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>  at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
>  ... 14 more
> Caused by: java.lang.reflect.InvocationTargetException
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
>  at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
>  ... 17 more
> Caused by: java.lang.RuntimeException: configuration exception
>  at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
>  at org.apache.hadoop.streaming.PipeMapper.configure(PipeMa

[jira] [Updated] (MAPREDUCE-6361) NPE issue in shuffle caused by concurrent issue between copySucceeded() in one thread and copyFailed() in another thread on the same host

2015-05-19 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6361:
--
Target Version/s: 2.7.1  (was: 2.8.0)
   Fix Version/s: (was: 2.8.0)
  2.7.1

Thanks [~ozawa] for reviewing and committing the patch! I moved the commit from
2.8 to 2.7.1, as we need this fix ASAP.

> NPE issue in shuffle caused by concurrent issue between copySucceeded() in 
> one thread and copyFailed() in another thread on the same host
> -
>
> Key: MAPREDUCE-6361
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6361
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: MAPREDUCE-6361-v1.patch
>
>
> The failure in log:
> 2015-05-08 21:00:00,513 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
> shuffle in fetcher#25
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:267)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)





[jira] [Commented] (MAPREDUCE-6164) "mapreduce.reduce.shuffle.fetch.retry.timeout-ms" should be set to 3 minutes instead of 30 seconds by default to be consistent with other retry timeout

2015-05-19 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550284#comment-14550284
 ] 

Junping Du commented on MAPREDUCE-6164:
---

Do we still need this? It seems to have been pending for a long time...

> "mapreduce.reduce.shuffle.fetch.retry.timeout-ms" should be set to 3 minutes 
> instead of 30 seconds by default to be consistent with other retry timeout 
> 
>
> Key: MAPREDUCE-6164
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6164
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-6164.patch
>
>
> In MAPREDUCE-5891, we are adding retry logic to the MapReduce shuffle stage
> so that the fetcher can survive NM downtime (with the shuffle service down as
> well). In many places, we set the default timeout to 3 minutes
> (connection timeout, etc.) to tolerate a possibly longer NM downtime, but we
> set "mapreduce.reduce.shuffle.fetch.retry.timeout-ms" to 30 seconds,
> which is inconsistent. We should change this to 180 seconds.
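
Since the property is expressed in milliseconds, the proposed 3 minutes
corresponds to a value of 180000. A minimal sketch of the conversion follows;
it uses a plain java.util.Properties object as a stand-in for the Hadoop
configuration (an assumption made only to keep the example self-contained),
and the class name FetchRetryTimeoutExample is hypothetical.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

// Stand-in example; a real job would set this on its Hadoop configuration.
class FetchRetryTimeoutExample {

    static final String KEY = "mapreduce.reduce.shuffle.fetch.retry.timeout-ms";

    // Builds a properties object holding the proposed 3-minute timeout.
    static Properties withThreeMinuteTimeout() {
        Properties conf = new Properties();
        long timeoutMs = TimeUnit.MINUTES.toMillis(3); // 180 seconds in ms
        conf.setProperty(KEY, Long.toString(timeoutMs));
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(KEY + "=" + withThreeMinuteTimeout().getProperty(KEY));
    }
}
```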





[jira] [Updated] (MAPREDUCE-6361) NPE issue in shuffle caused by concurrent issue between copySucceeded() in one thread and copyFailed() in another thread on the same host

2015-05-12 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6361:
--
Status: Patch Available  (was: Open)

> NPE issue in shuffle caused by concurrent issue between copySucceeded() in 
> one thread and copyFailed() in another thread on the same host
> -
>
> Key: MAPREDUCE-6361
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6361
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6361-v1.patch
>
>
> The failure in log:
> 2015-05-08 21:00:00,513 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
> shuffle in fetcher#25
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:267)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)





[jira] [Updated] (MAPREDUCE-6361) NPE issue in shuffle caused by concurrent issue between copySucceeded() in one thread and copyFailed() in another thread on the same host

2015-05-12 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6361:
--
Attachment: MAPREDUCE-6361-v1.patch

Uploaded a patch implementing the 2nd solution proposed above, with a unit test.

> NPE issue in shuffle caused by concurrent issue between copySucceeded() in 
> one thread and copyFailed() in another thread on the same host
> -
>
> Key: MAPREDUCE-6361
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6361
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: MAPREDUCE-6361-v1.patch
>
>
> The failure in log:
> 2015-05-08 21:00:00,513 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
> shuffle in fetcher#25
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:267)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)





[jira] [Commented] (MAPREDUCE-6361) NPE issue in shuffle caused by concurrent issue between copySucceeded() in one thread and copyFailed() in another thread on the same host

2015-05-12 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539594#comment-14539594
 ] 

Junping Du commented on MAPREDUCE-6361:
---

There are basically two ways to fix the race condition here:
1. Extract the following code into a synchronized method, so that
copySucceeded() is blocked until copyFailed() has finished.
{code}
scheduler.hostFailed(host.getHostName());
for (TaskAttemptID left : failedTasks) {
  scheduler.copyFailed(left, host, true, false);
}
{code}
This is likely to have a larger performance impact on shuffle, because a
failure to fetch map output on one thread would block copySucceeded() on other
threads for longer.

2. Update copyFailed() to assume that the host's entry in hostFailures could
have been cleaned up by another thread. In that case, add the host back to
hostFailures, as if this were the first time the host failed.

I prefer the 2nd option, which is more lightweight. I will deliver a quick
patch soon.
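
A minimal sketch of the 2nd option, under simplifying assumptions: it is not
the actual patch, HostFailureTracker is a hypothetical stand-in for the
scheduler's host bookkeeping, and plain Integer counts replace the real value
type.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical stand-in for the scheduler's host bookkeeping.
class HostFailureTracker {
    private final Map<String, Integer> hostFailures = new HashMap<>();
    private final int maxHostFailures;

    HostFailureTracker(int maxHostFailures) {
        this.maxHostFailures = maxHostFailures;
    }

    // Records one more failure for the host.
    synchronized void hostFailed(String hostname) {
        hostFailures.merge(hostname, 1, Integer::sum);
    }

    // A successful copy clears the host's failure history.
    synchronized void copySucceeded(String hostname) {
        hostFailures.remove(hostname);
    }

    // Option 2: tolerate the entry having been cleared by another thread and
    // treat this as the host's first failure instead of dereferencing null.
    // Returns true when the host has exceeded the allowed failure count.
    synchronized boolean copyFailed(String hostname) {
        Integer failures = hostFailures.get(hostname);
        if (failures == null) {
            failures = 1;
            hostFailures.put(hostname, failures);
        }
        return failures > maxHostFailures;
    }

    public static void main(String[] args) {
        HostFailureTracker tracker = new HostFailureTracker(3);
        tracker.hostFailed("host1");     // fetcher A records a failure
        tracker.copySucceeded("host1");  // fetcher B succeeds in between
        // The missing entry is re-created here rather than causing an NPE.
        System.out.println("host marked failed: " + tracker.copyFailed("host1"));
    }
}
```

The null check costs one extra map lookup and does not widen any lock scope,
which is why this variant is lighter than wrapping both calls in one
synchronized method.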

> NPE issue in shuffle caused by concurrent issue between copySucceeded() in 
> one thread and copyFailed() in another thread on the same host
> -
>
> Key: MAPREDUCE-6361
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6361
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
>
> The failure in log:
> 2015-05-08 21:00:00,513 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
> shuffle in fetcher#25
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:267)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)





[jira] [Updated] (MAPREDUCE-6164) "mapreduce.reduce.shuffle.fetch.retry.timeout-ms" should be set to 3 minutes instead of 30 seconds by default to be consistent with other retry timeout

2015-05-11 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6164:
--
Target Version/s: 2.8.0  (was: 2.6.1)

> "mapreduce.reduce.shuffle.fetch.retry.timeout-ms" should be set to 3 minutes 
> instead of 30 seconds by default to be consistent with other retry timeout 
> 
>
> Key: MAPREDUCE-6164
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6164
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-6164.patch
>
>
> In MAPREDUCE-5891, we are adding retry logic to the MapReduce shuffle stage
> so that the fetcher can survive NM downtime (with the shuffle service down as
> well). In many places, we set the default timeout to 3 minutes
> (connection timeout, etc.) to tolerate a possibly longer NM downtime, but we
> set "mapreduce.reduce.shuffle.fetch.retry.timeout-ms" to 30 seconds,
> which is inconsistent. We should change this to 180 seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6361) NPE issue in shuffle caused by concurrent issue between copySucceeded() in one thread and copyFailed() in another thread on the same host

2015-05-11 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6361:
--
Priority: Critical  (was: Major)

> NPE issue in shuffle caused by concurrent issue between copySucceeded() in 
> one thread and copyFailed() in another thread on the same host
> -
>
> Key: MAPREDUCE-6361
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6361
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
>
> The failure in log:
> 2015-05-08 21:00:00,513 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
> shuffle in fetcher#25
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:267)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)





[jira] [Commented] (MAPREDUCE-6361) NPE issue in shuffle caused by concurrent issue between copySucceeded() in one thread and copyFailed() in another thread on the same host

2015-05-11 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538071#comment-14538071
 ] 

Junping Du commented on MAPREDUCE-6361:
---

The NPE is thrown in copyFailed() at ShuffleSchedulerImpl.java:267:
{code}
boolean hostFail = hostFailures.get(hostname).get() > getMaxHostFailures() ? true : false;
{code}
This means hostFailures does not contain the hostname that just failed, which
is unexpected, because we always call hostFailed() to put the host into
hostFailures before calling copyFailed():
{code}
scheduler.hostFailed(host.getHostName());
for(TaskAttemptID left: failedTasks) {
  scheduler.copyFailed(left, host, true, false);
}
{code}
Although hostFailed() and copyFailed() are both synchronized methods (as is
copySucceeded()), the NPE can still occur, and this looks like the only
explanation, if another thread calls copySucceeded() on the same host (for a
different map output) between this thread's calls to hostFailed() and
copyFailed() while it handles one map output failure.
We need to fix this concurrency issue to get rid of the NPE, which fails the
map output copy directly without any retry.
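
The interleaving described above can be replayed on a single thread to make it
reproducible. This is a stripped-down, hypothetical model
(RaceInterleavingDemo is not Hadoop code, and AtomicInteger stands in for the
real counter type); it only shows that individually synchronized methods do
not make the hostFailed()/copyFailed() pair atomic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, stripped-down model of the bookkeeping described above.
class RaceInterleavingDemo {
    private final Map<String, AtomicInteger> hostFailures = new HashMap<>();

    synchronized void hostFailed(String hostname) {
        hostFailures.computeIfAbsent(hostname, h -> new AtomicInteger())
                    .incrementAndGet();
    }

    synchronized void copySucceeded(String hostname) {
        hostFailures.remove(hostname);
    }

    // Mirrors the shape of the failing line: assumes the entry exists.
    synchronized boolean copyFailed(String hostname, int maxHostFailures) {
        return hostFailures.get(hostname).get() > maxHostFailures;
    }

    // Replays the problematic ordering; returns true if the NPE occurs.
    static boolean replayThrowsNpe() {
        RaceInterleavingDemo scheduler = new RaceInterleavingDemo();
        scheduler.hostFailed("host1");        // thread A, first call
        scheduler.copySucceeded("host1");     // thread B runs in between
        try {
            scheduler.copyFailed("host1", 3); // thread A, second call
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE reproduced: " + replayThrowsNpe());
    }
}
```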

> NPE issue in shuffle caused by concurrent issue between copySucceeded() in 
> one thread and copyFailed() in another thread on the same host
> -
>
> Key: MAPREDUCE-6361
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6361
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>
> The failure in log:
> 2015-05-08 21:00:00,513 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
> shuffle in fetcher#25
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:267)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
>  at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)





[jira] [Created] (MAPREDUCE-6361) NPE issue in shuffle caused by concurrent issue between copySucceeded() in one thread and copyFailed() in another thread on the same host

2015-05-11 Thread Junping Du (JIRA)
Junping Du created MAPREDUCE-6361:
-

 Summary: NPE issue in shuffle caused by concurrent issue between 
copySucceeded() in one thread and copyFailed() in another thread on the same 
host
 Key: MAPREDUCE-6361
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6361
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du


The failure in log:
2015-05-08 21:00:00,513 WARN [main] org.apache.hadoop.mapred.YarnChild: 
Exception running child : 
org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle 
in fetcher#25
 at 
org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NullPointerException
 at 
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:267)
 at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
 at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)




