[jira] [Updated] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-07 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6608:

Attachment: WorkPreservingMRAppMaster-2.pdf

Updated high level design

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> A framework for work-preserving AM restart is provided by 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce (MR) applications.  The attached 
> document describes some of the challenges and discusses a few options.  
> We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-07 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136279#comment-15136279
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

I have attached a design patch, 
[Patch1|https://issues.apache.org/jira/secure/attachment/12786705/Patch1.patch], 
that outlines the implementation approach at a high level.  The 
[Design|https://issues.apache.org/jira/secure/attachment/12786706/WorkPreservingMRAppMaster-2.pdf]
 document gives the high-level design.

*Notes:*
1. This is a patch against Apache Hadoop 2.6.1.
2. It works for the example Hadoop sleep job: I killed the AM at random times 
and the in-flight tasks continued to run.
3. SS_DEBUG in the patch marks a debug statement that helps me.  Some of 
these will be removed eventually.
4. SS_FIXME in the patch tags known issues that I have commented on and still 
need to fix.  I will clean these up before the next submission.

I solicit comments on the high level design and the approach I have taken in 
the patch.

*Next Steps:*
1. I will iron out the known issues (all SS_FIXME), clean up the interfaces, 
make the code compliant with Apache coding standards, rebase the code against 
trunk, and test it thoroughly.  I will factor in the comments and suggestions 
made on the design doc and design patch.
2. I will identify the components and issues involved and raise sub-tasks.






[jira] [Updated] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-07 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6608:

Attachment: Patch1.patch

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster.pdf
>
>
> A framework for work-preserving AM restart is provided by 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce (MR) applications.  The attached 
> document describes some of the challenges and discusses a few options.  
> We solicit feedback from the community.





[jira] [Created] (MAPREDUCE-6628) Potential memory leak in CryptoOutputStream

2016-02-07 Thread Mariappan Asokan (JIRA)
Mariappan Asokan created MAPREDUCE-6628:
---

 Summary: Potential memory leak in CryptoOutputStream
 Key: MAPREDUCE-6628
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6628
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: security
Reporter: Mariappan Asokan
Assignee: Mariappan Asokan


There is a potential memory leak in {{CryptoOutputStream.java}}.  It allocates 
two direct byte buffers ({{inBuffer}} and {{outBuffer}}) that are freed when 
the {{close()}} method is called.  Most of the time, {{close()}} is called.  
However, when writing to the intermediate Map output file or the spill files in 
{{MapTask}}, {{close()}} is never called, since doing so would close the 
underlying stream, which is not desirable.  There is a single underlying 
physical stream that contains multiple logical streams, one per partition of 
Map output.  

By default, the amount of memory allocated per byte buffer is 128 KB, so the 
total memory allocated per {{CryptoOutputStream}} is 256 KB.  This may not 
sound like much.  However, if the number of partitions (or number of reducers) 
is large (in the hundreds) and/or there are spill files created in 
{{MapTask}}, this can grow to a few hundred MB. 
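
As a rough back-of-the-envelope check of the numbers above, the sketch below multiplies the two 128 KB buffers by the number of streams that are never closed.  The class and method names are hypothetical, the counts of 500 partitions and 2 spills are example values only, and the model assumes one unclosed logical stream per partition per spill, which is a simplification.

```java
/** Hypothetical helper for estimating unreleased direct buffer memory. */
final class DirectBufferLeakEstimate {
    static final long BUFFER_BYTES = 128L * 1024; // default per-buffer size cited above
    static final int BUFFERS_PER_STREAM = 2;      // inBuffer and outBuffer

    /** Assumes one unclosed logical stream per partition per spill (a simplification). */
    static long leakedBytes(int partitions, int spills) {
        return BUFFER_BYTES * BUFFERS_PER_STREAM * partitions * spills;
    }

    public static void main(String[] args) {
        // Example values only: 500 partitions, 2 spills.
        long bytes = leakedBytes(500, 2);
        System.out.println(bytes / (1024 * 1024) + " MB"); // prints "250 MB"
    }
}
```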

I can think of two ways to address this issue:

h2. Possible Fix - 1
According to JDK documentation:
{quote}
The contents of direct buffers may reside outside of the normal 
garbage-collected heap, and so their impact upon the memory footprint of an 
application might not be obvious.  It is therefore recommended that direct 
buffers be allocated primarily for large, long-lived buffers that are subject 
to the underlying system's native I/O operations.  In general it is best to 
allocate direct buffers only when they yield a measureable gain in program 
performance.
{quote}
It is not clear to me whether there is any benefit in allocating direct byte 
buffers in {{CryptoOutputStream.java}}.  In fact, there is a slight CPU 
overhead in copying data from {{outBuffer}} to a temporary byte array, as in 
the following code from {{CryptoOutputStream.java}}.
{code}
/*
 * If underlying stream supports {@link ByteBuffer} write in future, needs
 * refine here. 
 */
final byte[] tmp = getTmpBuf();
outBuffer.get(tmp, 0, len);
out.write(tmp, 0, len);
{code}
Even if the underlying stream supports direct byte buffer IO (or direct IO in 
OS parlance), it is not clear whether it will yield any measurable performance 
gain.

The fix would be to allocate a {{ByteBuffer}} on the heap for {{inBuffer}} and 
wrap a byte array in a {{ByteBuffer}} for {{outBuffer}}.  By the way, 
{{inBuffer}} and {{outBuffer}} have to be of type {{ByteBuffer}}, as required 
by the {{encrypt()}} method in {{Encryptor}}.
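
A minimal sketch of the heap-buffer idea, using only the JDK: {{ByteBuffer.allocate()}} and {{ByteBuffer.wrap()}} both produce heap buffers, and a heap buffer exposes its backing array, so the copy into a temporary byte array is no longer needed.  The class name {{HeapBufferSketch}}, the {{drain()}} helper, and the tiny buffer size are hypothetical and for illustration only; this is not the actual {{CryptoOutputStream}} code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Hypothetical illustration; not part of Hadoop. */
final class HeapBufferSketch {
    /** Writes the readable bytes of a heap buffer to out without a temporary copy. */
    static void drain(ByteBuffer outBuffer, OutputStream out) {
        int len = outBuffer.remaining();
        try {
            // array() is available because the buffer is heap-backed.
            out.write(outBuffer.array(),
                      outBuffer.arrayOffset() + outBuffer.position(), len);
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        // Mark the bytes as consumed.
        outBuffer.position(outBuffer.position() + len);
    }

    public static void main(String[] args) {
        int bufferSize = 8; // tiny size for illustration only
        ByteBuffer inBuffer = ByteBuffer.allocate(bufferSize);        // heap allocation
        ByteBuffer outBuffer = ByteBuffer.wrap(new byte[bufferSize]); // wraps a byte array
        outBuffer.put(new byte[] {1, 2, 3});
        outBuffer.flip();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        drain(outBuffer, sink);
        System.out.println(sink.size() + " bytes written, direct=" + inBuffer.isDirect());
    }
}
```

Both buffers here are ordinary heap objects, so they are reclaimed by the garbage collector even if {{close()}} is never called.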

h2. Possible Fix - 2
Assuming that we want to keep the buffers as direct byte buffers, we can add 
a new constructor to {{CryptoOutputStream}} that takes a boolean flag 
{{ownOutputStream}} to indicate whether the underlying stream is owned by 
{{CryptoOutputStream}}.  If it is true, then calling the {{close()}} method will 
close the underlying stream.  Otherwise, when {{close()}} is called, only the 
direct byte buffers will be freed and the underlying stream will not be closed.

The scope of changes for this fix will be somewhat wider.  We need to modify 
{{MapTask.java}}, {{CryptoUtils.java}}, and {{CryptoFSDataOutputStream.java}} 
as well to pass the ownership flag mentioned above.
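
The ownership flag could look roughly like the sketch below.  {{OwnershipAwareStream}} is a hypothetical stand-in written against the JDK only, not the real {{CryptoOutputStream}}; it merely drops its buffer references on {{close()}}, whereas real code would release the direct memory explicitly.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/** Hypothetical stand-in for CryptoOutputStream; not the Hadoop class. */
class OwnershipAwareStream extends FilterOutputStream {
    private final boolean ownOutputStream;
    private ByteBuffer inBuffer = ByteBuffer.allocateDirect(128 * 1024);
    private ByteBuffer outBuffer = ByteBuffer.allocateDirect(128 * 1024);
    private boolean closed;

    OwnershipAwareStream(OutputStream out, boolean ownOutputStream) {
        super(out);
        this.ownOutputStream = ownOutputStream;
    }

    /** True once close() has dropped both buffers. */
    boolean buffersReleased() {
        return inBuffer == null && outBuffer == null;
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        // The sketch only drops the references; real code would free
        // the direct memory explicitly at this point.
        inBuffer = null;
        outBuffer = null;
        if (ownOutputStream) {
            super.close(); // flushes and closes the underlying stream
        } else {
            flush();       // leave the underlying stream open for the next logical stream
        }
    }
}
```

With the flag set to false, a caller such as {{MapTask}} could close each per-partition logical stream to release its buffers while continuing to write to the shared physical stream.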

I can post a patch for either of the above.  I welcome any other ideas from 
developers to fix this issue.






[jira] [Created] (MAPREDUCE-6629) Cluster used capacity is > 100 when container reserved

2016-02-07 Thread Brahma Reddy Battula (JIRA)
Brahma Reddy Battula created MAPREDUCE-6629:
---

 Summary: Cluster used capacity is > 100 when container reserved 
 Key: MAPREDUCE-6629
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6629
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Brahma Reddy Battula
Assignee: Brahma Reddy Battula


 *Scenario:* 

* Start a cluster with three NMs, each having 8 GB (cluster memory: 24 GB).
* Configure queues with elasticity and userlimitfactor=10.
* Disable preemption.
* Run two jobs with different priorities in different queues at the same time:
** yarn jar hadoop-mapreduce-examples-2.7.2.jar pi -Dyarn.app.priority=LOW 
-Dmapreduce.job.queuename=QueueA -Dmapreduce.map.memory.mb=4096 
-Dyarn.app.mapreduce.am.resource.mb=1536 
-Dmapreduce.job.reduce.slowstart.completedmaps=1.0 10 1
** yarn jar hadoop-mapreduce-examples-2.7.2.jar pi -Dyarn.app.priority=HIGH 
-Dmapreduce.job.queuename=QueueB -Dmapreduce.map.memory.mb=4096 
-Dyarn.app.mapreduce.am.resource.mb=1536 3 1

* Observe the used cluster capacity in the RM web UI.








[jira] [Commented] (MAPREDUCE-6579) Test failure : TestNetworkedJob

2016-02-07 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136606#comment-15136606
 ] 

Rohith Sharma K S commented on MAPREDUCE-6579:
--

While looking into this JIRA, I had an offline sync-up with [~Naganarasimha]. 
Before YARN-3946, {{JobStatus#getFailureInfo}} returned an empty string 
while the job was running and a valid diagnostic message only when the job had 
failed.  Now the behavior has changed: the API returns a diagnostic message 
even while the job is running.  The API's name no longer matches what it 
returns, and the returned value has changed. 

I think we should keep the old behavior for the MR API.  Otherwise, clients 
like Tez that use this API would be affected.  Any thoughts? 

As a side note, if the fix is not in the test case, then the JIRA summary can 
be updated to reflect the real issue.



> Test failure : TestNetworkedJob
> ---
>
> Key: MAPREDUCE-6579
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6579
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
> Attachments: MAPREDUCE-6579.01.patch, MAPREDUCE-6579.02.patch, 
> MAPREDUCE-6579.03.patch, MAPREDUCE-6579.04.patch
>
>
> From 
> [https://builds.apache.org/job/PreCommit-YARN-Build/9976/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-jobclient-jdk1.8.0_66.txt]
>  TestNetworkedJob fails intermittently.
> {code}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 81.131 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 30.55 sec  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[[Tue Dec 15 14:02:45 + 2015] 
> Application is Activated, waiting for resources to be assigned for AM.  
> Details : AM Partition =  ; Partition Resource = 
>  ; Queue's Absolute capacity = 100.0 % ; Queue's 
> Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; ]> 
> but was:<[]>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:174)
> {code}





[jira] [Commented] (MAPREDUCE-6579) Test failure : TestNetworkedJob

2016-02-07 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136611#comment-15136611
 ] 

Naganarasimha G R commented on MAPREDUCE-6579:
--

Thanks [~rohithsharma] for sharing your thoughts.  One approach I can think of 
is to check the status of the job on the MR client side: only if it has failed 
do we fetch and return the diagnostic message from YARN; otherwise we return an 
empty string.  This will ensure that compatibility is not broken if any client 
application checks the failure message without checking the job status.
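
The client-side check described above might be sketched as follows.  The {{FailureInfoSketch}} class, its {{State}} enum, and the {{failureInfo()}} method are hypothetical stand-ins and not the actual MR client API; the diagnostics are passed as a {{Supplier}} so that they are fetched only when the job has failed.

```java
import java.util.function.Supplier;

/** Hypothetical illustration of the proposed client-side check; not Hadoop code. */
final class FailureInfoSketch {
    /** Hypothetical job states standing in for the MR client's job status. */
    enum State { RUNNING, SUCCEEDED, FAILED, KILLED }

    /**
     * Returns the diagnostics only for a failed job and an empty string
     * otherwise, which mirrors the old behavior described in this thread.
     */
    static String failureInfo(State state, Supplier<String> yarnDiagnostics) {
        if (state != State.FAILED) {
            return ""; // diagnostics are never fetched for non-failed jobs
        }
        String diagnostics = yarnDiagnostics.get();
        return diagnostics == null ? "" : diagnostics;
    }
}
```

Whether a killed job should also return its diagnostics is left open here; the sketch follows the comment literally and treats only the failed state specially.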







[jira] [Commented] (MAPREDUCE-6579) Test failure : TestNetworkedJob

2016-02-07 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136615#comment-15136615
 ] 

Rohith Sharma K S commented on MAPREDUCE-6579:
--

bq. As a side note, if the fix is not in the test case, then the JIRA summary 
can be updated to reflect the real issue.
I meant that if my thoughts are agreed upon, then the summary needs to be changed. 



