date:20140224


 [ 
https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-1702:


Attachment: apache-yarn-1702.5.patch

 Expose kill app functionality as part of RM web services
 

 Key: YARN-1702
 URL: https://issues.apache.org/jira/browse/YARN-1702
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, 
 apache-yarn-1702.4.patch, apache-yarn-1702.5.patch


 Expose functionality to kill an app via the ResourceManager web services API.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (YARN-1702) Expose kill app functionality as part of RM web services


 [ 
https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-1702:


Attachment: (was: apache-yarn-1702.5.patch)

 Expose kill app functionality as part of RM web services
 

 Key: YARN-1702
 URL: https://issues.apache.org/jira/browse/YARN-1702
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, 
 apache-yarn-1702.4.patch, apache-yarn-1702.5.patch


 Expose functionality to kill an app via the ResourceManager web services API.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (YARN-1702) Expose kill app functionality as part of RM web services


[ 
https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910095#comment-13910095
 ] 

Hadoop QA commented on YARN-1702:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12630626/apache-yarn-1702.5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3162//console

This message is automatically generated.

 Expose kill app functionality as part of RM web services
 

 Key: YARN-1702
 URL: https://issues.apache.org/jira/browse/YARN-1702
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, 
 apache-yarn-1702.4.patch, apache-yarn-1702.5.patch


 Expose functionality to kill an app via the ResourceManager web services API.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (YARN-1702) Expose kill app functionality as part of RM web services


 [ 
https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-1702:


Attachment: apache-yarn-1702.5.patch

 Expose kill app functionality as part of RM web services
 

 Key: YARN-1702
 URL: https://issues.apache.org/jira/browse/YARN-1702
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, 
 apache-yarn-1702.4.patch, apache-yarn-1702.5.patch


 Expose functionality to kill an app via the ResourceManager web services API.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (YARN-1702) Expose kill app functionality as part of RM web services


 [ 
https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-1702:


Attachment: (was: apache-yarn-1702.5.patch)

 Expose kill app functionality as part of RM web services
 

 Key: YARN-1702
 URL: https://issues.apache.org/jira/browse/YARN-1702
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, 
 apache-yarn-1702.4.patch, apache-yarn-1702.5.patch


 Expose functionality to kill an app via the ResourceManager web services API.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (YARN-1702) Expose kill app functionality as part of RM web services


[ 
https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910142#comment-13910142
 ] 

Hadoop QA commented on YARN-1702:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12630633/apache-yarn-1702.5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3163//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3163//console

This message is automatically generated.

 Expose kill app functionality as part of RM web services
 

 Key: YARN-1702
 URL: https://issues.apache.org/jira/browse/YARN-1702
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, 
 apache-yarn-1702.4.patch, apache-yarn-1702.5.patch


 Expose functionality to kill an app via the ResourceManager web services API.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (YARN-1686) NodeManager.resyncWithRM() does not handle exception which cause NodeManger to Hang.

2014-02-24 Thread Rohith (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-1686:
-

Attachment: YARN-1686.2.patch

Thank you vinod for your reviewing patch.

I have updated the patch addressing all your comments. Please review new patch.

Jian He, tx for motivation.:-)

 NodeManager.resyncWithRM() does not handle exception which cause NodeManger 
 to Hang.
 

 Key: YARN-1686
 URL: https://issues.apache.org/jira/browse/YARN-1686
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Rohith
Assignee: Rohith
 Fix For: 3.0.0

 Attachments: YARN-1686.1.patch, YARN-1686.2.patch


 During start of NodeManager,if registration with resourcemanager throw 
 exception then nodemager shutdown happens. 
 Consider case where NM-1 is registered with RM. RM issued Resync to NM. If 
 any exception thrown in resyncWithRM (starts new thread which does not 
 handle exception) during RESYNC evet, then this thread is lost. NodeManger 
 enters hanged state. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (YARN-1686) NodeManager.resyncWithRM() does not handle exception which cause NodeManger to Hang.


[ 
https://issues.apache.org/jira/browse/YARN-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910219#comment-13910219
 ] 

Hadoop QA commented on YARN-1686:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12630643/YARN-1686.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3164//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3164//console

This message is automatically generated.

 NodeManager.resyncWithRM() does not handle exception which cause NodeManger 
 to Hang.
 

 Key: YARN-1686
 URL: https://issues.apache.org/jira/browse/YARN-1686
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Rohith
Assignee: Rohith
 Fix For: 3.0.0

 Attachments: YARN-1686.1.patch, YARN-1686.2.patch


 During start of NodeManager,if registration with resourcemanager throw 
 exception then nodemager shutdown happens. 
 Consider case where NM-1 is registered with RM. RM issued Resync to NM. If 
 any exception thrown in resyncWithRM (starts new thread which does not 
 handle exception) during RESYNC evet, then this thread is lost. NodeManger 
 enters hanged state. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.

2014-02-24 Thread Jason Lowe (JIRA)

[
https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910411#comment-13910411
]

Jason Lowe commented on YARN-221:
-

bq. We can have RM AM wait for notification as in container exit - NM notifies
RM - RM notifies AM. That will create some delay for AM to declare the job is
done. With the NM - RM heartbeat value used in big clusters, it could add
couple seconds delay for the job. That might not be a big deal for regular MR
jobs.

The NM does out-of-band heartbeats when containers exit, so the turnaround time
can be shorter than a full NM heartbeat interval.

If we're really concerned about any additional time added for graceful task
exit we can also have the AM unregister when the job succeeds/fails but before
all tasks exit, and eventually the RM will kill all containers of the
application when the AM eventually exits (or times out waiting). In that sense
it would not add any time from the job client's perspective, as the job could
report completion at the same time it did before. However it would add some
time from the YARN perspective, as the application is lingering on the cluster
a few extra seconds in the FINISHING state than it did before.

bq. One thing to add we need the definition and policy on how to handle those
tasks that are in the finishing state and MR AM ends up stopping them as they
don't exit by themselves.

I don't think we need to get too tricky here. The NM will see the container
return a non-zero exit code and assume that's failure. If tasks are succeeding
but returning non-zero exit codes then that's probably a bug and arguably a
good thing we're grabbing the logs to show what went wrong when it tried to
tear down. IMHO we should fix what's causing the non-zero exit code rather
than try to add a mechanism to prevent logs from being aggregated in what
should be a rare and abnormal case.

NM should provide a way for AM to tell it not to aggregate logs.

Key: YARN-221
URL: https://issues.apache.org/jira/browse/YARN-221
Project: Hadoop YARN
Issue Type: Sub-task
Components: nodemanager
Reporter: Robert Joseph Evans
Assignee: Chris Trezzo
Attachments: YARN-221-trunk-v1.patch

The NodeManager should provide a way for an AM to tell it that either the
logs should not be aggregated, that they should be aggregated with a high
priority, or that they should be aggregated but with a lower priority. The
AM should be able to do this in the ContainerLaunch context to provide a
default value, but should also be able to update the value when the container
is released.
This would allow for the NM to not aggregate logs in some cases, and avoid
connection to the NN at all.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (YARN-1336) Work-preserving nodemanager restart

2014-02-24 Thread Jason Lowe (JIRA)

[
https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jason Lowe updated YARN-1336:
-

Attachment: YARN-1336-rollup.patch

Attaching a rollup patch for the prototype that [~raviprak] and I developed.
This recovers resource localization state, applications and containers, tokens,
log aggregation, deletion service, and the MR shuffle auxiliary service. A
quick high-level overview:

- Restart functionality is enabled by configuring
yarn.nodemanager.recovery.enabled to true and yarn.nodemanager.recovery.dir to
a directory on the local filesystem where the state will be stored.
- Containers are launched with an additional shell layer which places the exit
code of the container in an .exitcode file. This allows the restarted NM
instance to recover containers that are already running or have exited since
the last NM instance.
- NMStateStoreService is the abstraction layer for the state store.
NMNullStateStoreService is used when recovery is disabled and
NMLevelDBStateStoreService is used when it is enabled.
- Rather than explicitly record localized resource reference counts, resources
are recovered with no references and recovered containers re-request their
resources as during a normal container lifecycle to restore the reference
counts.

Some things that are still missing:
- ability to distinguish shutdown for restart vs. decommission
- proper handling of state store errors
- adding unit tests
- adding formal documentation.

Feedback is greatly appreciated. I'll be working on addressing the missing
items and splitting the patch into smaller pieces across the appropriate
subtasks to simplify reviews.

Work-preserving nodemanager restart
---

Key: YARN-1336
URL: https://issues.apache.org/jira/browse/YARN-1336
Project: Hadoop YARN
Issue Type: New Feature
Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Attachments: YARN-1336-rollup.patch

This serves as an umbrella ticket for tasks related to work-preserving
nodemanager restart.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Assigned] (YARN-1336) Work-preserving nodemanager restart

2014-02-24 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-1336:


Assignee: Jason Lowe

 Work-preserving nodemanager restart
 ---

 Key: YARN-1336
 URL: https://issues.apache.org/jira/browse/YARN-1336
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1336-rollup.patch


 This serves as an umbrella ticket for tasks related to work-preserving 
 nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-02-24 Thread Robert Kanter (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910615#comment-13910615
 ] 

Robert Kanter commented on YARN-1490:
-

By the way, the issue I mentioned a few comments 
[up|https://issues.apache.org/jira/browse/YARN-1490?focusedCommentId=13895329page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13895329]
 is actually now fixed by YARN-1689.  

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Fix For: 2.4.0

 Attachments: YARN-1490.1.patch, YARN-1490.10.patch, 
 YARN-1490.11.patch, YARN-1490.11.patch, YARN-1490.12.patch, 
 YARN-1490.2.patch, YARN-1490.3.patch, YARN-1490.4.patch, YARN-1490.5.patch, 
 YARN-1490.6.patch, YARN-1490.7.patch, YARN-1490.8.patch, YARN-1490.9.patch, 
 org.apache.oozie.service.TestRecoveryService_thread-dump.txt


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (YARN-1730) Leveldb timeline store needs simple write locking

2014-02-24 Thread Billie Rinaldi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910612#comment-13910612
 ] 

Billie Rinaldi commented on YARN-1730:
--

I don't think using hold count will be sufficient.  The hold count only returns 
the number of holds that have been obtained by the current thread.  So as soon 
as the current thread is done with the lock, it would drop the lock from the 
lock map, which is not what we want.

 Leveldb timeline store needs simple write locking
 -

 Key: YARN-1730
 URL: https://issues.apache.org/jira/browse/YARN-1730
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Attachments: YARN-1730.1.patch, YARN-1730.2.patch


 The actual data writes are performed atomically in a batch, but a lock should 
 be held while identifying a start time for the entity, which precedes every 
 write.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Created] (YARN-1755) Add support for web services to the WebApp proxy

Varun Vasudev created YARN-1755:
---

 Summary: Add support for web services to the WebApp proxy
 Key: YARN-1755
 URL: https://issues.apache.org/jira/browse/YARN-1755
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev


The RM currently has an inbuilt web proxy that is used to serve requests. The 
web proxy is necessary for security reasons which are described on the Apache 
Hadoop website 
(http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html).
  The web application proxy is a part of YARN and can be configured to run as a 
standalone proxy. Currently, the RM itself supports web services. Adding 
support for all the web service calls in the web app proxy allows it to support 
failover and retry for all web services. The changes involved are the following 
–
a.  Add support for web service calls to the RM web application proxy and 
have it make the equivalent RPC calls.
b.  Add support for failover and retry to the web application proxy. We can 
refactor a lot of the existing client code from the Yarn client.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (YARN-1686) NodeManager.resyncWithRM() does not handle exception which cause NodeManger to Hang.


 [ 
https://issues.apache.org/jira/browse/YARN-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1686:
--

Attachment: YARN-1686.3.patch

Same patch as before but with a test time-out. Will check it in once Jenkins 
says okay..

 NodeManager.resyncWithRM() does not handle exception which cause NodeManger 
 to Hang.
 

 Key: YARN-1686
 URL: https://issues.apache.org/jira/browse/YARN-1686
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Rohith
Assignee: Rohith
 Attachments: YARN-1686.1.patch, YARN-1686.2.patch, YARN-1686.3.patch


 During start of NodeManager,if registration with resourcemanager throw 
 exception then nodemager shutdown happens. 
 Consider case where NM-1 is registered with RM. RM issued Resync to NM. If 
 any exception thrown in resyncWithRM (starts new thread which does not 
 handle exception) during RESYNC evet, then this thread is lost. NodeManger 
 enters hanged state. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (YARN-986) YARN should use cluster-id as token service address


[ 
https://issues.apache.org/jira/browse/YARN-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910690#comment-13910690
 ] 

Vinod Kumar Vavilapalli commented on YARN-986:
--

Couldn't find time last week, will look at it today..

 YARN should use cluster-id as token service address
 ---

 Key: YARN-986
 URL: https://issues.apache.org/jira/browse/YARN-986
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-986-1.patch, yarn-986-prelim-0.patch


 This needs to be done to support non-ip based fail over of RM. Once the 
 server sets the token service address to be this generic ClusterId/ServiceId, 
 clients can translate it to appropriate final IP and then be able to select 
 tokens via TokenSelectors.
 Some workarounds for other related issues were put in place at YARN-945.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (YARN-1754) Container process is not really killed

2014-02-24 Thread Gera Shegalov (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910693#comment-13910693
 ] 

Gera Shegalov commented on YARN-1754:
-

Get https://github.com/jerrykuch/ersatz-setsid and make sure that setsid is on 
your standard PATH.

 Container process is not really killed
 --

 Key: YARN-1754
 URL: https://issues.apache.org/jira/browse/YARN-1754
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.2.0
 Environment: Mac
Reporter: Jeff Zhang

 I test the following distributed shell example on my mac:
 hadoop jar 
 share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar 
 -appname shell -jar 
 share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar 
 -shell_command=sleep -shell_args=10 -num_containers=1
 And it will start 2 process for one container, one is the shell process, 
 another is the real command I execute ( here is sleep 10). 
 And then I kill this application by running command yarn application -kill 
 app_id
 it will kill the shell process, but won't kill the real command process. The 
 reason is that yarn use kill command to kill process, but it won't kill its 
 child process. use pkill could resolve this issue.
 IMHO, it is a very important case which will make the resource usage 
 inconsistency, and have potential security problem. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits


[ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910697#comment-13910697
 ] 

Vinod Kumar Vavilapalli commented on YARN-1490:
---

Thanks for the update [~rkanter].

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Fix For: 2.4.0

 Attachments: YARN-1490.1.patch, YARN-1490.10.patch, 
 YARN-1490.11.patch, YARN-1490.11.patch, YARN-1490.12.patch, 
 YARN-1490.2.patch, YARN-1490.3.patch, YARN-1490.4.patch, YARN-1490.5.patch, 
 YARN-1490.6.patch, YARN-1490.7.patch, YARN-1490.8.patch, YARN-1490.9.patch, 
 org.apache.oozie.service.TestRecoveryService_thread-dump.txt


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (YARN-1741) XInclude support broken for YARN ResourceManager

2014-02-24 Thread Eric Sirianni (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910703#comment-13910703
 ] 

Eric Sirianni commented on YARN-1741:
-

Yes - This was the approach I was planning on investigating with a potential 
patch.  The trick is how to most cleanly get that to work with the 
{{ConfigurationProvider}} API.  Two main approaches seem possible:
# Change {{ConfigurationProvider.getConfigurationInputStream()}} to return a 
{{(String, InputStream)}} pair.
# Change {{ConfigurationProvider}} to provide directly into the 
{{Configuration}} object itself.  Something like 
{{ConfigurationProvider.provideTo(Configuration conf)}}.  With this approach, 
the different {{ConfigurationProvider}} subclasses could invoke the specific 
{{conf.addResource()}} overload that made sense for the subclass.

Based on investigating the usages of 
{{ConfigurationProvider.getConfigurationInputStream()}}, I was leaning towards 
the 2nd approach.

 XInclude support broken for YARN ResourceManager
 

 Key: YARN-1741
 URL: https://issues.apache.org/jira/browse/YARN-1741
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Eric Sirianni
Priority: Minor
  Labels: regression

 The XInclude support in Hadoop configuration files (introduced via 
 HADOOP-4944) was broken by the recent {{ConfigurationProvider}} changes to 
 YARN ResourceManager.  Specifically, YARN-1459 and, more generally, the 
 YARN-1611 family of JIRAs for ResourceManager HA.
 The issue is that {{ConfigurationProvider}} provides a raw {{InputStream}} as 
 a {{Configuration}} resource for what was previously a {{Path}}-based 
 resource.  
 For {{Path}} resources, the absolute file path is used as the {{systemId}} 
 for the {{DocumentBuilder.parse()}} call:
 {code}
   } else if (resource instanceof Path) {  // a file resource
 ...
   doc = parse(builder, new BufferedInputStream(
   new FileInputStream(file)), ((Path)resource).toString());
 }
 {code}
 The {{systemId}} is used to resolve XIncludes (among other things):
 {code}
 /**
  * Parse the content of the given codeInputStream/code as an
  * XML document and return a new DOM Document object.
 ...
  * @param systemId Provide a base for resolving relative URIs.
 ...
  */
 public Document parse(InputStream is, String systemId)
 {code}
 However, for loading raw {{InputStream}} resources, the {{systemId}} is set 
 to {{null}}:
 {code}
   } else if (resource instanceof InputStream) {
 doc = parse(builder, (InputStream) resource, null);
 {code}
 causing XInclude resolution to fail.
 In our particular environment, we make extensive use of XIncludes to 
 standardize common configuration parameters across multiple Hadoop clusters.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (YARN-1740) Redirection from AM-URL is broken with HTTPS_ONLY policy


 [ 
https://issues.apache.org/jira/browse/YARN-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1740:
--

Issue Type: Sub-task  (was: Bug)
Parent: YARN-1280

 Redirection from AM-URL is broken with HTTPS_ONLY policy
 

 Key: YARN-1740
 URL: https://issues.apache.org/jira/browse/YARN-1740
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Yesha Vora
Assignee: Jian He
 Attachments: YARN-1740.1.patch


 Steps to reproduce:
 1) Run a sleep job
 2) Run: yarn application -list command to find AM URL.
 root@host1:~# yarn application -list
 Total number of applications (application-types: [] and states: SUBMITTED, 
 ACCEPTED, RUNNING):1
 Application-Id Application-Name Application-Type User Queue State Final-State 
 Progress Tracking-URL
 application_1383251398986_0003 Sleep job MAPREDUCE hdfs default RUNNING 
 UNDEFINED 5% http://host1:40653
 3) Try to access http://host1:40653/ws/v1/mapreduce/info; url.
 This URL redirects to 
 http://RM_host:RM_https_port/proxy/application_1383251398986_0003/ws/v1/mapreduce/info
 Here, Http protocol is used with HTTPS port for RM.
 The expected Url is 
 https://RM_host:RM_https_port/proxy/application_1383251398986_0003/ws/v1/mapreduce/info



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC

2014-02-24 Thread Gera Shegalov (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-1515:


Attachment: YARN-1515.v05.patch

v05 adds auto thread dump for stuck AM's as well.

 Ability to dump the container threads and stop the containers in a single RPC
 -

 Key: YARN-1515
 URL: https://issues.apache.org/jira/browse/YARN-1515
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, 
 YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch


 This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for 
 timed-out task attempts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (YARN-1686) NodeManager.resyncWithRM() does not handle exception which cause NodeManger to Hang.