[jira] [Updated] (YARN-6606) The implementation of LocalizationStatus in ContainerStatusProto

2018-05-07 Thread SammiChen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SammiChen updated YARN-6606:

Fix Version/s: (was: 2.9.1)

> The implementation of LocalizationStatus in ContainerStatusProto
> 
>
> Key: YARN-6606
> URL: https://issues.apache.org/jira/browse/YARN-6606
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Bingxue Qiu
>Priority: Major
> Attachments: YARN-6606.1.patch, YARN-6606.2.patch
>
>
> We have a use case that requires the full implementation of the localization 
> status in ContainerStatusProto, as described in 
> [Continuous-resource-localization|https://issues.apache.org/jira/secure/attachment/12825041/Continuous-resource-localization.pdf], 
> so we have implemented it.
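
For readers unfamiliar with the proposal, here is a minimal sketch of the kind of per-resource record a localization status could carry. The field names are hypothetical and are not the actual ContainerStatusProto/LocalizationStatus definitions; the sketch only illustrates reporting each resource's localization state alongside the container status.

{code:java}
// Sketch only: hypothetical fields, not the actual ContainerStatusProto definition.
public class LocalizationStatusSketch {
  enum State { PENDING, COMPLETED, FAILED }

  private final String resourceKey;   // e.g. the localized resource's link name
  private final State state;
  private final String diagnostics;   // failure reason, empty if none

  public LocalizationStatusSketch(String resourceKey, State state, String diagnostics) {
    this.resourceKey = resourceKey;
    this.state = state;
    this.diagnostics = diagnostics;
  }

  @Override
  public String toString() {
    return resourceKey + "=" + state
        + (diagnostics.isEmpty() ? "" : " (" + diagnostics + ")");
  }
}
{code}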






[jira] [Updated] (YARN-6661) Too many CLEANUP events hang the ApplicationMasterLauncher thread pool

2018-05-07 Thread SammiChen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SammiChen updated YARN-6661:

Fix Version/s: (was: 2.9.1)

> Too many CLEANUP events hang the ApplicationMasterLauncher thread pool
> -
>
> Key: YARN-6661
> URL: https://issues.apache.org/jira/browse/YARN-6661
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
> Environment: hadoop 2.7.2 
>Reporter: JackZhou
>Priority: Major
>
> Someone else has already reported a similar problem and fixed it; see 
> https://issues.apache.org/jira/browse/YARN-3809 for details.
> However, I don't think that fix solves the problem completely. Below is the 
> problem I encountered:
> My Hadoop cluster has about 1000 nodes, and I submitted about 1800 applications.
> I failed over the active RM, so the RM had to recover all 1800 applications.
> When an application is recovered, the RM waits for the AM container to register 
> itself. My AM has a bug (introduced intentionally) and never registers.
> So the RM waits about 10 minutes for the AM to expire and then sends a CLEANUP 
> event to the ApplicationMasterLauncher thread pool. With about 1800 applications, 
> this blocks the ApplicationMasterLauncher thread pool for a long time. I have 
> already applied the patch 
> (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), 
> so a CLEANUP event blocks a thread for 10 * 20 = 200s. But with 1800 applications, 
> each thread is blocked for 1800 / 50 * 200s = 7200s, i.e. about two hours.
> Because the AM has not registered within 10 minutes, the RM retries and creates 
> a new application attempt.
> The new attempt is allocated a container by the RM and sends a LAUNCH event to 
> the ApplicationMasterLauncher thread pool.
> Since the 1800 CLEANUP events keep all 50 threads in the pool busy for about two 
> hours, the attempt cannot start its AM container within 10 minutes.
> It then expires and sends a CLEANUP event to the ApplicationMasterLauncher 
> thread pool as well.
> As you can see, none of my applications can actually run.
> Each of them has 5 application attempts as follows, and each of them keeps 
> retrying.
> appattempt_1495786030132_4000_05
> appattempt_1495786030132_4000_04
> appattempt_1495786030132_4000_03
> appattempt_1495786030132_4000_02
> appattempt_1495786030132_4000_01
> So all of my applications hang for several hours and none of them can actually 
> run. I think this is a bug. We could treat CLEANUP and LAUNCH as different 
> events and use a separate thread (or some other mechanism) to handle LAUNCH 
> events.
> Apologies for my English; I hope the description is clear.
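
The separation proposed above can be illustrated with a minimal Java sketch. The event types and handler wiring are hypothetical, not the real ApplicationMasterLauncher code; the point is that with one executor per event type, a backlog of slow CLEANUP calls can no longer starve LAUNCH events.

{code:java}
// Sketch only: hypothetical types, not the real ApplicationMasterLauncher API.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SplitLauncherPool {
  // Separate pools so slow CLEANUPs cannot delay LAUNCH events.
  private final ExecutorService launchPool  = Executors.newFixedThreadPool(50);
  private final ExecutorService cleanupPool = Executors.newFixedThreadPool(50);

  enum EventType { LAUNCH, CLEANUP }

  public void handle(EventType type, Runnable work) {
    switch (type) {
      case LAUNCH:
        launchPool.submit(work);   // AM launches stay responsive
        break;
      case CLEANUP:
        cleanupPool.submit(work);  // long RPC timeouts are isolated here
        break;
    }
  }
}
{code}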






[jira] [Updated] (YARN-6645) Bug fix in ContainerImpl when calling the symLink of LinuxContainerExecutor

2018-05-07 Thread SammiChen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SammiChen updated YARN-6645:

Fix Version/s: (was: 2.9.1)

> Bug fix in ContainerImpl when calling the symLink of LinuxContainerExecutor
> ---
>
> Key: YARN-6645
> URL: https://issues.apache.org/jira/browse/YARN-6645
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Bingxue Qiu
>Priority: Major
> Attachments: error when creating symlink.png
>
>
> When creating a symlink after a resource has been localized in our clusters, an 
> IOException is thrown because the nmPrivateDir doesn't exist. We are adding a 
> patch to fix it.
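
A minimal sketch of the kind of guard such a patch might add, using illustrative paths rather than the actual ContainerImpl/LinuxContainerExecutor code: ensure the private directory exists before creating the symlink.

{code:java}
// Sketch only: illustrative paths, not the actual NodeManager symlink code.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SymlinkWithParentCheck {
  public static void createSymlink(Path nmPrivateDir, String linkName, Path target)
      throws IOException {
    // Create the parent directory first so the symlink call cannot fail
    // with "No such file or directory".
    if (!Files.exists(nmPrivateDir)) {
      Files.createDirectories(nmPrivateDir);
    }
    Files.createSymbolicLink(nmPrivateDir.resolve(linkName), target);
  }

  public static void main(String[] args) throws IOException {
    createSymlink(Paths.get("/tmp/nm-private"), "resource.jar",
        Paths.get("/tmp/filecache/10/resource.jar"));
  }
}
{code}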






[jira] [Updated] (YARN-7601) Incorrect container states recovered as LevelDB uses alphabetical order

2018-04-02 Thread SammiChen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SammiChen updated YARN-7601:

Target Version/s: 2.9.2  (was: 2.9.1)

> Incorrect container states recovered as LevelDB uses alphabetical order
> ---
>
> Key: YARN-7601
> URL: https://issues.apache.org/jira/browse/YARN-7601
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sampada Dehankar
>Assignee: Sampada Dehankar
>Priority: Major
> Attachments: YARN-7601.001.patch, YARN-7601.002.patch
>
>
> LevelDB stores key-value pairs in alphabetical key order. The container id 
> concatenated with its state is used as the key. So, no matter which states a 
> container goes through during its life cycle, the state records retrieved from 
> LevelDB are always returned in the following order:
> LAUNCHED
> PAUSED
> QUEUED
> For example, if a container is LAUNCHED, then PAUSED, then LAUNCHED again, the 
> recovered container state is currently PAUSED instead of LAUNCHED.
> We propose to store a timestamp as the value when calling
>   storeContainerLaunched
>   storeContainerPaused
>   storeContainerQueued
> so that the correct container state is recovered based on the timestamps.
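
A minimal sketch of the proposed timestamp-based recovery, assuming a hypothetical (state -> timestamp) layout read back for one container; the real NodeManager state-store key format and record types may differ.

{code:java}
// Sketch only: hypothetical key/value layout, not the actual NM state store code.
import java.util.LinkedHashMap;
import java.util.Map;

public class RecoverLatestState {
  /**
   * Given the (state -> timestamp) records read back for one container,
   * return the state with the largest timestamp, i.e. the last transition,
   * regardless of the alphabetical order LevelDB returned them in.
   */
  public static String recoverState(Map<String, Long> stateToTimestamp) {
    String latestState = null;
    long latestTs = Long.MIN_VALUE;
    for (Map.Entry<String, Long> e : stateToTimestamp.entrySet()) {
      if (e.getValue() > latestTs) {
        latestTs = e.getValue();
        latestState = e.getKey();
      }
    }
    return latestState;
  }

  public static void main(String[] args) {
    Map<String, Long> records = new LinkedHashMap<>();
    records.put("LAUNCHED", 1512720000500L); // second LAUNCH, latest transition
    records.put("PAUSED",   1512720000200L);
    System.out.println(recoverState(records)); // prints LAUNCHED
  }
}
{code}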






[jira] [Updated] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager

2018-04-02 Thread SammiChen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SammiChen updated YARN-4858:

Target Version/s: 2.9.2  (was: 2.9.1)

> start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
> ---
>
> Key: YARN-4858
> URL: https://issues.apache.org/jira/browse/YARN-4858
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scripts
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>  Labels: oct16-easy
> Attachments: YARN-4858-001.patch, YARN-4858-branch-2.001.patch
>
>
> The start-yarn and stop-yarn scripts don't have any (even commented-out) 
> support for the timeline server and sharedcachemanager.
> Proposed:
> * bash and cmd start-yarn scripts have commented-out start actions
> * stop-yarn scripts stop the servers.






[jira] [Updated] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor

2018-04-02 Thread SammiChen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SammiChen updated YARN-7652:

Target Version/s: 2.9.2  (was: 2.9.1)

> Handle AM register requests asynchronously in FederationInterceptor
> ---
>
> Key: YARN-7652
> URL: https://issues.apache.org/jira/browse/YARN-7652
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Subru Krishnan
>Assignee: Botong Huang
>Priority: Major
>
> We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in 
> {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has 
> outdated info about a _SubCluster_. This is because we handle AM register 
> requests synchronously. This jira proposes moving to asynchronous handling, 
> similar to how allocate invocations are handled.
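
A minimal sketch of what asynchronous register handling could look like, using a hypothetical registerWithSubCluster call; the actual FederationInterceptor/AMRMProxy APIs differ, and this only shows moving the blocking call off the caller's thread.

{code:java}
// Sketch only: registerWithSubCluster() is hypothetical, not the real FederationInterceptor API.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncRegisterSketch {
  private final ExecutorService pool = Executors.newCachedThreadPool();

  /** Pretend RPC to one sub-cluster; may hang if the StateStore info is stale. */
  private void registerWithSubCluster(String subClusterId) {
    // ... blocking call to the sub-cluster RM would go here ...
  }

  /** Kick off registrations without blocking the AM's register request. */
  public List<CompletableFuture<Void>> registerAsync(List<String> subClusters) {
    return subClusters.stream()
        .map(sc -> CompletableFuture.runAsync(() -> registerWithSubCluster(sc), pool))
        .collect(java.util.stream.Collectors.toList());
  }
}
{code}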






[jira] [Updated] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.

2018-04-02 Thread SammiChen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SammiChen updated YARN-7450:

Target Version/s: 2.9.2  (was: 2.9.1)

> ATS Client should retry on intermittent Kerberos issues.
> 
>
> Key: YARN-7450
> URL: https://issues.apache.org/jira/browse/YARN-7450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 2.7.3
> Environment: Hadoop-2.7.3
>Reporter: Ravi Prakash
>Priority: Major
>
> We saw a stack trace (posted in the first comment) in the ResourceManager 
> logs showing that the TimelineClientImpl was not able to re-login from the 
> keytab.
> I'm guessing an intermittent issue caused the Kerberos re-login from the keytab 
> to fail. I'm also assuming this was *not* retried, because I only saw one 
> instance of this stack trace. I propose that this operation should be retried.
> It seems this caused events at the ResourceManager to queue up until it 
> eventually stopped responding to even basic {{yarn application -list}} commands.
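
A minimal sketch of the bounded retry with backoff being proposed; the operation passed in stands for whatever Kerberos/ATS call fails intermittently, and none of this is the actual TimelineClientImpl code.

{code:java}
// Sketch only: retryWithBackoff() is illustrative, not the real TimelineClientImpl logic.
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryOnTransientFailure {
  public static <T> T retryWithBackoff(Callable<T> op, int maxAttempts,
      long initialBackoffMs) throws Exception {
    Exception last = null;
    long backoff = initialBackoffMs;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (IOException e) {          // treat IO failures as transient
        last = e;
        Thread.sleep(backoff);           // back off before the next attempt
        backoff *= 2;                    // exponential backoff
      }
    }
    throw last;                          // give up after maxAttempts
  }
}
{code}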






[jira] [Updated] (YARN-6918) Remove acls after queue delete to avoid memory leak

2018-04-02 Thread SammiChen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SammiChen updated YARN-6918:

Target Version/s: 3.0.2, 2.9.2  (was: 2.9.1, 3.0.2)

> Remove acls after queue delete to avoid memory leak
> ---
>
> Key: YARN-6918
> URL: https://issues.apache.org/jira/browse/YARN-6918
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Major
> Attachments: YARN-6918.001.patch, YARN-6918.002.patch
>
>
> ACLs for a deleted queue need to be removed from allAcls to avoid a memory leak 
> (Priority,YarnAuthorizer).
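
A minimal sketch of the cleanup being proposed, assuming a hypothetical allAcls map keyed by queue name; the real scheduler/authorizer structures referenced above differ.

{code:java}
// Sketch only: a hypothetical allAcls map, not the actual scheduler/authorizer fields.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class QueueAclCleanup {
  // queue name -> ACL entries for that queue
  private final Map<String, List<String>> allAcls = new ConcurrentHashMap<>();

  public void setQueueAcls(String queue, List<String> acls) {
    allAcls.put(queue, acls);
  }

  /** Called when a queue is deleted; without this, the entry is never reclaimed. */
  public void removeQueueAcls(String queue) {
    allAcls.remove(queue);
  }
}
{code}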






[jira] [Updated] (YARN-7649) RMContainer state transition exception after container update

2018-04-02 Thread SammiChen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SammiChen updated YARN-7649:

Target Version/s: 3.0.2, 2.9.2  (was: 2.9.1, 3.0.2)

> RMContainer state transition exception after container update
> -
>
> Key: YARN-7649
> URL: https://issues.apache.org/jira/browse/YARN-7649
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.0
>Reporter: Weiwei Yang
>Assignee: Arun Suresh
>Priority: Major
>
> I've seen this in a cluster deployment as well as in a UT: running 
> {{TestAMRMClient#testAMRMClientWithContainerPromotion}} reproduces it. It 
> doesn't fail the test case, but the following error message shows up in the 
> log:
> {noformat}
> 2017-12-13 19:41:31,817 ERROR rmcontainer.RMContainerImpl 
> (RMContainerImpl.java:handle(480)) - Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> RELEASED at ALLOCATED
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:478)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:675)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1586)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:155)
>   at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>   at java.lang.Thread.run(Thread.java:748)
> 2017-12-13 19:41:31,817 ERROR rmcontainer.RMContainerImpl 
> (RMContainerImpl.java:handle(481)) - Invalid event RELEASED on container 
> container_1513165290804_0001_01_03
> {noformat}
> This seems to be related to YARN-6251.






[jira] [Commented] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor

2018-03-05 Thread SammiChen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387355#comment-16387355
 ] 

SammiChen commented on YARN-7652:
-

Hi [~botong], is this still on target for 2.9.1? If not, can we push it out from 
2.9.1 to the next release? 

> Handle AM register requests asynchronously in FederationInterceptor
> ---
>
> Key: YARN-7652
> URL: https://issues.apache.org/jira/browse/YARN-7652
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Subru Krishnan
>Assignee: Botong Huang
>Priority: Major
>
> We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in 
> {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has 
> outdated info about a _SubCluster_. This is because we handle AM register 
> requests synchronously. This jira proposes moving to asynchronous handling, 
> similar to how allocate invocations are handled.






[jira] [Commented] (YARN-7601) Incorrect container states recovered as LevelDB uses alphabetical order

2018-03-05 Thread SammiChen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387353#comment-16387353
 ] 

SammiChen commented on YARN-7601:
-

Hi [~sampada15], is this still on target for 2.9.1? If not, can we push it out 
from 2.9.1 to the next release? 

> Incorrect container states recovered as LevelDB uses alphabetical order
> ---
>
> Key: YARN-7601
> URL: https://issues.apache.org/jira/browse/YARN-7601
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sampada Dehankar
>Assignee: Sampada Dehankar
>Priority: Major
> Attachments: YARN-7601.001.patch, YARN-7601.002.patch
>
>
> LevelDB stores key-value pairs in alphabetical key order. The container id 
> concatenated with its state is used as the key. So, no matter which states a 
> container goes through during its life cycle, the state records retrieved from 
> LevelDB are always returned in the following order:
> LAUNCHED
> PAUSED
> QUEUED
> For example, if a container is LAUNCHED, then PAUSED, then LAUNCHED again, the 
> recovered container state is currently PAUSED instead of LAUNCHED.
> We propose to store a timestamp as the value when calling
>   storeContainerLaunched
>   storeContainerPaused
>   storeContainerQueued
> so that the correct container state is recovered based on the timestamps.






[jira] [Commented] (YARN-7511) NPE in ContainerLocalizer when localization failed for running container

2018-03-01 Thread SammiChen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383107#comment-16383107
 ] 

SammiChen commented on YARN-7511:
-

Hi [~Tao Yang], is this still on target for 2.9.1? If not, can we push it out 
from 2.9.1 to the next release? 

> NPE in ContainerLocalizer when localization failed for running container
> 
>
> Key: YARN-7511
> URL: https://issues.apache.org/jira/browse/YARN-7511
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha4, 2.9.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-7511.001.patch
>
>
> Error log:
> {noformat}
> 2017-09-30 20:14:32,839 FATAL [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>         at 
> java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1106)
>         at 
> java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1097)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.resourceLocalizationFailed(ResourceSet.java:151)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:821)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:813)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1335)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:95)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1372)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1365)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>         at java.lang.Thread.run(Thread.java:834)
> 2017-09-30 20:14:32,842 INFO [AsyncDispatcher ShutDown handler] 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
> {noformat}
> To reproduce this problem:
> 1. A container was running and ContainerManagerImpl#localize was called for 
> this container.
> 2. Localization failed in ResourceLocalizationService$LocalizerRunner#run, which 
> sent out a ContainerResourceFailedEvent with a null LocalResourceRequest.
> 3. NPE in ResourceLocalizationFailedWhileRunningTransition#transition --> 
> container.resourceSet.resourceLocalizationFailed(null).
> I think we can fix this problem by ensuring that the request is not null 
> before removing it.
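
A minimal sketch of the null guard proposed above, with an illustrative resource map standing in for the real ResourceSet internals:

{code:java}
// Sketch only: pendingResources is illustrative, not the real ResourceSet fields.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ResourceSetSketch {
  private final Map<String, String> pendingResources = new ConcurrentHashMap<>();

  /**
   * ConcurrentHashMap.remove(null) throws NullPointerException, so skip the
   * removal when the failed request is null (e.g. a localization failure that
   * is not tied to a specific resource request).
   */
  public void resourceLocalizationFailed(String request) {
    if (request == null) {
      return;
    }
    pendingResources.remove(request);
  }
}
{code}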






[jira] [Commented] (YARN-7736) Fix itemization in YARN federation document

2018-03-01 Thread SammiChen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383103#comment-16383103
 ] 

SammiChen commented on YARN-7736:
-

Hi [~ajisakaa], is this still on target for 2.9.1? If not, can we push it out 
from 2.9.1 to the next release? 

> Fix itemization in YARN federation document
> ---
>
> Key: YARN-7736
> URL: https://issues.apache.org/jira/browse/YARN-7736
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Akira Ajisaka
>Priority: Minor
>  Labels: newbie
>
> https://hadoop.apache.org/docs/r3.0.0/hadoop-yarn/hadoop-yarn-site/Federation.html
> {noformat}
> Assumptions:
> * We assume reasonably good connectivity across sub-clusters (e.g., we are 
> not looking to federate across DC yet, though future investigations of this 
> are not excluded).
> * We rely on HDFS federation (or equivalently scalable DFS solutions) to take 
> care of scalability of the store side.
> {noformat}
> A blank line should be inserted before the itemization so that it renders 
> correctly.






[jira] [Commented] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager

2018-03-01 Thread SammiChen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383040#comment-16383040
 ] 

SammiChen commented on YARN-4858:
-

Hi [~ste...@apache.org], is this still targeted for 2.9.1? If not, can we push 
it out to the next release, 2.9.2? 

> start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
> ---
>
> Key: YARN-4858
> URL: https://issues.apache.org/jira/browse/YARN-4858
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scripts
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>  Labels: oct16-easy
> Attachments: YARN-4858-001.patch, YARN-4858-branch-2.001.patch
>
>
> The start-yarn and stop-yarn scripts don't have any (even commented-out) 
> support for the timeline server and sharedcachemanager.
> Proposed:
> * bash and cmd start-yarn scripts have commented-out start actions
> * stop-yarn scripts stop the servers.






[jira] [Commented] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.

2018-03-01 Thread SammiChen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383031#comment-16383031
 ] 

SammiChen commented on YARN-7450:
-

Hi [~raviprak], is this still targeted for 2.9.1? If not, can we push it out 
to the next release, 2.9.2? 

> ATS Client should retry on intermittent Kerberos issues.
> 
>
> Key: YARN-7450
> URL: https://issues.apache.org/jira/browse/YARN-7450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 2.7.3
> Environment: Hadoop-2.7.3
>Reporter: Ravi Prakash
>Priority: Major
>
> We saw a stack trace (posted in the first comment) in the ResourceManager 
> logs showing that the TimelineClientImpl was not able to re-login from the 
> keytab.
> I'm guessing an intermittent issue caused the Kerberos re-login from the keytab 
> to fail. I'm also assuming this was *not* retried, because I only saw one 
> instance of this stack trace. I propose that this operation should be retried.
> It seems this caused events at the ResourceManager to queue up until it 
> eventually stopped responding to even basic {{yarn application -list}} commands.


