[jira] [Updated] (YARN-9873) Mutation API Config Change updates Version Number

2019-10-04 Thread Sunil Govindan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9873:
-
Summary: Mutation API Config Change updates Version Number   (was: Version 
Number for each Scheduler Config Change)

> Mutation API Config Change updates Version Number 
> --
>
> Key: YARN-9873
> URL: https://issues.apache.org/jira/browse/YARN-9873
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-9873-001.patch, YARN-9873-002.patch
>
>
> Version Number support for each Scheduler Config Change. This also helps to 
> know when the last change happened.






[jira] [Commented] (YARN-9772) CapacitySchedulerQueueManager has incorrect list of queues

2019-09-19 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933593#comment-16933593
 ] 

Sunil Govindan commented on YARN-9772:
--

Thanks. I think the point made by [~tarunparimi] is valid. A customer who 
already has a queue setup will run into issues during this. We need to come up 
with some way to smooth that part. YARN-9766 was removing some checks, hence I 
had my reservations. [~tarunparimi], let's fix this cleanly. At the same time, 
let's also give customers a cleaner upgrade path, even if that needs some 
tooling or scripts.

 

> CapacitySchedulerQueueManager has incorrect list of queues
> --
>
> Key: YARN-9772
> URL: https://issues.apache.org/jira/browse/YARN-9772
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>
> CapacitySchedulerQueueManager has an incorrect list of queues when there is 
> more than one parent queue (say, at a middle level) with the same name.
> For example,
>  * root
>  ** a
>  *** b
>  **** c
>  *** d
>  **** b
>  * e
> {{CapacitySchedulerQueueManager#getQueues}} maintains this list of queues. 
> While parsing "root.a.d.b", it overrides "root.a.b" with the new Queue object 
> in the map because the short names are identical. After parsing all the 
> queues, the map count should be 7, but it is 6. Any reference to queue 
> "root.a.b" in the code path actually resolves to the "root.a.d.b" object. 
> Since {{CapacitySchedulerQueueManager#getQueues}} is used in multiple places, 
> we will need to understand the implications in detail. For example, 
> {{CapacityScheduler#getQueue}} is used in many places, and it in turn uses 
> {{CapacitySchedulerQueueManager#getQueues}}. cc [~eepayne], [~sunilg]
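
For illustration, a minimal runnable sketch of the failure mode described above 
(hypothetical names, not the actual CapacitySchedulerQueueManager code), 
assuming the map is effectively keyed by short queue name:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class DuplicateQueueNameDemo {
  public static void main(String[] args) {
    // Hypothetical stand-in for CapacitySchedulerQueueManager#getQueues,
    // keyed by short queue name rather than full path.
    Map<String, String> queues = new HashMap<>();
    String[] paths = {"root", "root.a", "root.a.b", "root.a.b.c",
                      "root.a.d", "root.a.d.b", "root.e"};
    for (String path : paths) {
      String shortName = path.substring(path.lastIndexOf('.') + 1);
      queues.put(shortName, path); // "root.a.d.b" silently replaces "root.a.b"
    }
    System.out.println(queues.size());   // 6, although 7 queues were parsed
    System.out.println(queues.get("b")); // root.a.d.b -- root.a.b is lost
  }
}
{code}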






[jira] [Commented] (YARN-9617) RM UI enables viewing pages using Timeline Reader for a user who can not access the YARN config endpoint

2019-09-19 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933375#comment-16933375
 ] 

Sunil Govindan commented on YARN-9617:
--

[~akhilpb] could you please add a validation for this case?

> RM UI enables viewing pages using Timeline Reader for a user who can not 
> access the YARN config endpoint
> 
>
> Key: YARN-9617
> URL: https://issues.apache.org/jira/browse/YARN-9617
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.1
>Reporter: Balázs Szabó
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> If a user cannot access the /conf endpoint, she/he will be unable to query 
> the address of the Timeline Service Reader 
> (yarn.timeline-service.reader.webapp.address). The user then receives a "403 
> Unauthenticated users are not authorized to access this page" response when 
> trying to view pages that request data from the Timeline Reader (i.e. the 
> Flow Activity tab). The UI falls back to the default address 
> (localhost:8188), which eventually yields the 401 response (see attached 
> screenshots).
>  
> !1.png!






[jira] [Commented] (YARN-9766) YARN CapacityScheduler QueueMetrics has missing metrics for parent queues having same name

2019-09-19 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933372#comment-16933372
 ] 

Sunil Govindan commented on YARN-9766:
--

[~tarunparimi] [~Prabhu Joseph]

I went through the fix. I think it is a short-term fix, and it may have impacts.

This should be fixed in a cleaner way. I would like to enforce this validation 
if such a configuration doesn't make sense. Let's fix this at the 
queue-creation level itself.
 # As I see it, we are trying to overcome the duplicated-name issue by not 
looking at the old entry.
 # This has happened because the value got overridden.

Hence we should either disallow such a config, or reimplement the map with a 
much better data structure.

 

cc [~eepayne] [~leftnoteasy] [~cheersyang]

> YARN CapacityScheduler QueueMetrics has missing metrics for parent queues 
> having same name
> --
>
> Key: YARN-9766
> URL: https://issues.apache.org/jira/browse/YARN-9766
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9766.001.patch
>
>
> In the Capacity Scheduler, we enforce that leaf queues have unique names, but 
> that is not the case for parent queues. For example, we can have the below 
> queue hierarchy, where "b" is the queue name for two different queue paths, 
> root.a.b and root.a.d.b. Since it is not a leaf queue, this configuration 
> works and apps run fine in the leaf queues 'c' and 'e'.
>  * root
>  ** a
>  *** b
>  **** c
>  *** d
>  **** b
>  * e
> But the JMX metrics do not show the metrics for the parent queue 
> "root.a.d.b"; we can see metrics only for the "root.a.b" queue.
>  
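
The fix direction is analogous for the metrics side. A minimal sketch 
(hypothetical, far simpler than the real QueueMetrics registry) contrasting 
keying by short queue name with keying by full path:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class MetricsKeyDemo {
  public static void main(String[] args) {
    Map<String, Object> byShortName = new HashMap<>();
    Map<String, Object> byFullPath = new HashMap<>();
    for (String path : new String[]{"root.a.b", "root.a.d.b"}) {
      String shortName = path.substring(path.lastIndexOf('.') + 1);
      byShortName.put(shortName, new Object()); // second "b" replaces the first
      byFullPath.put(path, new Object());       // both parents keep their metrics
    }
    System.out.println(byShortName.size()); // 1 -> one parent's metrics missing
    System.out.println(byFullPath.size());  // 2
  }
}
{code}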






[jira] [Commented] (YARN-9814) JobHistoryServer can't delete aggregated files, if remote app root directory is created by NodeManager

2019-09-17 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932026#comment-16932026
 ] 

Sunil Govindan commented on YARN-9814:
--

I have committed this to trunk. Thanks, [~adam.antal]!

If this is needed for branch-3.2 or 3.1, you need to rebase the patch. For now, 
I am resolving the jira. Please re-open if a backport to other branches is 
needed. Thanks.

> JobHistoryServer can't delete aggregated files, if remote app root directory 
> is created by NodeManager
> --
>
> Key: YARN-9814
> URL: https://issues.apache.org/jira/browse/YARN-9814
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, yarn
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
> Attachments: YARN-9814.001.patch, YARN-9814.002.patch, 
> YARN-9814.003.patch, YARN-9814.004.patch, YARN-9814.005.patch
>
>
> If the remote-app-log-dir is not created before starting YARN processes, the 
> NodeManager creates it during the init of the AppLogAggregator service. In a 
> custom system, the primary group of the yarn user (which starts the NM/RM 
> daemons) is not hadoop, but is set to a more restricted group (say yarn). If 
> the NodeManager creates the folder, it derives the group of the folder from 
> the primary group of the login user (which is yarn:yarn in this case), thus 
> setting the root log folder and all its subfolders to the yarn group, 
> ultimately making it inaccessible to other processes - e.g. the 
> JobHistoryServer's AggregatedLogDeletionService.
> I suggest making this group configurable. If this new configuration is not 
> set, then we can still stick to the existing behaviour. 
> Creating the root app-log-dir each time during the setup of this system is a 
> bit error-prone, and an end user can easily forget it. I think the best place 
> to put this step is the LogAggregationService, which was already responsible 
> for creating the folder.






[jira] [Commented] (YARN-9814) JobHistoryServer can't delete aggregated files, if remote app root directory is created by NodeManager

2019-09-17 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932022#comment-16932022
 ] 

Sunil Govindan commented on YARN-9814:
--

+1 Committing shortly

> JobHistoryServer can't delete aggregated files, if remote app root directory 
> is created by NodeManager
> --
>
> Key: YARN-9814
> URL: https://issues.apache.org/jira/browse/YARN-9814
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, yarn
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
> Attachments: YARN-9814.001.patch, YARN-9814.002.patch, 
> YARN-9814.003.patch, YARN-9814.004.patch, YARN-9814.005.patch
>
>
> If the remote-app-log-dir is not created before starting YARN processes, the 
> NodeManager creates it during the init of the AppLogAggregator service. In a 
> custom system, the primary group of the yarn user (which starts the NM/RM 
> daemons) is not hadoop, but is set to a more restricted group (say yarn). If 
> the NodeManager creates the folder, it derives the group of the folder from 
> the primary group of the login user (which is yarn:yarn in this case), thus 
> setting the root log folder and all its subfolders to the yarn group, 
> ultimately making it inaccessible to other processes - e.g. the 
> JobHistoryServer's AggregatedLogDeletionService.
> I suggest making this group configurable. If this new configuration is not 
> set, then we can still stick to the existing behaviour. 
> Creating the root app-log-dir each time during the setup of this system is a 
> bit error-prone, and an end user can easily forget it. I think the best place 
> to put this step is the LogAggregationService, which was already responsible 
> for creating the folder.






[jira] [Updated] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2019-09-17 Thread Sunil Govindan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9833:
-
Fix Version/s: 3.1.4
   3.2.2

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clear {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.
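
The view-vs-copy distinction is easy to reproduce in isolation. A minimal 
sketch (JDK-only; the patch itself uses Guava's {{ImmutableList.copyOf()}}, but 
{{List.copyOf()}} behaves the same way here):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ViewVsCopyDemo {
  public static void main(String[] args) {
    List<String> localDirs = new ArrayList<>(List.of("/data/1", "/data/2"));

    // A view reflects later changes to the backing list.
    List<String> view = Collections.unmodifiableList(localDirs);
    // A copy is a snapshot, immune to a later clear().
    List<String> copy = List.copyOf(localDirs);

    localDirs.clear(); // what checkDirs() does under its write lock

    System.out.println(view.size()); // 0 -- the caller suddenly sees no dirs
    System.out.println(copy.size()); // 2 -- the snapshot is stable
  }
}
{code}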






[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2019-09-17 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931493#comment-16931493
 ] 

Sunil Govindan commented on YARN-9833:
--

+1. Committing this now.

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clear {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.






[jira] [Commented] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metri

2019-09-17 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931192#comment-16931192
 ] 

Sunil Govindan commented on YARN-9838:
--

[~jiulongZhu] Thanks for reporting this issue.

A few general nits:

1. Please keep the Jira open, and click the "Patch Available" button once you 
are ready with a patch.

2. Please rename the patch to YARN-9838.0001.patch or similar so the naming 
convention is consistent; Jenkins will then automatically run the test cases.

Coming to the patch: some improvements were made in YARN-5932. Could you please 
check whether they solve the issues you mentioned?

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Fix For: 2.7.3
>
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> bug_fix_capacityScheduler_moveApplication.patch
>
>
>       In some of our clusters, we are seeing "Used Resource", "Used 
> Capacity", "Absolute Used Capacity" and "Num Container" become positive or 
> negative when the queue is absolutely idle (no RUNNING, no NEW apps...). In 
> extreme cases, apps couldn't be submitted to a queue that is actually idle 
> but whose "Used Resource" is far more than zero, just like a "Container Leak".
>       Firstly, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue. AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the state values of "numContainer" and 
> "Used". Secondly, by comparing the logic that changes numContainer, 
> ResourceUsageByLabel and QueueMetrics (#allocateContainer and 
> #releaseContainer) for applications with and without "movetoqueue", I found 
> that moving the reservedContainers didn't modify the "numContainer" value in 
> AbstractCSQueue or the "used" value in ResourceUsage when the application was 
> moved from one queue to another.
>         The metric value changes when a reservedContainer is allocated, moved 
> from the $FROM queue to the $TO queue, and released are not conservative: the 
> Resource is allocated in the $FROM queue and released in the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|$FROM queue stays the same, $TO queue stays the same|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|$FROM queue stays the same, $TO queue stays the same|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in $TO queue|decrease in $TO queue|
>       The metric value changes when an allocatedContainer (allocated, 
> acquired, running) is allocated, moved with movetoqueue, and released are 
> absolutely conservative.
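
To see why the "conservative" (balanced) accounting matters, a toy model 
(invented names and numbers, not the CapacityScheduler code) of the 
release/allocate pair the reporter argues is missing for reserved containers 
during movetoqueue:

{code:java}
public class MoveAccountingDemo {
  static final class Queue {
    long usedMB;
    int numContainers;
    void allocate(long mb) { usedMB += mb; numContainers++; }
    void release(long mb)  { usedMB -= mb; numContainers--; }
  }

  public static void main(String[] args) {
    Queue from = new Queue(), to = new Queue();
    long reservedMB = 4096;
    from.allocate(reservedMB); // reservation is accounted in the $FROM queue

    // movetoqueue: without this symmetric pair, $FROM keeps phantom usage
    // forever and $TO goes negative once the container is released there.
    from.release(reservedMB);
    to.allocate(reservedMB);

    to.release(reservedMB); // container finally released in the $TO queue
    System.out.println(from.usedMB + " " + to.usedMB);               // 0 0
    System.out.println(from.numContainers + " " + to.numContainers); // 0 0
  }
}
{code}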






[jira] [Commented] (YARN-9814) JobHistoryServer can't delete aggregated files, if remote app root directory is created by NodeManager

2019-09-16 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930281#comment-16930281
 ] 

Sunil Govindan commented on YARN-9814:
--

Thanks [~adam.antal].

This approach looks fine to me.

A couple of minor comments:
 # Please rename remote-app-log-dir.group => remote-app-log-dir.groupname or 
group-name. I want it to be explicit what the group means; "group" alone 
carries a bit too little information.
 # For the newly added LOG.debug, please put it under an 
if (LOG.isDebugEnabled()) check.
 # Is it possible to test that when a custom group is not set, it takes the 
default one? If such a test already exists, please point me to it.

Thanks

> JobHistoryServer can't delete aggregated files, if remote app root directory 
> is created by NodeManager
> --
>
> Key: YARN-9814
> URL: https://issues.apache.org/jira/browse/YARN-9814
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, yarn
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
> Attachments: YARN-9814.001.patch, YARN-9814.002.patch, 
> YARN-9814.003.patch, YARN-9814.004.patch
>
>
> If the remote-app-log-dir is not created before starting YARN processes, the 
> NodeManager creates it during the init of the AppLogAggregator service. In a 
> custom system, the primary group of the yarn user (which starts the NM/RM 
> daemons) is not hadoop, but is set to a more restricted group (say yarn). If 
> the NodeManager creates the folder, it derives the group of the folder from 
> the primary group of the login user (which is yarn:yarn in this case), thus 
> setting the root log folder and all its subfolders to the yarn group, 
> ultimately making it inaccessible to other processes - e.g. the 
> JobHistoryServer's AggregatedLogDeletionService.
> I suggest making this group configurable. If this new configuration is not 
> set, then we can still stick to the existing behaviour. 
> Creating the root app-log-dir each time during the setup of this system is a 
> bit error-prone, and an end user can easily forget it. I think the best place 
> to put this step is the LogAggregationService, which was already responsible 
> for creating the folder.
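
A minimal sketch of the proposed behaviour (the property key is hypothetical, 
following the rename suggested in the review above; the patch's exact name and 
wiring may differ), using only FileSystem calls that exist in the Hadoop API:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteLogDirGroupSketch {
  // Hypothetical key, for illustration only.
  static final String GROUP_KEY = "yarn.nodemanager.remote-app-log-dir.groupname";

  static void createRemoteRootLogDir(Configuration conf, Path rootLogDir)
      throws IOException {
    FileSystem fs = rootLogDir.getFileSystem(conf);
    fs.mkdirs(rootLogDir); // group defaults to the login user's primary group
    String group = conf.get(GROUP_KEY);
    if (group != null) {
      // Override the group so other processes (e.g. the JobHistoryServer's
      // AggregatedLogDeletionService) can still access and delete logs.
      fs.setOwner(rootLogDir, null, group);
    }
    // If the property is not set, the existing behaviour is kept.
  }
}
{code}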






[jira] [Commented] (YARN-9674) Max AM Resource calculation is wrong

2019-09-11 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927303#comment-16927303
 ] 

Sunil Govindan commented on YARN-9674:
--

Across partitions, it should be the same behaviour. Looks like a bug to me.

cc [~Prabhu Joseph]

> Max AM Resource calculation is wrong
> 
>
> Key: YARN-9674
> URL: https://issues.apache.org/jira/browse/YARN-9674
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.2
>Reporter: ANANDA G B
>Priority: Major
> Attachments: RM_Issue.png
>
>
> 'Max AM Resource' is calculated for the default partition using 'Effective 
> Max Capacity', while for other partitions it is calculated using 'Effective 
> Capacity'.
> Which one is the correct implementation?
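
To make the discrepancy concrete, a hypothetical example (all numbers 
invented): a 100 GB cluster, a queue with capacity 50% and maximum-capacity 
100%, and an AM resource percent of 0.1:

{noformat}
Default partition:  Max AM Resource = Effective Max Capacity x 0.1
                                    = 100 GB x 0.1 = 10 GB
Other partitions:   Max AM Resource = Effective Capacity x 0.1
                                    = 50 GB x 0.1 = 5 GB
{noformat}

The same queue thus advertises twice the AM headroom in the default partition, 
which is why the two code paths should agree.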






[jira] [Commented] (YARN-9813) RM does not start on JDK11 when UIv2 is enabled

2019-09-06 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924052#comment-16924052
 ] 

Sunil Govindan commented on YARN-9813:
--

Pending Jenkins.

> RM does not start on JDK11 when UIv2 is enabled
> ---
>
> Key: YARN-9813
> URL: https://issues.apache.org/jira/browse/YARN-9813
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Critical
> Attachments: YARN-9813.001.patch
>
>
> When starting a ResourceManager on JDK11 with UIv2 enabled, RM startup fails 
> with the following message:
> {noformat}
> Error starting ResourceManager
> java.lang.ClassCastException: class 
> jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to class 
> java.net.URLClassLoader (jdk.internal.loader.ClassLoaders$AppClassLoader and 
> java.net.URLClassLoader are in module java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:1190)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1333)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1531)
> {noformat}
> It is a known issue that the systemClassLoader is no longer a URLClassLoader 
> from JDK9 onwards (see the related UT failure: YARN-9512). 
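
The failing cast is easy to reproduce outside the RM on JDK 9+. A minimal 
sketch, including one common workaround (building an explicit URLClassLoader 
instead of casting the system loader):

{code:java}
import java.net.URL;
import java.net.URLClassLoader;

public class SystemClassLoaderDemo {
  public static void main(String[] args) {
    ClassLoader system = ClassLoader.getSystemClassLoader();

    // Throws ClassCastException on JDK 9+, because the system class loader
    // is jdk.internal.loader.ClassLoaders$AppClassLoader there:
    // URLClassLoader ucl = (URLClassLoader) system;

    // Workaround: wrap instead of cast, adding any extra URLs explicitly.
    URLClassLoader ucl = new URLClassLoader(new URL[0], system);
    System.out.println(ucl.getParent() == system); // true
  }
}
{code}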






[jira] [Commented] (YARN-9813) RM does not start on JDK11 when UIv2 is enabled

2019-09-06 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924041#comment-16924041
 ] 

Sunil Govindan commented on YARN-9813:
--

Thanks [~adam.antal].

I'll test this and let you know whether it works.

> RM does not start on JDK11 when UIv2 is enabled
> ---
>
> Key: YARN-9813
> URL: https://issues.apache.org/jira/browse/YARN-9813
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Critical
> Attachments: YARN-9813.001.patch
>
>
> When starting a ResourceManager on JDK11 with UIv2 enabled, RM startup fails 
> with the following message:
> {noformat}
> Error starting ResourceManager
> java.lang.ClassCastException: class 
> jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to class 
> java.net.URLClassLoader (jdk.internal.loader.ClassLoaders$AppClassLoader and 
> java.net.URLClassLoader are in module java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:1190)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1333)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1531)
> {noformat}
> It is a known issue that the systemClassLoader is no longer a URLClassLoader 
> from JDK9 onwards (see the related UT failure: YARN-9512). 






[jira] [Commented] (YARN-7055) YARN Timeline Service v.2: beta 1 / GA

2019-09-05 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923857#comment-16923857
 ] 

Sunil Govindan commented on YARN-7055:
--

Kudos! Thanks to all the contributors who helped in ATSv2. (y)

> YARN Timeline Service v.2: beta 1 / GA
> --
>
> Key: YARN-7055
> URL: https://issues.apache.org/jira/browse/YARN-7055
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: timelineclient, timelinereader, timelineserver
>Reporter: Vrushali C
>Priority: Major
> Fix For: 3.2.1
>
> Attachments: TSv2 next steps.pdf
>
>
> This is an umbrella JIRA for the beta 1 milestone for YARN Timeline Service 
> v.2.
> YARN-2928 was alpha1, YARN-5355 was alpha2. 






[jira] [Commented] (YARN-9804) Update ATSv2 document for latest feature supports

2019-09-04 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922320#comment-16922320
 ] 

Sunil Govindan commented on YARN-9804:
--

+1. Thanks [~rohithsharma]

> Update ATSv2 document for latest feature supports
> -
>
> Key: YARN-9804
> URL: https://issues.apache.org/jira/browse/YARN-9804
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Blocker
> Attachments: YARN-9804.01.patch, YARN-9804.02.patch
>
>
> Revisit the ATSv2 documents and update them for the GA features, and also for 
> the roadmap.






[jira] [Commented] (YARN-9797) LeafQueue#activateApplications should use resourceCalculator#fitsIn

2019-09-02 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921139#comment-16921139
 ] 

Sunil Govindan commented on YARN-9797:
--

+1.

> LeafQueue#activateApplications should use resourceCalculator#fitsIn
> ---
>
> Key: YARN-9797
> URL: https://issues.apache.org/jira/browse/YARN-9797
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9797-001.patch, YARN-9797-002.patch, 
> YARN-9797-003.patch, YARN-9797-004.patch, YARN-9797-005.patch
>
>
> The DominantResourceCalculator compare function checks lessThan on the 
> dominant resource.
> In the case of the AM limit, we should activate an application only when all 
> of the resource values are less than the AM limit.
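
A hypothetical example of why the dominant-share comparison is too permissive 
here (numbers invented): cluster = <100 GB, 10 vcores>, AM limit = 
<8 GB, 2 vcores>, AM usage being checked = <10 GB, 1 vcore>:

{noformat}
dominant share of usage = max(10/100, 1/10) = 0.10
dominant share of limit = max(8/100,  2/10) = 0.20
lessThanOrEqual (dominant shares): 0.10 <= 0.20 -> application is activated
fitsIn (every resource):           10 GB > 8 GB -> should not be activated
{noformat}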






[jira] [Commented] (YARN-9784) org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue is flaky

2019-09-02 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921127#comment-16921127
 ] 

Sunil Govindan commented on YARN-9784:
--

+1

I'll commit this shortly. Thanks [~kmarton]

> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue
>  is flaky
> ---
>
> Key: YARN-9784
> URL: https://issues.apache.org/jira/browse/YARN-9784
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.3.0
>Reporter: Julia Kinga Marton
>Assignee: Julia Kinga Marton
>Priority: Major
> Attachments: YARN-9784.001.patch
>
>
> There are some test cases in TestLeafQueue which are failing intermittently.
> From 100 runs, there were 16 failures. 
> Some failure examples are the following ones:
> {code:java}
> 2019-08-26 13:18:13 [ERROR] Errors: 
> 2019-08-26 13:18:13 [ERROR]   TestLeafQueue.setUp:144->setUpInternal:221 
> WrongTypeOfReturnValue 
> 2019-08-26 13:18:13 YarnConfigu...
> 2019-08-26 13:18:13 [ERROR]   TestLeafQueue.setUp:144->setUpInternal:221 
> WrongTypeOfReturnValue 
> 2019-08-26 13:18:13 YarnConfigu...
> 2019-08-26 13:18:13 [INFO] 
> 2019-08-26 13:18:13 [ERROR] Tests run: 36, Failures: 0, Errors: 2, Skipped: 0
> {code}
> {code:java}
> 2019-08-26 13:18:09 [ERROR] Failures: 
> 2019-08-26 13:18:09 [ERROR]   TestLeafQueue.testHeadroomWithMaxCap:1373 
> expected:<2048> but was:<0>
> 2019-08-26 13:18:09 [INFO] 
> 2019-08-26 13:18:09 [ERROR] Tests run: 36, Failures: 1, Errors: 0, Skipped: 0
> {code}
> {code:java}
> 2019-08-26 13:18:18 [ERROR] Errors: 
> 2019-08-26 13:18:18 [ERROR]   TestLeafQueue.setUp:144->setUpInternal:221 
> WrongTypeOfReturnValue 
> 2019-08-26 13:18:18 YarnConfigu...
> 2019-08-26 13:18:18 [ERROR]   TestLeafQueue.testHeadroomWithMaxCap:1307 ? 
> ClassCast org.apache.hadoop.yarn.c...
> 2019-08-26 13:18:18 [INFO] 
> 2019-08-26 13:18:18 [ERROR] Tests run: 36, Failures: 0, Errors: 2, Skipped: 0
> {code}
> {code:java}
> 2019-08-26 13:18:10 [ERROR] Failures: 
> 2019-08-26 13:18:10 [ERROR]   TestLeafQueue.testDRFUserLimits:847 Verify 
> user_0 got resources 
> 2019-08-26 13:18:10 [INFO] 
> 2019-08-26 13:18:10 [ERROR] Tests run: 36, Failures: 1, Errors: 0, Skipped: 0
> {code}






[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-09-02 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921103#comment-16921103
 ] 

Sunil Govindan commented on YARN-9785:
--

+1 from me as well. Let's get this in.

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.002.patch, 
> YARN-9785.003.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!






[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-29 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918445#comment-16918445
 ] 

Sunil Govindan commented on YARN-9785:
--

Thanks [~bibinchundatt]

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.002.patch, 
> YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!






[jira] [Comment Edited] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-29 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918310#comment-16918310
 ] 

Sunil Govindan edited comment on YARN-9785 at 8/29/19 6:05 AM:
---

Thanks [~bibinchundatt] for helping to clarify the issues. As far as we 
discussed, we are dealing with TWO separate issues here.
 # When a new resource type is added and no resources are configured for that 
resource type, we have 0 in the resource entry. This causes the comparison 
issues. With the wip patch, we can resolve this problem.
 # However, as you pointed out in the 3rd case in the above comment, the 
Dominant resource calculator will kick in, as CPU is higher on the RHS and 
Memory is higher on the LHS. But at the ratio level, the RHS has the upper 
hand, and CPU will become dominant here. This literally messes up the 
comparison of the AM limit in LeafQueue. As you mentioned, this can be fixed by 
changing to use the fitsIn method instead of the lessThanOrEquals method.

Could we create a new Jira to track issue #2 mentioned here for the AM limit, 
so that this current issue can track the GPU resource 0 issue alone?  cc 
[~leftnoteasy]

Thanks


was (Author: sunilg):
Thanks [~bibinchundatt] for helping to clarify the issues. As far as we 
discussed, we are dealing with TWO separate issues here.
 # When a new resource type is added and no resources are configured for that 
resource type, we have 0 in the resource entry. This causes the comparison 
issues. With the wip patch, we can resolve this problem.
 # However, as you pointed out in the 3rd case in the above comment, the 
Dominant resource calculator will kick in, as CPU is higher on the RHS and 
Memory is higher on the LHS. But at the ratio level, the RHS has the upper 
hand, and CPU will become dominant here. This literally messes up the 
comparison of the AM limit in LeafQueue. As you mentioned, this can be fixed by 
changing to use the fitsIn method instead of the lessThanOrEquals method.

Could we create a new Jira to track issue #2 mentioned here for the AM limit, 
so that this current issue can track the GPU resource 0 issue alone?

Thanks

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!






[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-29 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918310#comment-16918310
 ] 

Sunil Govindan commented on YARN-9785:
--

Thanks [~bibinchundatt] for helping to clarify the issues. As far as we 
discussed, we are dealing with TWO separate issues here.
 # When a new resource type is added and no resources are configured for that 
resource type, we have 0 in the resource entry. This causes the comparison 
issues. With the wip patch, we can resolve this problem.
 # However, as you pointed out in the 3rd case in the above comment, the 
Dominant resource calculator will kick in, as CPU is higher on the RHS and 
Memory is higher on the LHS. But at the ratio level, the RHS has the upper 
hand, and CPU will become dominant here. This literally messes up the 
comparison of the AM limit in LeafQueue. As you mentioned, this can be fixed by 
changing to use the fitsIn method instead of the lessThanOrEquals method.

Could we create a new Jira to track issue #2 mentioned here for the AM limit, 
so that this current issue can track the GPU resource 0 issue alone?

Thanks
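
A hypothetical illustration of issue #1 (numbers invented): once yarn.io/gpu is 
declared but no node reports GPUs, the cluster resource is 
<100 GB, 10 vcores, 0 gpu>:

{noformat}
share for gpu = used_gpu / cluster_gpu = 0 / 0 (undefined)
If the zero-total resource is allowed to participate in (or dominate) the
comparison, two resources can compare as equal on gpu, and a check such as
"usage <= AM limit" may pass even though memory or vcores already exceed
the limit.
{noformat}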

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!






[jira] [Commented] (YARN-9785) Application gets activated even when AM memory has reached

2019-08-28 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917919#comment-16917919
 ] 

Sunil Govindan commented on YARN-9785:
--

Thanks [~bibinchundatt] for the detailed analysis.

I am checking scenario 3, which you mentioned here.

> Application gets activated even when AM memory has reached
> --
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!






[jira] [Commented] (YARN-9785) Application gets activated even when AM memory has reached

2019-08-28 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917628#comment-16917628
 ] 

Sunil Govindan commented on YARN-9785:
--

[~BilwaST] [~bibinchundatt]

We were trying a different approach, mainly so as not to impact any other code 
path.

I will test this and make some more changes. Could you please check this 
approach in general?

> Application gets activated even when AM memory has reached
> --
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!






[jira] [Updated] (YARN-9785) Application gets activated even when AM memory has reached

2019-08-28 Thread Sunil Govindan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9785:
-
Attachment: YARN-9785.wip.patch

> Application gets activated even when AM memory has reached
> --
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!






[jira] [Commented] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-08-21 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912313#comment-16912313
 ] 

Sunil Govindan commented on YARN-9642:
--

+1 on the latest patch. [~bibinchundatt], please feel free to get this in.

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> YARN-9642.003.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in the case of a fast 
> switchover too.
>  AbstractYarnScheduler should make sure the scheduled Timer is cancelled on 
> serviceStop.
> This causes a memory leak too:
> !image-2019-06-22-16-05-24-114.png!
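
A minimal sketch (hypothetical service, not the AbstractYarnScheduler code) of 
the fix direction: cancel the timer in serviceStop, so the task can neither run 
after a transition to standby nor keep the scheduler reachable:

{code:java}
import java.util.Timer;
import java.util.TimerTask;

public class TimerLifecycleSketch {
  private Timer updateTimer;

  public void serviceStart() {
    updateTimer = new Timer("scheduler-update", true /* daemon */);
    updateTimer.schedule(new TimerTask() {
      @Override public void run() {
        // periodic work, e.g. clearing the pending-container cache
      }
    }, 0, 1000);
  }

  public void serviceStop() {
    if (updateTimer != null) {
      // Without this, the scheduled task outlives the service and retains
      // a reference to it -- the leak shown in the attached heap dump.
      updateTimer.cancel();
      updateTimer = null;
    }
  }
}
{code}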






[jira] [Updated] (YARN-9758) Upgrade JQuery to latest version for YARN UI

2019-08-20 Thread Sunil Govindan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9758:
-
Summary: Upgrade JQuery to latest version for YARN UI  (was: Upgrade JQuery 
to latest version for YARN)

> Upgrade JQuery to latest version for YARN UI
> 
>
> Key: YARN-9758
> URL: https://issues.apache.org/jira/browse/YARN-9758
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Akhil PB
>Assignee: Akhil PB
>Priority: Major
> Attachments: YARN-9758.001.patch
>
>
>  
> cc: [~sunilg]






[jira] [Commented] (YARN-9758) Upgrade JQuery to latest version for YARN

2019-08-20 Thread Sunil Govindan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911327#comment-16911327
 ] 

Sunil Govindan commented on YARN-9758:
--

+1

Committing shortly

> Upgrade JQuery to latest version for YARN
> -
>
> Key: YARN-9758
> URL: https://issues.apache.org/jira/browse/YARN-9758
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Akhil PB
>Assignee: Akhil PB
>Priority: Major
> Attachments: YARN-9758.001.patch
>
>
>  
> cc: [~sunilg]






[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2019-08-17 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909635#comment-16909635
 ] 

Sunil Govindan commented on YARN-6492:
--

I agree with [~eepayne]'s point about splitting this. It will be clearer to fix 
the existing issues separately, and we can assess the impact, if any.

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.






[jira] [Commented] (YARN-2599) Standby RM should expose jmx endpoint

2019-08-17 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909634#comment-16909634
 ] 

Sunil Govindan commented on YARN-2599:
--

Thanks, folks. Committed to trunk.

> Standby RM should expose jmx endpoint
> -
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-2599.002.patch, YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 






[jira] [Updated] (YARN-2599) Standby RM should expose jmx endpoint

2019-08-17 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-2599:
-
Summary: Standby RM should expose jmx endpoint  (was: Standby RM should 
also expose some jmx and metrics)

> Standby RM should expose jmx endpoint
> -
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.002.patch, YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 






[jira] [Commented] (YARN-2599) Standby RM should also expose some jmx and metrics

2019-08-16 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909039#comment-16909039
 ] 

Sunil Govindan commented on YARN-2599:
--

Thanks [~cheersyang] for the review. Thanks [~rohithsharma] for the original 
patch; I just rebased it.

I will commit later today if there are no objections.

 

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.002.patch, YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 






[jira] [Updated] (YARN-2599) Standby RM should also expose some jmx and metrics

2019-08-16 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-2599:
-
Release Note: YARN /jmx URL endpoints will be accessible per ResourceManager 
process. Hence there will not be any redirection to the active ResourceManager 
while accessing the /jmx endpoints.

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.002.patch, YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 






[jira] [Commented] (YARN-2599) Standby RM should also expose some jmx and metrics

2019-08-16 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908779#comment-16908779
 ] 

Sunil Govindan commented on YARN-2599:
--

cc [~cheersyang]. A new patch is added.

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.002.patch, YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 






[jira] [Updated] (YARN-2599) Standby RM should also expose some jmx and metrics

2019-08-16 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-2599:
-
Attachment: YARN-2599.002.patch

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.002.patch, YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 






[jira] [Commented] (YARN-2599) Standby RM should also expose some jmx and metrics

2019-08-14 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907389#comment-16907389
 ] 

Sunil Govindan commented on YARN-2599:
--

Thanks [~bibinchundatt]

I do not think it's a compatibility issue. Fundamentally, the current behaviour 
is broken. In one of the old Jiras prior to HA, I could see that each process 
had its own jmx, so redirecting jmx was not correct. I am pulling this only to 
trunk and will add a note about the same.

I could not get to the metrics servlet. I'll dig in a bit. It seems very old, 
and I am not sure whether it has already been moved out or not.

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 






[jira] [Commented] (YARN-2599) Standby RM should also expose some jmx and metrics

2019-08-14 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907100#comment-16907100
 ] 

Sunil Govindan commented on YARN-2599:
--

Kicking this off again.

Having per-process jmx is always better for debugging any issues. If we 
redirect, it's tough to know what has happened in the standby.

[~rohithsharma]'s patch seems good to me. I will rebase it.

cc [~leftnoteasy] [~cheersyang] [~vinodkv]

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9106) Add option to graceful decommission to not wait for applications

2019-08-13 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906857#comment-16906857
 ] 

Sunil Govindan commented on YARN-9106:
--

cc [~leftnoteasy] [~tangzhankun]

> Add option to graceful decommission to not wait for applications
> 
>
> Key: YARN-9106
> URL: https://issues.apache.org/jira/browse/YARN-9106
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Mikayla Konst
>Assignee: Mikayla Konst
>Priority: Major
> Attachments: YARN-9106.patch
>
>
> Add property 
> yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-applications.
> If true (the default), the resource manager waits for all containers, as well 
> as all applications associated with those containers, to finish before 
> gracefully decommissioning a node.
> If false, the resource manager only waits for containers, but not 
> applications, to finish. For map-only jobs or other jobs in which mappers do 
> not need to serve shuffle data, this allows nodes to be decommissioned as 
> soon as their containers are finished as opposed to when the job is done.
> Add property 
> yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-app-masters.
> If false, during graceful decommission, when the resource manager waits for 
> all containers on a node to finish, it will not wait for app master 
> containers to finish. Defaults to true. This property should only be set to 
> false if app master failure is recoverable.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9106) Add option to graceful decommission to not wait for applications

2019-08-13 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906856#comment-16906856
 ] 

Sunil Govindan commented on YARN-9106:
--

wait-for-applications and wait-for-app-masters

Expecting the behaviour below (see the configuration sketch that follows):

1. wait-for-applications: the suggestion is to set this to TRUE by default. 
This means that even when the containers are done, the node still cannot be 
decommissioned, as some apps may still be running. This is true for MR; how 
about other apps, such as services, Tez or Spark? I think we need to consider 
why we need to hold a node for a longer time based on the type of 
containers/apps each node has run. 

2. wait-for-app-masters: this config will be helpful in order to force-kill AM 
containers and decommission a node faster. Thinking out loud, this is an 
aggressive config; however, it is turned off by default, hence I think it is 
fine to have it.
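
For illustration, a minimal sketch of setting the proposed properties (property 
names are taken from the description below; in practice they would be set in 
yarn-site.xml):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DecommissionConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Let nodes decommission once their containers finish, without
    // waiting for the owning applications (useful for map-only jobs).
    conf.setBoolean("yarn.resourcemanager.decommissioning-nodes-watcher"
        + ".wait-for-applications", false);
    // Keep waiting for AM containers: the safer default, since killing
    // an AM is only acceptable when AM failure is recoverable.
    conf.setBoolean("yarn.resourcemanager.decommissioning-nodes-watcher"
        + ".wait-for-app-masters", true);
    System.out.println(conf.get("yarn.resourcemanager"
        + ".decommissioning-nodes-watcher.wait-for-applications"));
  }
}
{code}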

> Add option to graceful decommission to not wait for applications
> 
>
> Key: YARN-9106
> URL: https://issues.apache.org/jira/browse/YARN-9106
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Mikayla Konst
>Assignee: Mikayla Konst
>Priority: Major
> Attachments: YARN-9106.patch
>
>
> Add property 
> yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-applications.
> If true (the default), the resource manager waits for all containers, as well 
> as all applications associated with those containers, to finish before 
> gracefully decommissioning a node.
> If false, the resource manager only waits for containers, but not 
> applications, to finish. For map-only jobs or other jobs in which mappers do 
> not need to serve shuffle data, this allows nodes to be decommissioned as 
> soon as their containers are finished as opposed to when the job is done.
> Add property 
> yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-app-masters.
> If false, during graceful decommission, when the resource manager waits for 
> all containers on a node to finish, it will not wait for app master 
> containers to finish. Defaults to true. This property should only be set to 
> false if app master failure is recoverable.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9738) Remove lock on ClusterNodeTracker#getNodeReport as it blocks application submission

2019-08-13 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905952#comment-16905952
 ] 

Sunil Govindan commented on YARN-9738:
--

Hi [~BilwaST]

{{nodes}} is operated on under a read and write lock in ClusterNodeTracker. 
Converting it to a ConcurrentHashMap also impacts other code paths. If we are 
not careful about mixing the writeLock and a ConcurrentHashMap, it could cause 
redundant locking (see the sketch below).

Thanks
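
To make the concern concrete, a minimal sketch (illustrative names, not the 
actual ClusterNodeTracker code):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class NodeTrackerSketch<N> {
  private final Map<String, N> nodes = new ConcurrentHashMap<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  N getNode(String nodeId) {
    // A single-key lookup is safe on a concurrent map without the read
    // lock; this is the hot path reported in the thread dump.
    return nodes.get(nodeId);
  }

  List<N> getNodeReports() {
    // Iteration over a concurrent map only sees a weakly consistent
    // view. If compound operations elsewhere still take the write lock,
    // keeping this read lock as well means paying for both mechanisms:
    // the redundant locking mentioned above.
    lock.readLock().lock();
    try {
      return new ArrayList<>(nodes.values());
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}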

> Remove lock on ClusterNodeTracker#getNodeReport as it blocks application 
> submission
> ---
>
> Key: YARN-9738
> URL: https://issues.apache.org/jira/browse/YARN-9738
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9738-001.patch, YARN-9738-002.patch
>
>
> *Env :*
> Server OS :- UBUNTU
> No. of Cluster Node:- 9120 NMs
> Env Mode:- [Secure / Non secure]Secure
> *Preconditions:*
> ~9120 NMs were running
> ~1250 applications were in running state 
> 35K applications were in pending state
> *Test Steps:*
> 1. Submit applications from 5 clients, each client with 2 threads, across 10 
> queues in total
> 2. Once application submission increases (each distributed shell application 
> will call getClusterNodes)
> *ClientRMservice#getClusterNodes tries to get 
> ClusterNodeTracker#getNodeReport where map nodes is locked.*
> {quote}
> "IPC Server handler 36 on 45022" #246 daemon prio=5 os_prio=0 
> tid=0x7f75095de000 nid=0x1949c waiting on condition [0x7f74cff78000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f759f6d8858> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.getNodeReport(ClusterNodeTracker.java:123)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getNodeReport(AbstractYarnScheduler.java:449)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.createNodeReports(ClientRMService.java:1067)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getClusterNodes(ClientRMService.java:992)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getClusterNodes(ApplicationClientProtocolPBServiceImpl.java:313)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:589)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2792)
> {quote}
> *Instead we can make nodes a ConcurrentHashMap and remove the read lock*



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9729) [UI2] Fix error message for logs when ATSv2 is offline

2019-08-11 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9729:
-
Summary: [UI2] Fix error message for logs when ATSv2 is offline  (was: 
[UI2] Fix error message for logs without ATSv2)

> [UI2] Fix error message for logs when ATSv2 is offline
> --
>
> Key: YARN-9729
> URL: https://issues.apache.org/jira/browse/YARN-9729
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: ATS_NOT_UP.png, ATS_UP_WITH_NO_LOGS.png, Screenshot 
> 2019-08-08 at 13.23.11.png, Screenshot 2019-08-08 at 13.23.21.png, Screenshot 
> 2019-08-09 at 3.22.19 PM.png, YARN-9729.001.patch, YARN-9729.002.patch, 
> YARN-9729.003.patch, after_patch.png
>
>
> On the UI2 applications page, logs are not available unless ATSv2 is running. 
> The reason the logs do not appear is not explained on the UI.
> When ATS is reported to be unhealthy, a descriptive error message should 
> appear. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9729) [UI2] Fix error message for logs without ATSv2

2019-08-10 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904468#comment-16904468
 ] 

Sunil Govindan commented on YARN-9729:
--

Addressed comments from Akhil. Thanks [~zsiegl] and [~akhilpb]

[~rohithsharma], could you please review this patch?

> [UI2] Fix error message for logs without ATSv2
> --
>
> Key: YARN-9729
> URL: https://issues.apache.org/jira/browse/YARN-9729
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: ATS_NOT_UP.png, ATS_UP_WITH_NO_LOGS.png, Screenshot 
> 2019-08-08 at 13.23.11.png, Screenshot 2019-08-08 at 13.23.21.png, Screenshot 
> 2019-08-09 at 3.22.19 PM.png, YARN-9729.001.patch, YARN-9729.002.patch, 
> YARN-9729.003.patch, after_patch.png
>
>
> On the UI2 applications page, logs are not available unless ATSv2 is running. 
> The reason the logs do not appear is not explained on the UI.
> When ATS is reported to be unhealthy, a descriptive error message should 
> appear. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9729) [UI2] Fix error message for logs without ATSv2

2019-08-10 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9729:
-
Attachment: YARN-9729.003.patch

> [UI2] Fix error message for logs without ATSv2
> --
>
> Key: YARN-9729
> URL: https://issues.apache.org/jira/browse/YARN-9729
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: ATS_NOT_UP.png, ATS_UP_WITH_NO_LOGS.png, Screenshot 
> 2019-08-08 at 13.23.11.png, Screenshot 2019-08-08 at 13.23.21.png, Screenshot 
> 2019-08-09 at 3.22.19 PM.png, YARN-9729.001.patch, YARN-9729.002.patch, 
> YARN-9729.003.patch, after_patch.png
>
>
> On the UI2 applications page, logs are not available unless ATSv2 is running. 
> The reason the logs do not appear is not explained on the UI.
> When ATS is reported to be unhealthy, a descriptive error message should 
> appear. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9729) [UI2] Fix error message for logs without ATSv2

2019-08-10 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904459#comment-16904459
 ] 

Sunil Govindan commented on YARN-9729:
--

Mistakenly pushed the wrong patch. Reverted it.

 

> [UI2] Fix error message for logs without ATSv2
> --
>
> Key: YARN-9729
> URL: https://issues.apache.org/jira/browse/YARN-9729
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: ATS_NOT_UP.png, ATS_UP_WITH_NO_LOGS.png, Screenshot 
> 2019-08-08 at 13.23.11.png, Screenshot 2019-08-08 at 13.23.21.png, Screenshot 
> 2019-08-09 at 3.22.19 PM.png, YARN-9729.001.patch, YARN-9729.002.patch, 
> after_patch.png
>
>
> On the UI2 applications page, logs are not available unless ATSv2 is running. 
> The reason the logs do not appear is not explained on the UI.
> When ATS is reported to be unhealthy, a descriptive error message should 
> appear. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9729) [UI2] Fix error message for logs without ATSv2

2019-08-10 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9729:
-
Attachment: YARN-9729.002.patch

> [UI2] Fix error message for logs without ATSv2
> --
>
> Key: YARN-9729
> URL: https://issues.apache.org/jira/browse/YARN-9729
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: ATS_NOT_UP.png, ATS_UP_WITH_NO_LOGS.png, Screenshot 
> 2019-08-08 at 13.23.11.png, Screenshot 2019-08-08 at 13.23.21.png, Screenshot 
> 2019-08-09 at 3.22.19 PM.png, YARN-9729.001.patch, YARN-9729.002.patch, 
> after_patch.png
>
>
> On the UI2 applications page, logs are not available unless ATSv2 is running. 
> The reason the logs do not appear is not explained on the UI.
> When ATS is reported to be unhealthy, a descriptive error message should 
> appear. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9701) Yarn service cli commands do not connect to ssl enabled RM using ssl-client.xml configs

2019-08-10 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904346#comment-16904346
 ] 

Sunil Govindan commented on YARN-9701:
--

cc [~billie.rina...@gmail.com]
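
For reference, a minimal sketch of what building the connection from 
ssl-client.xml via Hadoop's SSLFactory could look like (illustrative only, not 
the attached patch; the endpoint URL is hypothetical):

{code:java}
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.ssl.SSLFactory;

public class SslClientSketch {
  public static void main(String[] args) throws Exception {
    // In CLIENT mode, SSLFactory reads ssl-client.xml (the resource named
    // by hadoop.ssl.client.conf), so the configured truststore is used
    // instead of the JVM default cacerts.
    SSLFactory sslFactory =
        new SSLFactory(SSLFactory.Mode.CLIENT, new Configuration());
    sslFactory.init();
    HttpsURLConnection conn = (HttpsURLConnection)
        new URL("https://rm-host:8090/app/v1/services").openConnection();
    conn.setSSLSocketFactory(sslFactory.createSSLSocketFactory());
    conn.setHostnameVerifier(sslFactory.getHostnameVerifier());
    System.out.println("HTTP " + conn.getResponseCode());
    sslFactory.destroy();
  }
}
{code}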

> Yarn service cli commands do not connect to ssl enabled RM using 
> ssl-client.xml configs
> ---
>
> Key: YARN-9701
> URL: https://issues.apache.org/jira/browse/YARN-9701
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9701.001.patch, YARN-9701.002.patch
>
>
> Yarn service commands use the yarn service rest api. When ssl is enabled for 
> the RM, the yarn service commands fail, as they don't read the ssl-client.xml 
> configs to create an ssl connection to the rest api.
> This becomes a problem especially for self-signed certificates, as the 
> truststore location specified at ssl.client.truststore.location is not 
> considered by the commands.
> As a workaround, we need to import the certificates into the java default 
> cacerts for the yarn service commands to work via ssl. It would be more proper 
> if the yarn service commands made use of the configs in ssl-client.xml to 
> configure and create an ssl client connection. This workaround may not even 
> work if there are additional properties configured in ssl-client.xml that are 
> necessary apart from the truststore-related properties.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9715) [UI2] yarn-container-log URI need to be encoded to avoid potential misuses

2019-08-09 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9715:
-
Summary: [UI2] yarn-container-log URI need to be encoded to avoid potential 
misuses  (was: [YARN UI2] yarn-container-log support for https Knox Gateway url 
in nodes page)

> [UI2] yarn-container-log URI need to be encoded to avoid potential misuses
> --
>
> Key: YARN-9715
> URL: https://issues.apache.org/jira/browse/YARN-9715
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Prabhu Joseph
>Assignee: Akhil PB
>Priority: Major
> Attachments: Screen Shot 2019-08-08 at 12.54.40 PM.png, Screen Shot 
> 2019-08-08 at 12.55.03 PM.png, Screen Shot 2019-08-08 at 2.51.46 PM.png, 
> Screen Shot 2019-08-08 at 3.03.16 PM.png, YARN-9715.001.patch, 
> YARN-9715.002.patch
>
>
> Currently yarn-container-log (UI2 - Nodes - List of Containers - log file) 
> creates the url with the node scheme (http) and nodeHttpAddress. This does 
> not work with a Knox Gateway https url. The logic to construct the url can be 
> improved to accept both the normal and the Knox case. A similar approach is 
> used in the Applications -> Logs section.
> Also, UI2 - Nodes - List of Containers - log file does not have pagination 
> support for the log file.
>  
> *Screenshot of the problematic page*: Knox Url - UI2 - Nodes - List of 
> Containers - log file 
> !Screen Shot 2019-08-08 at 3.03.16 PM.png|height=200|width=350!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9715) [YARN UI2] yarn-container-log support for https Knox Gateway url in nodes page

2019-08-09 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903615#comment-16903615
 ] 

Sunil Govindan commented on YARN-9715:
--

+1

Committing this in.

> [YARN UI2] yarn-container-log support for https Knox Gateway url in nodes page
> --
>
> Key: YARN-9715
> URL: https://issues.apache.org/jira/browse/YARN-9715
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Prabhu Joseph
>Assignee: Akhil PB
>Priority: Major
> Attachments: Screen Shot 2019-08-08 at 12.54.40 PM.png, Screen Shot 
> 2019-08-08 at 12.55.03 PM.png, Screen Shot 2019-08-08 at 2.51.46 PM.png, 
> Screen Shot 2019-08-08 at 3.03.16 PM.png, YARN-9715.001.patch, 
> YARN-9715.002.patch
>
>
> Currently yarn-container-log (UI2 - Nodes - List of Containers - log file) 
> creates the url with the node scheme (http) and nodeHttpAddress. This does 
> not work with a Knox Gateway https url. The logic to construct the url can be 
> improved to accept both the normal and the Knox case. A similar approach is 
> used in the Applications -> Logs section.
> Also, UI2 - Nodes - List of Containers - log file does not have pagination 
> support for the log file.
>  
> *Screenshot of the problematic page*: Knox Url - UI2 - Nodes - List of 
> Containers - log file 
> !Screen Shot 2019-08-08 at 3.03.16 PM.png|height=200|width=350!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9721) An easy method to exclude a nodemanager from the yarn cluster cleanly

2019-08-06 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900661#comment-16900661
 ] 

Sunil Govindan commented on YARN-9721:
--

Looping [~tangzhankun] into this thread.

[~yuan_zac], ideally node decommissioning will help you make sure all 
containers are drained and a smooth decommission can be done. Once the node is 
decommissioned, you can remove it as your use case requires. And as you 
mentioned, nodes which are forced out should not be in the inactive list.

cc [~leftnoteasy] [~cheersyang]

> An easy method to exclude a nodemanager from the yarn cluster cleanly
> -
>
> Key: YARN-9721
> URL: https://issues.apache.org/jira/browse/YARN-9721
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Priority: Major
> Attachments: decommission nodes.png
>
>
> If we want to take a nodemanager server offline, the nodes.exclude-path
>  and "rmadmin -refreshNodes" command are used to decommission the server.
>  But this method cannot clean up the node cleanly. Nodemanager servers are 
> still listed under Decommissioned Nodes, as the attachment shows.
>   !decommission nodes.png!
> YARN-4311 enables a removalTimer to clean up the untracked node.
>  But the logic of the isUntrackedNode method is too restrictive. If 
> include-path is not used, no servers can meet the criteria, and using an 
> include file would introduce a potential maintenance risk.
> If the yarn cluster is installed in the cloud, nodemanager servers are created 
> and deleted frequently. We need a way to exclude a nodemanager from the yarn 
> cluster cleanly. Otherwise, the map of rmContext.getInactiveRMNodes() would 
> keep growing, which would cause a memory issue in the RM.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9722) PlacementRule logs object ID in place of queue name.

2019-08-05 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900621#comment-16900621
 ] 

Sunil Govindan commented on YARN-9722:
--

Looks fine, let's get this in. +1
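
For context, the usual fix for this class of log line is to print the resolved 
queue name rather than the context object; a minimal sketch (illustrative, not 
necessarily the committed patch):

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class PlacementLogSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(PlacementLogSketch.class);

  // Hypothetical stand-in for ApplicationPlacementContext#getQueue():
  // log the queue name, not the object, to avoid output like
  // "...ApplicationPlacementContext@5aafe9b2".
  static void logMapping(String appId, String user, String mapping,
      String resolvedQueue, boolean override) {
    LOG.info("Application {} user {} mapping [{}] to [{}] override {}",
        appId, user, mapping, resolvedQueue, override);
  }
}
{code}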

> PlacementRule logs object ID in place of queue name.
> 
>
> Key: YARN-9722
> URL: https://issues.apache.org/jira/browse/YARN-9722
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
>  Labels: supportability
> Attachments: YARN-9722-001.patch
>
>
> UserGroupMappingPlacementRule logs object ID in place of queue name.
> {code}
> 2019-08-05 09:28:52,664 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule:
>  Application application_1564996871731_0003 user ambari-qa mapping [default] 
> to 
> [org.apache.hadoop.yarn.server.resourcemanager.placement.ApplicationPlacementContext@5aafe9b2]
>  override false
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9583) Failed job which is submitted unknown queue is showed all users

2019-08-05 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900619#comment-16900619
 ] 

Sunil Govindan commented on YARN-9583:
--

Thanks [~magnum]

[~Prabhu Joseph] and [~magnum], I am worried about one scenario here. Assume an 
application has been submitted to a queue named "queueA" and successfully 
completed. Now delete queueA and restart the RM. Will the RM be able to recover 
this app and keep its state as SUCCESS?

Could you please cross-check this case as well?

> Failed job which is submitted unknown queue is showed all users
> ---
>
> Key: YARN-9583
> URL: https://issues.apache.org/jira/browse/YARN-9583
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: security
>Affects Versions: 3.1.2
>Reporter: KWON BYUNGCHANG
>Assignee: KWON BYUNGCHANG
>Priority: Major
> Attachments: YARN-9583-screenshot.png, YARN-9583.001.patch, 
> YARN-9583.002.patch, YARN-9583.003.patch, YARN-9583.004.patch, 
> YARN-9583.005.patch
>
>
> In secure mode, a failed job which was submitted to an unknown queue is shown 
> to all users.
> I attached an RM UI screenshot.
> Reproduction scenario:
>1. User foo submits a job to an unknown queue without a view-acl, and the job 
> fails immediately. 
>2. User bar can access the job of user foo which previously failed.
> According to the comments in QueueACLsManager.java that caused the problem,
> this situation can happen when the RM is restarted after deleting a queue.
> I think showing the app of a non-existing queue to all users after RM start is 
> the problem. 
> It would become a security hole.
> I fixed it a little bit.  
> After the fix, both the owner of the job and the admin of yarn can access a 
> job which was submitted to an unknown queue. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator

2019-07-22 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890322#comment-16890322
 ] 

Sunil Govindan commented on YARN-9687:
--

Hi [~Tao Yang]

Thanks for reporting this issue. Yes, we have seen this in a few places where 
such cases can occur given certain combinations of resource values. *fitsIn* 
helps in such areas (we already fixed a few in the preemption modules). A 
sketch of the difference is below.

+1 for this patch.
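
For illustration, a minimal sketch of the ratio-vs-componentwise difference 
using the numbers from the description below (assuming the standard 
Resources/DominantResourceCalculator utilities):

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class HeadroomCheckSketch {
  public static void main(String[] args) {
    ResourceCalculator drc = new DominantResourceCalculator();
    Resource cluster = Resource.newInstance(10 * 1024, 10); // <10GB, 10 vcores>
    Resource headroom = Resource.newInstance(2 * 1024, 4);  // <2GB, 4 vcores>
    Resource required = Resource.newInstance(3 * 1024, 1);  // <3GB, 1 vcore>

    // Ratio-based check: the dominant share of the headroom (0.4) is
    // greater than that of the request (0.3), so this lets it through
    // even though 3GB does not fit into 2GB of memory headroom.
    System.out.println(
        Resources.greaterThanOrEqual(drc, cluster, headroom, required));
    // Component-wise check: rejects the request, as intended.
    System.out.println(Resources.fitsIn(required, headroom));
  }
}
{code}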

> Queue headroom check may let unacceptable allocation off when using 
> DominantResourceCalculator
> --
>
> Key: YARN-9687
> URL: https://issues.apache.org/jira/browse/YARN-9687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9687.001.patch
>
>
> Currently the queue headroom check in 
> {{RegularContainerAllocator#checkHeadroom}} uses 
> {{Resources#greaterThanOrEqual}}, which internally compares resources by 
> ratio; when using DominantResourceCalculator, it may let unacceptable 
> allocations through in some scenarios.
> For example:
> cluster-resource=<10GB, 10 vcores>
> queue-headroom=<2GB, 4 vcores>
> required-resource=<3GB, 1 vcores>
> In this case, the headroom ratio (0.4) is greater than the required ratio 
> (0.3), so allocations will be let through in the scheduling process but will 
> always be rejected when committing these proposals.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9681) AM resource limit is incorrect for queue

2019-07-16 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885987#comment-16885987
 ] 

Sunil Govindan commented on YARN-9681:
--

Hi [~gb.ana...@gmail.com]

Could you please share more details on this?

> AM resource limit is incorrect for queue
> 
>
> Key: YARN-9681
> URL: https://issues.apache.org/jira/browse/YARN-9681
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.2
>Reporter: ANANDA G B
>Priority: Major
> Attachments: After running job on queue1.png, Before running job on 
> queue1.png
>
>
> After running a job on Queue1 of Partition1, Queue1 of 
> DEFAULT_PARTITION's 'Max Application Master Resources' is calculated wrongly. 
> Please see the attachments.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-07-05 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878985#comment-16878985
 ] 

Sunil Govindan commented on YARN-9642:
--

+1 from me.

Committing shortly if there are no objections.

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9642.001.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimeTask could hold the reference of Scheduler in case of fast switch 
> over too.
>  AbstractYarnScheduler should make sure scheduled Timer cancelled on 
> serviceStop.
> Causes memory leak too
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9662) Preemption not working on NodeLabels

2019-07-03 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877538#comment-16877538
 ] 

Sunil Govindan edited comment on YARN-9662 at 7/3/19 7:01 AM:
--

[~Amithsha], hadoop-2.7 doesn't support label-based preemption, and YARN-7685 
was about that.

Could you please confirm the version?


was (Author: sunilg):
[~Amithsha] hadoop-2.7 doesnt support label based preemption, and YARN-7685 was 
about that.

Cud u pls conform the version ?

> Preemption not working on NodeLabels
> 
>
> Key: YARN-9662
> URL: https://issues.apache.org/jira/browse/YARN-9662
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Amithsha
>Priority: Major
>
> Preemption on node labels is not working when the utilization is 100%.
> Example
>  Queues adhocp0, adhocp1 and adhocp3 are mapped to the node label 
> label_adhoc_nm, with shares of 60, 30 and 10 as actual capacity and 100 as 
> maximum capacity for all.
>  When jobA on adhocp3 consumes 100% of its maximum capacity and jobB is 
> submitted on adhocp0, no containers running on the adhocp3 queue get 
> preempted.
>   
>  This was already reported by another user:
>  https://issues.apache.org/jira/browse/YARN-7685
> Note:
> Jobs using more than their actual capacity but less than the maximum capacity 
> are able to preempt containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9662) Preemption not working on NodeLabels

2019-07-03 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877538#comment-16877538
 ] 

Sunil Govindan commented on YARN-9662:
--

[~Amithsha], hadoop-2.7 doesn't support label-based preemption, and YARN-7685 
was about that.

Could you please confirm the version?

> Preemption not working on NodeLabels
> 
>
> Key: YARN-9662
> URL: https://issues.apache.org/jira/browse/YARN-9662
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Amithsha
>Priority: Major
>
> Preemption on node labels is not working when the utilization is 100%.
> Example
>  Queues adhocp0, adhocp1 and adhocp3 are mapped to the node label 
> label_adhoc_nm, with shares of 60, 30 and 10 as actual capacity and 100 as 
> maximum capacity for all.
>  When jobA on adhocp3 consumes 100% of its maximum capacity and jobB is 
> submitted on adhocp0, no containers running on the adhocp3 queue get 
> preempted.
>   
>  This was already reported by another user:
>  https://issues.apache.org/jira/browse/YARN-7685
> Note:
> Jobs using more than their actual capacity but less than the maximum capacity 
> are able to preempt containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9644) First RMContext object is always leaked during switch over

2019-07-03 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877536#comment-16877536
 ] 

Sunil Govindan commented on YARN-9644:
--

[~bibinchundatt], could you please share the 3.2 patch?

> First RMContext object is always leaked during switch over
> --
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9644.001.patch, YARN-9644.002.patch, 
> YARN-9644.003.patch
>
>
> As per my understanding, the following 2 issues cause the problem.
> * WebApp holds the reference to the first ApplicationMasterService instance, 
> which has an rmContext with the ActiveServiceContext (holding the RMApps + 
> nodes map). The WebApp remains for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and 
> never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext gets created, but 
> the above 2 issues cause the first RMContext to persist until RM shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-03 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877534#comment-16877534
 ] 

Sunil Govindan commented on YARN-7621:
--

This makes sense to me. 

We are trimming the last section of the queue path and querying based on that, 
so users get a seamless transition (see the sketch below). +1 for this. Thanks 
[~Tao Yang] and [~cheersyang]
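
A minimal sketch of the trimming described above (hypothetical helper, not the 
patch itself):

{code:java}
public class QueuePathSketch {
  // CapacityScheduler keys leaf queues by name, so a full path such as
  // "root.a.b" is trimmed to its last section, "b", before querying.
  static String extractLeafName(String queuePath) {
    int lastDot = queuePath.lastIndexOf('.');
    return lastDot < 0 ? queuePath : queuePath.substring(lastDot + 1);
  }

  public static void main(String[] args) {
    System.out.println(extractLeafName("root.a.b")); // b
    System.out.println(extractLeafName("default"));  // default
  }
}
{code}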

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently there is a difference in the queue definition in 
> ApplicationSubmissionContext between CapacityScheduler and FairScheduler: 
> FairScheduler needs the queue path but CapacityScheduler needs the queue 
> name. There is no doubt about the correctness of the queue definition for 
> CapacityScheduler, because it does not allow duplicate leaf queue names, but 
> it is hard to switch between FairScheduler and CapacityScheduler. I propose 
> supporting app submission with a queue path for CapacityScheduler, to make 
> the interface clearer and the scheduler switch smooth.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9629) Support configurable MIN_LOG_ROLLING_INTERVAL

2019-07-02 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877018#comment-16877018
 ] 

Sunil Govindan commented on YARN-9629:
--

Thanks [~adam.antal]

The first point makes sense if the default value is -1; this code will always 
kick in if it is never configured.

For #2, once a value is configured, it is always better to state it cleanly, as 
in that last else condition. Hence, taking it out and showing the same log in 
both kinds of scenario still makes sense to me. In the first log you can 
minimize the content if it is duplicated; however, skipping the else condition 
looks cleaner and more generic.

> Support configurable MIN_LOG_ROLLING_INTERVAL
> -
>
> Key: YARN-9629
> URL: https://issues.apache.org/jira/browse/YARN-9629
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation, nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
> Attachments: YARN-9629.001.patch, YARN-9629.002.patch, 
> YARN-9629.003.patch, YARN-9629.004.patch, YARN-9629.005.patch
>
>
> One of the log-aggregation parameters, the minimum valid value for 
> {{yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds}}, is 
> MIN_LOG_ROLLING_INTERVAL - it has been hardcoded since its addition in 
> YARN-2583. 
> It was empirically set to 1 hour, as lower values would put the NodeManagers 
> under pressure too frequently. For bigger clusters that is indeed a valid 
> limitation, but for smaller clusters it is a valid customer use case to use 
> lower values, even a not-so-low 30 minutes. At this point this can only be 
> achieved by setting 
> {{yarn.nodemanager.log-aggregation.debug-enabled}}, which I believe should be 
> kept for debug purposes.
> I suggest making this minimum configurable, although a warning should be 
> logged at NodeManager startup when the value is lower than 1 hour.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9644) First RMContext object is always leaked during switch over

2019-07-02 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876730#comment-16876730
 ] 

Sunil Govindan commented on YARN-9644:
--

Thanks [~bibinchundatt]. Pushed to trunk. There are conflicts in branch-3.2.

Could you please share a branch-3.2/3.1 patch?

> First RMContext object is always leaked during switch over
> --
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9644.001.patch, YARN-9644.002.patch, 
> YARN-9644.003.patch
>
>
> As per my understanding, the following 2 issues cause the problem.
> * WebApp holds the reference to the first ApplicationMasterService instance, 
> which has an rmContext with the ActiveServiceContext (holding the RMApps + 
> nodes map). The WebApp remains for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and 
> never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext gets created, but 
> the above 2 issues cause the first RMContext to persist until RM shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9644) First RMContext object is always leaked during switch over

2019-07-02 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9644:
-
Summary: First RMContext object is always leaked during switch over  (was: 
First RMContext always leaked during switch over)

> First RMContext object is always leaked during switch over
> --
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9644.001.patch, YARN-9644.002.patch, 
> YARN-9644.003.patch
>
>
> As per my understanding, the following 2 issues cause the problem.
> * WebApp holds the reference to the first ApplicationMasterService instance, 
> which has an rmContext with the ActiveServiceContext (holding the RMApps + 
> nodes map). The WebApp remains for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and 
> never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext gets created, but 
> the above 2 issues cause the first RMContext to persist until RM shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9327) Improve synchronisation in ProtoUtils#convertToProtoFormat block

2019-07-02 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9327:
-
Summary: Improve synchronisation in ProtoUtils#convertToProtoFormat block  
(was: ProtoUtils#convertToProtoFormat block Application Master Service and many 
more)

> Improve synchronisation in ProtoUtils#convertToProtoFormat block
> 
>
> Key: YARN-9327
> URL: https://issues.apache.org/jira/browse/YARN-9327
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9327.001.patch, YARN-9327.002.patch, 
> YARN-9327.003.patch
>
>
> {code}
>   public static synchronized ResourceProto convertToProtoFormat(Resource r) {
> return ResourcePBImpl.getProto(r);
>   }
> {code}
> {noformat}
> "IPC Server handler 41 on 23764" #324 daemon prio=5 os_prio=0 
> tid=0x7f181de72800 nid=0x222 waiting for monitor entry 
> [0x7ef153dad000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:404)
>   - waiting to lock <0x7ef2d8bcf6d8> (a java.lang.Class for 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:315)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:262)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:289)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:228)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:844)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:72)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:810)
>   - locked <0x7f0fed96f500> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:799)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>   at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:13810)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:158)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:198)
>   - eliminated <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:103)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:824)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2684){noformat}
> Seems synchronization is not required here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-9327) ProtoUtils#convertToProtoFormat block Application Master Service and many more

2019-07-02 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876694#comment-16876694
 ] 

Sunil Govindan commented on YARN-9327:
--

+1, Committing shortly.

> ProtoUtils#convertToProtoFormat block Application Master Service and many more
> --
>
> Key: YARN-9327
> URL: https://issues.apache.org/jira/browse/YARN-9327
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9327.001.patch, YARN-9327.002.patch, 
> YARN-9327.003.patch
>
>
> {code}
>   public static synchronized ResourceProto convertToProtoFormat(Resource r) {
> return ResourcePBImpl.getProto(r);
>   }
> {code}
> {noformat}
> "IPC Server handler 41 on 23764" #324 daemon prio=5 os_prio=0 
> tid=0x7f181de72800 nid=0x222 waiting for monitor entry 
> [0x7ef153dad000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:404)
>   - waiting to lock <0x7ef2d8bcf6d8> (a java.lang.Class for 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:315)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:262)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:289)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:228)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:844)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:72)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:810)
>   - locked <0x7f0fed96f500> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:799)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>   at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:13810)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:158)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:198)
>   - eliminated <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:103)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:824)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2684){noformat}
> Seems synchronization is not required here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-9629) Support configurable MIN_LOG_ROLLING_INTERVAL

2019-07-02 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876693#comment-16876693
 ] 

Sunil Govindan commented on YARN-9629:
--

Hi [~adam.antal] 

Thanks for the patch.

A couple of minor nits:

1.
{code:java}
+  LOG.info("rollingMonitorInterval is set as " + interval
+  + ". The log rolling monitoring interval is disabled. "
+  + "The logs will be aggregated after this application is 
finished.");{code}
Could this log be considered a warn? Is it possible to print the appId or some 
more identifying details for easier understanding?

2. 
{code:java}
+  if (lowerThanHardLimit) {
+if (logAggregationDebugMode) {
+  LOG.info("Log aggregation debug mode enabled. " +
+  "rollingMonitorInterval = " + interval);
+} else {
+  LOG.warn("rollingMonitorInterval should be more than " +
+  "or equal to {} seconds. Using {} seconds instead.",
+  minRollingMonitorInterval, minRollingMonitorInterval);
+  interval = minRollingMonitorInterval;
+}
+  } else {
+LOG.info("rollingMonitorInterval is set as " + interval
++ ". The logs will be aggregated every " + interval
++ " seconds");
+  }{code}
The last log, which is in the else block, can be taken out and kept as a 
common one; since interval is set in the internal if..else, it is better to 
have the common log outside. (The warn log you added is correct and can stay.) 
A sketch of the restructured block follows.
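
For illustration, the restructuring suggested above might look like this 
(sketch only; the variables come from the snippet quoted from the patch):

{code:java}
if (lowerThanHardLimit) {
  if (logAggregationDebugMode) {
    LOG.info("Log aggregation debug mode enabled. " +
        "rollingMonitorInterval = " + interval);
  } else {
    LOG.warn("rollingMonitorInterval should be more than " +
        "or equal to {} seconds. Using {} seconds instead.",
        minRollingMonitorInterval, minRollingMonitorInterval);
    interval = minRollingMonitorInterval;
  }
}
// Common log kept outside the if..else, once interval is finalized.
LOG.info("rollingMonitorInterval is set as " + interval
    + ". The logs will be aggregated every " + interval + " seconds");
{code}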

 

> Support configurable MIN_LOG_ROLLING_INTERVAL
> -
>
> Key: YARN-9629
> URL: https://issues.apache.org/jira/browse/YARN-9629
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation, nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
> Attachments: YARN-9629.001.patch, YARN-9629.002.patch, 
> YARN-9629.003.patch, YARN-9629.004.patch, YARN-9629.005.patch
>
>
> One of the log-aggregation parameters, the minimum valid value for 
> {{yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds}}, is 
> MIN_LOG_ROLLING_INTERVAL - it has been hardcoded since its addition in 
> YARN-2583. 
> It was empirically set to 1 hour, as lower values would put the NodeManagers 
> under pressure too frequently. For bigger clusters that is indeed a valid 
> limitation, but for smaller clusters it is a valid customer use case to use 
> lower values, even a not-so-low 30 minutes. At this point this can only be 
> achieved by setting 
> {{yarn.nodemanager.log-aggregation.debug-enabled}}, which I believe should be 
> kept for debug purposes.
> I suggest making this minimum configurable, although a warning should be 
> logged at NodeManager startup when the value is lower than 1 hour.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9644) First RMContext always leaked during switch over

2019-06-26 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873005#comment-16873005
 ] 

Sunil Govindan commented on YARN-9644:
--

+1 on this patch.

The fix seems fine for both cases mentioned here. Committing shortly if there 
are no objections. Thanks [~rohithsharma] for the offline review. An 
illustrative sketch of the MBean register/unregister pattern is below.
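
For the second case (the MBean registered on transition to active but never 
unregistered), a minimal sketch of the symmetric pattern using Hadoop's MBeans 
utility (names are assumptions, not the patch itself):

{code:java}
import javax.management.ObjectName;
import org.apache.hadoop.metrics2.util.MBeans;

class RmNmInfoLifecycleSketch {
  private ObjectName rmnmInfoBeanName;

  void transitionToActive(Object rmnmInfoBean) {
    // Register the RMNMInfo bean when becoming active...
    rmnmInfoBeanName =
        MBeans.register("ResourceManager", "RMNMInfo", rmnmInfoBean);
  }

  void transitionToStandby() {
    // ...and unregister it on standby, so the old RMContext reachable
    // through the bean can be garbage collected instead of leaking.
    if (rmnmInfoBeanName != null) {
      MBeans.unregister(rmnmInfoBeanName);
      rmnmInfoBeanName = null;
    }
  }
}
{code}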

> First RMContext always leaked during switch over
> 
>
> Key: YARN-9644
> URL: https://issues.apache.org/jira/browse/YARN-9644
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9644.001.patch, YARN-9644.002.patch, 
> YARN-9644.003.patch
>
>
> As per my understanding, the following 2 issues cause the problem.
> * WebApp holds the reference to the first ApplicationMasterService instance, 
> which has an rmContext with the ActiveServiceContext (holding the RMApps + 
> nodes map). The WebApp remains for the lifetime of the RM process.
> * On transition to active, the RMNMInfo object is registered as an MBean and 
> never unregistered on transitionToStandby.
> On transition to standby and back to active, a new RMContext gets created, but 
> the above 2 issues cause the first RMContext to persist until RM shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9639) DecommissioningNodesWatcher cause memory leak

2019-06-25 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872922#comment-16872922
 ] 

Sunil Govindan commented on YARN-9639:
--

Thanks [~BilwaST] for pointing this out. I have once seen stop being called 
twice; that was really another bug. serviceStop should ideally be called only 
once, and on that point this patch looks good. Thanks.
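
For illustration, the cancel-on-stop pattern under discussion might look like 
this (sketch only; field and method names are assumptions):

{code:java}
import java.util.Timer;
import java.util.TimerTask;

class DecommissioningWatcherSketch {
  private Timer pollTimer;

  void serviceStart() {
    pollTimer = new Timer("DecommissioningNodesWatcher poller", true);
    pollTimer.schedule(new TimerTask() {
      @Override public void run() {
        // Poll decommissioning nodes; this task (and anything it
        // references, such as the RMContext) stays reachable until
        // the timer is cancelled.
      }
    }, 0L, 20_000L);
  }

  void serviceStop() {
    // Cancelling drops the scheduled task, so the timer thread and the
    // RMContext it references do not outlive the service.
    if (pollTimer != null) {
      pollTimer.cancel();
      pollTimer = null;
    }
  }
}
{code}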

> DecommissioningNodesWatcher cause memory leak
> -
>
> Key: YARN-9639
> URL: https://issues.apache.org/jira/browse/YARN-9639
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-9639-001.patch
>
>
> The missing cancel() of the Timer task in DecommissioningNodesWatcher could 
> lead to a memory leak.
> PollTimerTask holds a reference to the rmContext.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7721) TestContinuousScheduling fails sporadically with NPE

2019-06-25 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan reassigned YARN-7721:


Assignee: Sunil Govindan  (was: Wilfred Spiegelenburg)

> TestContinuousScheduling fails sporadically with NPE
> 
>
> Key: YARN-7721
> URL: https://issues.apache.org/jira/browse/YARN-7721
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Jason Lowe
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-7721.001.patch
>
>
> TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime is 
> failing sporadically with an NPE in precommit builds, and I can usually 
> reproduce it locally after a few tries:
> {noformat}
> [ERROR] 
> testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
>   Time elapsed: 0.085 s  <<< ERROR!
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:383)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> [...]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9639) DecommissioningNodesWatcher cause memory leak

2019-06-25 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872174#comment-16872174
 ] 

Sunil Govindan commented on YARN-9639:
--

[~bibinchundatt], a quick question: would it be better to have a null check on 
pollTimer?

> DecommissioningNodesWatcher cause memory leak
> -
>
> Key: YARN-9639
> URL: https://issues.apache.org/jira/browse/YARN-9639
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-9639-001.patch
>
>
> A missing cancel() of the Timer task in DecommissioningNodesWatcher could 
> lead to a memory leak.
> PollTimerTask holds a reference to the RMContext.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9635) Nodes page displayed duplicate nodes

2019-06-24 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872010#comment-16872010
 ] 

Sunil Govindan commented on YARN-9635:
--

Hi [~jiwq] and [~Tao Yang]

I am trying to understand whether this is an implementation issue. As [~Tao 
Yang] mentioned, I think updating the documentation seems to be the real change 
needed. Could you please help to confirm for 3.2? Thanks.

> Nodes page displayed duplicate nodes
> 
>
> Key: YARN-9635
> URL: https://issues.apache.org/jira/browse/YARN-9635
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api
>Affects Versions: 3.2.0
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Major
> Attachments: UI2-nodes.jpg
>
>
> Steps:
>  * shutdown nodes
>  * start nodes
> Nodes Page:
> !UI2-nodes.jpg!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9639) DecommissioningNodesWatcher cause memory leak

2019-06-23 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870594#comment-16870594
 ] 

Sunil Govindan commented on YARN-9639:
--

One small nit: in DecommissioningNodesWatcher, do we need to guard pollTimer 
with a null check in the stop method?

 

> DecommissioningNodesWatcher cause memory leak
> -
>
> Key: YARN-9639
> URL: https://issues.apache.org/jira/browse/YARN-9639
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-9639-001.patch
>
>
> A missing cancel() of the Timer task in DecommissioningNodesWatcher could 
> lead to a memory leak.
> PollTimerTask holds a reference to the RMContext.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-06-22 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870425#comment-16870425
 ] 

Sunil Govindan commented on YARN-9642:
--

Good catch. Yes, this timer ideally needs to be cancelled for cleaner shutdown.

The patch seems fine. + [~cheersyang] [~leftnoteasy] for additional thoughts

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9642.001.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in the case of a fast 
> switch over too.
> AbstractYarnScheduler should make sure the scheduled Timer is cancelled on 
> serviceStop.
> This causes a memory leak too.
> !image-2019-06-22-16-05-24-114.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9327) ProtoUtils#convertToProtoFormat block Application Master Service and many more

2019-06-21 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869689#comment-16869689
 ] 

Sunil Govindan commented on YARN-9327:
--

Looks fine to me. [~wangda], could you please take a look?

If there are no issues, we can get this in.

> ProtoUtils#convertToProtoFormat block Application Master Service and many more
> --
>
> Key: YARN-9327
> URL: https://issues.apache.org/jira/browse/YARN-9327
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9327.001.patch, YARN-9327.002.patch, 
> YARN-9327.003.patch
>
>
> {code}
>   public static synchronized ResourceProto convertToProtoFormat(Resource r) {
> return ResourcePBImpl.getProto(r);
>   }
> {code}
> {noformat}
> "IPC Server handler 41 on 23764" #324 daemon prio=5 os_prio=0 
> tid=0x7f181de72800 nid=0x222 waiting for monitor entry 
> [0x7ef153dad000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:404)
>   - waiting to lock <0x7ef2d8bcf6d8> (a java.lang.Class for 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:315)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:262)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:289)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:228)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:844)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:72)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:810)
>   - locked <0x7f0fed96f500> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:799)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>   at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:13810)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:158)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:198)
>   - eliminated <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:103)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:824)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2684){noformat}
> Seems synchronization is not required here.
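
The change under review amounts to dropping the class-level lock; a sketch of 
the before/after, based on the reviewers' conclusion that the synchronization 
is not required:

{code}
// Before: the class-level lock on ProtoUtils serializes every caller,
// blocking IPC handler threads as in the stack trace above.
public static synchronized ResourceProto convertToProtoFormat(Resource r) {
  return ResourcePBImpl.getProto(r);
}

// After (sketch): no class-level lock, so concurrent allocate responses
// no longer contend on the ProtoUtils class monitor.
public static ResourceProto convertToProtoFormat(Resource r) {
  return ResourcePBImpl.getProto(r);
}
{code}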



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (YARN-9625) UI2 - No link to a queue on the Queues page for Fair Scheduler

2019-06-14 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863925#comment-16863925
 ] 

Sunil Govindan commented on YARN-9625:
--

+ [~zsiegl] [~snemeth]

> UI2 - No link to a queue on the Queues page for Fair Scheduler
> --
>
> Key: YARN-9625
> URL: https://issues.apache.org/jira/browse/YARN-9625
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Charan Hebri
>Priority: Major
> Attachments: Capacity_scheduler_page.png, Fair_scheduler_page.png
>
>
> When the scheduler is set as 'Capacity Scheduler' the Queues page has a tab 
> on the right with a link to a certain queue which provides running app 
> information for the queue. But for 'Fair Scheduler' there is no such link. 
> Attached screenshots for both schedulers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9626) UI2 - Fair scheduler queue apps page issues

2019-06-14 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863926#comment-16863926
 ] 

Sunil Govindan commented on YARN-9626:
--

+ [~snemeth] [~zsiegl]

> UI2 - Fair scheduler queue apps page issues
> ---
>
> Key: YARN-9626
> URL: https://issues.apache.org/jira/browse/YARN-9626
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Charan Hebri
>Priority: Major
> Attachments: Fair_scheduler_apps_page.png
>
>
> There are a few issues with the apps page for a queue when Fair Scheduler is 
> used.
>  * Labels like configured capacity, configured max capacity etc. (marked in 
> the attached image) are not needed as they are specific to Capacity Scheduler.
>  * Steady fair memory, used memory and maximum memory are actual values but 
> are shown as percentages.
>  * Formatting of Pending, Allocated, Reserved Containers values is not 
> correct (shown in the attached screenshot)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9607) Auto-configuring rollover-size of IFile format for non-appendable filesystems

2019-06-12 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862164#comment-16862164
 ] 

Sunil Govindan commented on YARN-9607:
--

Hi [~adam.antal]

Could we add a new config such as 
"yarn.logaggregation.non-appendable-fs.enable" or similar that can be set to 
TRUE? Then for any cloud storage we could set this flag, and we would not need 
to enumerate every filesystem in if conditions. Thoughts?

> Auto-configuring rollover-size of IFile format for non-appendable filesystems
> -
>
> Key: YARN-9607
> URL: https://issues.apache.org/jira/browse/YARN-9607
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9607.001.patch
>
>
> In YARN-9525, we made the IFile format compatible with remote folders with 
> the s3a scheme. In rolling-fashioned log aggregation IFile still fails with 
> the "append is not supported" error message, which is a known limitation of 
> the format by design. 
> There is a workaround though: by setting the rollover size in the 
> configuration of the IFile format, a new aggregated log file is created in 
> each rolling cycle, thus eliminating the append from the process. Setting 
> this config globally would cause performance problems in regular 
> log aggregation, so I'm suggesting enforcing this config to zero if the 
> scheme of the URI is s3a (or any other non-appendable filesystem).
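
A sketch of the flag-based variant suggested in the comment above; both 
property names are assumptions for illustration, not committed configuration 
keys:

{code}
import org.apache.hadoop.conf.Configuration;

final class RollOverSizeSketch {
  static long effectiveRollOverSize(Configuration conf, long defaultSize) {
    boolean nonAppendableFs = conf.getBoolean(
        "yarn.logaggregation.non-appendable-fs.enable", false);
    if (nonAppendableFs) {
      // Force a new aggregated log file per rolling cycle so that append,
      // unsupported on stores like s3a, is never attempted.
      return 0L;
    }
    return conf.getLong(
        "yarn.log-aggregation.ifile.roll-over.max-file-size", defaultSize);
  }
}
{code}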



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9327) ProtoUtils#convertToProtoFormat block Application Master Service and many more

2019-06-12 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862153#comment-16862153
 ] 

Sunil Govindan commented on YARN-9327:
--

This is a good catch.

I have seen the same synchronized block used earlier in some other proto 
classes. In my view it is not needed, as you also suggested. +1 for this.

[~leftnoteasy] [~cheersyang], thoughts?

> ProtoUtils#convertToProtoFormat block Application Master Service and many more
> --
>
> Key: YARN-9327
> URL: https://issues.apache.org/jira/browse/YARN-9327
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9327.001.patch
>
>
> {code}
>   public static synchronized ResourceProto convertToProtoFormat(Resource r) {
> return ResourcePBImpl.getProto(r);
>   }
> {code}
> {noformat}
> "IPC Server handler 41 on 23764" #324 daemon prio=5 os_prio=0 
> tid=0x7f181de72800 nid=0x222 waiting for monitor entry 
> [0x7ef153dad000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:404)
>   - waiting to lock <0x7ef2d8bcf6d8> (a java.lang.Class for 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:315)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:262)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:289)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:228)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:844)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:72)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:810)
>   - locked <0x7f0fed96f500> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:799)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>   at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:13810)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:158)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:198)
>   - eliminated <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:103)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:824)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2684){noformat}
> Seems synchronization is not required here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (YARN-9545) Create healthcheck REST endpoint for ATSv2

2019-06-12 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862121#comment-16862121
 ] 

Sunil Govindan commented on YARN-9545:
--

Corrected and committed the patch without the hidden directory which caused the 
problem mentioned above. Thanks [~ste...@apache.org]

> Create healthcheck REST endpoint for ATSv2
> --
>
> Key: YARN-9545
> URL: https://issues.apache.org/jira/browse/YARN-9545
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9545.001.patch, YARN-9545.002.patch, 
> YARN-9545.003.patch, YARN-9545.004.patch, YARN-9545.branch-3.2.001.patch, 
> YARN-9545.branch-3.2.002.patch
>
>
> RM UI2 and CM needs a health check url for ATSv2 service.
> Create a /health rest endpoint
>  * must respond with 200 \{health: ok} if all ok
>  * must respond with non 200 if any problem occurs
>  * could check reader/writer connection



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9545) Create healthcheck REST endpoint for ATSv2

2019-06-12 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862112#comment-16862112
 ] 

Sunil Govindan commented on YARN-9545:
--

Hi [~ste...@apache.org], you are correct.

It was a mistake - a couple of hidden files went in with this commit. I am 
reverting it.

> Create healthcheck REST endpoint for ATSv2
> --
>
> Key: YARN-9545
> URL: https://issues.apache.org/jira/browse/YARN-9545
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9545.001.patch, YARN-9545.002.patch, 
> YARN-9545.003.patch, YARN-9545.004.patch, YARN-9545.branch-3.2.001.patch, 
> YARN-9545.branch-3.2.002.patch
>
>
> RM UI2 and CM needs a health check url for ATSv2 service.
> Create a /health rest endpoint
>  * must respond with 200 \{health: ok} if all ok
>  * must respond with non 200 if any problem occurs
>  * could check reader/writer connection



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9573) DistributedShell cannot specify LogAggregationContext

2019-06-06 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857468#comment-16857468
 ] 

Sunil Govindan commented on YARN-9573:
--

Back-ported to trunk.

Could you please help to backport to branch-3.2/3.1?

> DistributedShell cannot specify LogAggregationContext
> -
>
> Key: YARN-9573
> URL: https://issues.apache.org/jira/browse/YARN-9573
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: distributed-shell, log-aggregation, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9573.001.patch, YARN-9573.002.patch, 
> YARN-9573.002.patch, YARN-9573.003.patch
>
>
> When DShell sends the application request object to the RM, it doesn't 
> specify the LogAggregationContext object - thus it is not possible to run 
> DShell with various log-aggregation configurations, e.g. rolling-fashioned 
> log aggregation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9573) DistributedShell cannot specify LogAggregationContext

2019-06-04 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855746#comment-16855746
 ] 

Sunil Govindan commented on YARN-9573:
--

Thanks [~adam.antal]

I definitely would love to see the removal of unused imports. This makes the 
code cleaner.

> DistributedShell cannot specify LogAggregationContext
> -
>
> Key: YARN-9573
> URL: https://issues.apache.org/jira/browse/YARN-9573
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: distributed-shell, log-aggregation, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9573.001.patch, YARN-9573.002.patch, 
> YARN-9573.002.patch, YARN-9573.003.patch
>
>
> When DShell sends the application request object to the RM, it doesn't 
> specify the LogAggregationContext object - thus it is not possible to run 
> DShell with various log-aggregation configurations, e.g. rolling-fashioned 
> log aggregation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8906) [UI2] NM hostnames not displayed correctly in Node Heatmap Chart

2019-06-03 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854440#comment-16854440
 ] 

Sunil Govindan commented on YARN-8906:
--

Thanks [~akhilpb]

> [UI2] NM hostnames not displayed correctly in Node Heatmap Chart
> 
>
> Key: YARN-8906
> URL: https://issues.apache.org/jira/browse/YARN-8906
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Charan Hebri
>Assignee: Akhil PB
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: Node_Heatmap_Chart.png, Node_Heatmap_Chart_Fixed.png, 
> YARN-8906.001.patch, YARN-8906.002.patch
>
>
> Hostnames displayed on the Node Heatmap Chart look garbled and are not 
> clearly visible. Attached screenshot.
> cc [~akhilpb]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8906) [UI2] NM hostnames not displayed correctly in Node Heatmap Chart

2019-06-03 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854439#comment-16854439
 ] 

Sunil Govindan commented on YARN-8906:
--

+1

> [UI2] NM hostnames not displayed correctly in Node Heatmap Chart
> 
>
> Key: YARN-8906
> URL: https://issues.apache.org/jira/browse/YARN-8906
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Charan Hebri
>Assignee: Akhil PB
>Priority: Major
> Attachments: Node_Heatmap_Chart.png, Node_Heatmap_Chart_Fixed.png, 
> YARN-8906.001.patch, YARN-8906.002.patch
>
>
> Hostnames displayed on the Node Heatmap Chart look garbled and are not 
> clearly visible. Attached screenshot.
> cc [~akhilpb]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9545) Create healthcheck REST endpoint for ATSv2

2019-05-31 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852793#comment-16852793
 ] 

Sunil Govindan commented on YARN-9545:
--

Thanks [~snemeth] for the thoughts.

YARN-9016 does not have to be backported, as it is a new feature altogether. 
For this patch, could we skip such a new implementation and use only NoOp and 
FileSystem?

> Create healthcheck REST endpoint for ATSv2
> --
>
> Key: YARN-9545
> URL: https://issues.apache.org/jira/browse/YARN-9545
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9545.001.patch, YARN-9545.002.patch, 
> YARN-9545.003.patch, YARN-9545.004.patch
>
>
> RM UI2 and CM needs a health check url for ATSv2 service.
> Create a /health rest endpoint
>  * must respond with 200 \{health: ok} if all ok
>  * must respond with non 200 if any problem occurs
>  * could check reader/writer connection



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9543) [UI2] Handle ATSv2 server down or failures cases gracefully in YARN UI v2

2019-05-31 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852722#comment-16852722
 ] 

Sunil Govindan commented on YARN-9543:
--

Committed to trunk.

I'll backport to 3.2 and 3.1 when YARN-9545 is ready for commit in those 
branches.

> [UI2] Handle ATSv2 server down or failures cases gracefully in YARN UI v2
> -
>
> Key: YARN-9543
> URL: https://issues.apache.org/jira/browse/YARN-9543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2, yarn-ui-v2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9543.001.patch, YARN-9543.002.patch
>
>
> Resource manager UI2 is throwing some console errors and an error page on the 
> flows page.
> Suggested improvements:
>  * Disable or remove the flows tab if ATSv2 is not available or not installed
>  * Handle all connection errors to ATSv2 gracefully



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9543) [UI2] Handle ATSv2 server down or failures cases gracefully in YARN UI v2

2019-05-31 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9543:
-
Summary: [UI2] Handle ATSv2 server down or failures cases gracefully in 
YARN UI v2  (was: UI2 should handle missing ATSv2 gracefully)

> [UI2] Handle ATSv2 server down or failures cases gracefully in YARN UI v2
> -
>
> Key: YARN-9543
> URL: https://issues.apache.org/jira/browse/YARN-9543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2, yarn-ui-v2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9543.001.patch, YARN-9543.002.patch
>
>
> Resource manager UI2 is throwing some console errors and an error page on the 
> flows page.
> Suggested improvements:
>  * Disable or remove the flows tab if ATSv2 is not available or not installed
>  * Handle all connection errors to ATSv2 gracefully



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9545) Create healthcheck REST endpoint for ATSv2

2019-05-30 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852662#comment-16852662
 ] 

Sunil Govindan commented on YARN-9545:
--

This patch is committed to trunk.

The branch-3.2 patch is failing to apply; could you please help to rebase?

> Create healthcheck REST endpoint for ATSv2
> --
>
> Key: YARN-9545
> URL: https://issues.apache.org/jira/browse/YARN-9545
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9545.001.patch, YARN-9545.002.patch, 
> YARN-9545.003.patch, YARN-9545.004.patch
>
>
> RM UI2 and CM needs a health check url for ATSv2 service.
> Create a /health rest endpoint
>  * must respond with 200 \{health: ok} if all ok
>  * must respond with non 200 if any problem occurs
>  * could check reader/writer connection



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9033) ResourceHandlerChain#bootstrap is invoked twice during NM start if LinuxContainerExecutor enabled

2019-05-30 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852655#comment-16852655
 ] 

Sunil Govindan commented on YARN-9033:
--

Looks good to me. +1

 

> ResourceHandlerChain#bootstrap is invoked twice during NM start if 
> LinuxContainerExecutor enabled
> -
>
> Key: YARN-9033
> URL: https://issues.apache.org/jira/browse/YARN-9033
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9033-trunk.001.patch, YARN-9033-trunk.002.patch, 
> YARN-9033-trunk.003.patch
>
>
> ResourceHandlerChain#bootstrap will always be invoked in the NM's 
> ContainerScheduler#serviceInit (introduced by YARN-7715).
> So if LCE is enabled, ResourceHandlerChain#bootstrap will be invoked first 
> and then invoked again in ContainerScheduler#serviceInit.
> But actually, the "updateContainer" invocation in YARN-7715 depends on the 
> containerId's cgroups path creation in the "preStart" method, which only 
> happens when we use "LinuxContainerExecutor". So the bootstrap of 
> ResourceHandlerChain shouldn't happen in ContainerScheduler#serviceInit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9545) Create healthcheck REST endpoint for ATSv2

2019-05-30 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852102#comment-16852102
 ] 

Sunil Govindan commented on YARN-9545:
--

Thanks [~zsiegl], let's go with this approach. Committing shortly.

> Create healthcheck REST endpoint for ATSv2
> --
>
> Key: YARN-9545
> URL: https://issues.apache.org/jira/browse/YARN-9545
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9545.001.patch, YARN-9545.002.patch, 
> YARN-9545.003.patch, YARN-9545.004.patch
>
>
> RM UI2 and CM needs a health check url for ATSv2 service.
> Create a /health rest endpoint
>  * must respond with 200 \{health: ok} if all ok
>  * must respond with non 200 if any problem occurs
>  * could check reader/writer connection



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9452) Fix TestDistributedShell and TestTimelineAuthFilterForV2 failures

2019-05-30 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9452:
-
Summary: Fix TestDistributedShell and TestTimelineAuthFilterForV2 failures  
(was: Fix failing testcases TestDistributedShell and 
TestTimelineAuthFilterForV2)

> Fix TestDistributedShell and TestTimelineAuthFilterForV2 failures
> -
>
> Key: YARN-9452
> URL: https://issues.apache.org/jira/browse/YARN-9452
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2, distributed-shell, test
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9452-001.patch, YARN-9452-002.patch, 
> YARN-9452-003.patch, YARN-9452-004.patch
>
>
> *TestDistributedShell#testDSShellWithoutDomainV2CustomizedFlow*
> {code}
> [ERROR] 
> testDSShellWithoutDomainV2CustomizedFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
>   Time elapsed: 72.14 s  <<< FAILURE!
> java.lang.AssertionError: Entity ID prefix should be same across each publish 
> of same entity expected:<9223372036854775806> but was:<9223370482298585580>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.verifyEntityForTimelineV2(TestDistributedShell.java:695)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:588)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:459)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow(TestDistributedShell.java:330)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> *TestTimelineAuthFilterForV2#testPutTimelineEntities*
> {code}
> [ERROR] 
> testPutTimelineEntities[3](org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2)
>   Time elapsed: 1.047 s  <<< FAILURE!
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertNotNull(Assert.java:712)
>   at org.junit.Assert.assertNotNull(Assert.java:722)
>   at 
> org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.verifyEntity(TestTimelineAuthFilterForV2.java:282)
>   at 
> org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.testPutTimelineEntities(TestTimelineAuthFilterForV2.java:421)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> 

[jira] [Commented] (YARN-9452) Fix failing testcases TestDistributedShell and TestTimelineAuthFilterForV2

2019-05-30 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852076#comment-16852076
 ] 

Sunil Govindan commented on YARN-9452:
--

+1 on latest patch. 

Committing now.

> Fix failing testcases TestDistributedShell and TestTimelineAuthFilterForV2
> --
>
> Key: YARN-9452
> URL: https://issues.apache.org/jira/browse/YARN-9452
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2, distributed-shell, test
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9452-001.patch, YARN-9452-002.patch, 
> YARN-9452-003.patch, YARN-9452-004.patch
>
>
> *TestDistributedShell#testDSShellWithoutDomainV2CustomizedFlow*
> {code}
> [ERROR] 
> testDSShellWithoutDomainV2CustomizedFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
>   Time elapsed: 72.14 s  <<< FAILURE!
> java.lang.AssertionError: Entity ID prefix should be same across each publish 
> of same entity expected:<9223372036854775806> but was:<9223370482298585580>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.verifyEntityForTimelineV2(TestDistributedShell.java:695)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:588)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:459)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow(TestDistributedShell.java:330)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> *TestTimelineAuthFilterForV2#testPutTimelineEntities*
> {code}
> [ERROR] 
> testPutTimelineEntities[3](org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2)
>   Time elapsed: 1.047 s  <<< FAILURE!
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertNotNull(Assert.java:712)
>   at org.junit.Assert.assertNotNull(Assert.java:722)
>   at 
> org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.verifyEntity(TestTimelineAuthFilterForV2.java:282)
>   at 
> org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.testPutTimelineEntities(TestTimelineAuthFilterForV2.java:421)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 

[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder

2019-05-30 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851569#comment-16851569
 ] 

Sunil Govindan commented on YARN-9525:
--

[~adam.antal]

Are we planning to fix the negative offset error in this JIRA?

> IFile format is not working against s3a remote folder
> -
>
> Key: YARN-9525
> URL: https://issues.apache.org/jira/browse/YARN-9525
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch
>
>
> Using the IndexedFileFormat {{yarn.nodemanager.remote-app-log-dir}} 
> configured to an s3a URI throws the following exception during log 
> aggregation:
> {noformat}
> Cannot create writer for app application_1556199768861_0001. Skip log upload 
> this time. 
> java.io.IOException: java.io.FileNotFoundException: No such file or 
> directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195)
>   ... 7 more
> {noformat}
> This stack trace points to 
> {{LogAggregationIndexedFileController$initializeWriter}}, where we do the 
> following steps (in a non-rolling log-aggregation setup), as shown in the 
> sketch below:
> - create an FSDataOutputStream
> - write out a UUID
> - flush
> - immediately after that, call getFileStatus to get the length of the log 
> file (the bytes we just wrote out), and that's where the failure happens: 
> the file is not there yet due to eventual consistency.
> Maybe we can get rid of that, so we can use the IFile format against an s3a 
> target.
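
The failing sequence, reduced to its essence as a sketch (the real 
initializeWriter also handles permissions, security, and cleanup):

{code}
import java.util.EnumSet;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

final class IFileWriteSketch {
  static long writeUuidAndStat(FileContext fc, Path remoteLogFile,
      byte[] uuidBytes) throws Exception {
    FSDataOutputStream out =
        fc.create(remoteLogFile, EnumSet.of(CreateFlag.CREATE));
    out.write(uuidBytes);
    out.flush(); // stream intentionally left open; the writer keeps appending
    // On an eventually consistent store such as s3a the file may not be
    // visible yet, so this getFileStatus can throw FileNotFoundException.
    return fc.getFileStatus(remoteLogFile).getLen();
  }
}
{code}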



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9482) DistributedShell job with localization fails in unsecure cluster

2019-05-27 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848940#comment-16848940
 ] 

Sunil Govindan commented on YARN-9482:
--

Thanks [~giovanni.fumarola], [~Prabhu Joseph] and [~pbacsko]

Could we get this into branch-3.2/3.1 as well? [~Prabhu Joseph], is this good 
to backport? Thanks.

> DistributedShell job with localization fails in unsecure cluster
> 
>
> Key: YARN-9482
> URL: https://issues.apache.org/jira/browse/YARN-9482
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9482-001.patch, YARN-9482-002.patch, 
> YARN-9482-003.patch, YARN-9482-004.patch
>
>
> A DistributedShell job with localization fails in an unsecure cluster. The 
> client localizes the input files to the job user's home directory, whereas 
> the AM, running as the yarn user, reads from its own home directory.
> *Command:*
> {code}
> yarn jar 
> /HADOOP/hadoop-3.2.0/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.2.0.jar
>  -shell_command ls  -shell_args / -jar  
> /HADOOP/hadoop-3.2.0/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.2.0.jar
>  -localize_files /tmp/prabhu
> {code}
> {code}
> Exception in thread "Thread-4" java.io.UncheckedIOException: Error during 
> localization setup
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster$LaunchContainerRunnable.lambda$run$0(ApplicationMaster.java:1495)
>   at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>   at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster$LaunchContainerRunnable.run(ApplicationMaster.java:1481)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File does not exist: 
> hdfs://yarn-ats-1:8020/user/yarn/DistributedShell/application_1554817981283_0003/prabhu
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1586)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1594)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster$LaunchContainerRunnable.lambda$run$0(ApplicationMaster.java:1487)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9543) UI2 should handle missing ATSv2 gracefully

2019-05-27 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848936#comment-16848936
 ] 

Sunil Govindan commented on YARN-9543:
--

Getting this in now. Thanks [~zsiegl] and [~akhilpb]

 

> UI2 should handle missing ATSv2 gracefully
> --
>
> Key: YARN-9543
> URL: https://issues.apache.org/jira/browse/YARN-9543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2, yarn-ui-v2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9543.001.patch, YARN-9543.002.patch
>
>
> Resource manager UI2 is throwing some console errors and an error page on the 
> flows page.
> Suggested improvements:
>  * Disable or remove the flows tab if ATSv2 is not available or not installed
>  * Handle all connection errors to ATSv2 gracefully



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9545) Create healthcheck REST endpoint for ATSv2

2019-05-27 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848935#comment-16848935
 ] 

Sunil Govindan commented on YARN-9545:
--

Thanks, [~zsiegl]. In general the DAO class looks good and forward-looking, and 
the tests are good too, covering the basic cases. I have one small concern 
about {{isConnectionAlive}}. This method returns a simple boolean, and based on 
that a decision is taken on whether to consider the Timeline Reader up or down.

Since this method is a new interface, I suggest we return an ENUM from it and 
rename it to {{getConnectionStatus}}. With this, we can return statuses like 
RUNNING, CONNECTION_FAILURE etc. And in {{TimelineReaderWebServices}}, we can 
reimplement the {{health}} method with a switch statement covering the basic 2 
scenarios to start with (a rough sketch follows the quoted description below), 
which will be pretty forward-looking as well.

Thoughts?

> Create healthcheck REST endpoint for ATSv2
> --
>
> Key: YARN-9545
> URL: https://issues.apache.org/jira/browse/YARN-9545
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9545.001.patch, YARN-9545.002.patch, 
> YARN-9545.003.patch
>
>
> RM UI2 and CM needs a health check url for ATSv2 service.
> Create a /health rest endpoint
>  * must respond with 200 \{health: ok} if all ok
>  * must respond with non 200 if any problem occurs
>  * could check reader/writer connection
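
A rough sketch of the enum-based status and the switch in the health endpoint, 
as suggested in the comment above; the names and JSON payloads are 
illustrative, not the committed API:

{code}
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/ws/v2/timeline")
public class TimelineHealthSketch {
  enum ConnectionStatus { RUNNING, CONNECTION_FAILURE }

  // Stand-in for the reader's real connectivity probe.
  ConnectionStatus getConnectionStatus() {
    return ConnectionStatus.RUNNING;
  }

  @GET
  @Path("/health")
  @Produces(MediaType.APPLICATION_JSON)
  public Response health() {
    switch (getConnectionStatus()) {
    case RUNNING:
      return Response.ok("{\"health\": \"ok\"}").build();
    case CONNECTION_FAILURE:
    default:
      return Response.status(Response.Status.SERVICE_UNAVAILABLE)
          .entity("{\"health\": \"connection failure\"}").build();
    }
  }
}
{code}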



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9543) UI2 should handle missing ATSv2 gracefully

2019-05-23 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846695#comment-16846695
 ] 

Sunil Govindan commented on YARN-9543:
--

The patch looks good to me.

[~zsiegl], could you also please cross-check?

> UI2 should handle missing ATSv2 gracefully
> --
>
> Key: YARN-9543
> URL: https://issues.apache.org/jira/browse/YARN-9543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2, yarn-ui-v2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9543.001.patch, YARN-9543.002.patch
>
>
> Resource manager UI2 is throwing some console errors and an error page on the 
> flows page.
> Suggested improvements:
>  * Disable or remove the flows tab if ATSv2 is not available or not installed
>  * Handle all connection errors to ATSv2 gracefully



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9545) Create healthcheck REST endpoint for ATSv2

2019-05-23 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846694#comment-16846694
 ] 

Sunil Govindan commented on YARN-9545:
--

[~zsiegl] could you please check the findbugs warnings?

> Create healthcheck REST endpoint for ATSv2
> --
>
> Key: YARN-9545
> URL: https://issues.apache.org/jira/browse/YARN-9545
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9545.001.patch, YARN-9545.002.patch, 
> YARN-9545.003.patch
>
>
> RM UI2 and CM need a health check URL for the ATSv2 service.
> Create a /health REST endpoint
>  * must respond with 200 \{health: ok} if all ok
>  * must respond with non 200 if any problem occurs
>  * could check reader/writer connection



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9546) Add configuration option for YARN Native services AM classpath

2019-05-20 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9546:
-
Summary: Add configuration option for YARN Native services AM classpath  
(was: Add configuration option for yarn services AM classpath)

> Add configuration option for YARN Native services AM classpath
> --
>
> Key: YARN-9546
> URL: https://issues.apache.org/jira/browse/YARN-9546
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-9546.001.patch, YARN-9546.002.patch
>
>
> For regular containers we have the yarn.application.classpath property, which 
> allows users to add extra elements to the container's classpath. 
> However, YARN services deliberately ignores this property to avoid 
> incompatible class collisions. On systems where the configuration 
> files for containers are located somewhere other than the path stored in 
> $HADOOP_CONF_DIR, there is then no way to modify the classpath to include 
> other directories with the actual configuration.
> Suggestion: let's create a new property which allows us to add extra elements 
> to the classpath generated for YARN service AM containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9546) Add configuration option for yarn services AM classpath

2019-05-20 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844017#comment-16844017
 ] 

Sunil Govindan commented on YARN-9546:
--

+1 on the latest patch.

Committing the same.

> Add configuration option for yarn services AM classpath
> ---
>
> Key: YARN-9546
> URL: https://issues.apache.org/jira/browse/YARN-9546
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-9546.001.patch, YARN-9546.002.patch
>
>
> For regular containers we have the yarn.application.classpath property, which 
> allows users to add extra elements to the container's classpath. 
> However, YARN services deliberately ignores this property to avoid 
> incompatible class collisions. On systems where the configuration 
> files for containers are located somewhere other than the path stored in 
> $HADOOP_CONF_DIR, there is then no way to modify the classpath to include 
> other directories with the actual configuration.
> Suggestion: let's create a new property which allows us to add extra elements 
> to the classpath generated for YARN service AM containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9519) TFile log aggregation file format is not working for yarn.log-aggregation.TFile.remote-app-log-dir config

2019-05-14 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-9519:
-
Summary: TFile log aggregation file format is not working for 
yarn.log-aggregation.TFile.remote-app-log-dir config  (was: TFile log 
aggregation file format is insensitive to the 
yarn.log-aggregation.TFile.remote-app-log-dir config)

> TFile log aggregation file format is not working for 
> yarn.log-aggregation.TFile.remote-app-log-dir config
> -
>
> Key: YARN-9519
> URL: https://issues.apache.org/jira/browse/YARN-9519
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9519.001.patch, YARN-9519.002.patch, 
> YARN-9519.003.patch, YARN-9519.004.patch, YARN-9519.005.patch
>
>
> The TFile log aggregation file format is not sensitive to the 
> yarn.log-aggregation.TFile.remote-app-log-dir config.
> In {{LogAggregationTFileController$initInternal}}:
> {code:java}
> this.remoteRootLogDir = new Path(
>     conf.get(YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
>         YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR));
> {code}
> So the remoteRootLogDir is only aware of the 
> yarn.nodemanager.remote-app-log-dir config, while other file formats, like 
> IFile, default to the format-specific config first, giving it higher priority.
> From {{LogAggregationIndexedFileController$initInternal}}:
> {code:java}
> String remoteDirStr = String.format(
>     YarnConfiguration.LOG_AGGREGATION_REMOTE_APP_LOG_DIR_FMT,
>     this.fileControllerName);
> String remoteDir = conf.get(remoteDirStr);
> if (remoteDir == null || remoteDir.isEmpty()) {
>   remoteDir = conf.get(YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
>       YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR);
> }
> {code}
> (Where these configs are:)
> {code:java}
> public static final String LOG_AGGREGATION_REMOTE_APP_LOG_DIR_FMT
>     = YARN_PREFIX + "log-aggregation.%s.remote-app-log-dir";
> public static final String NM_REMOTE_APP_LOG_DIR =
>     NM_PREFIX + "remote-app-log-dir";
> {code}
> I suggest TFile should try to obtain the remote dir config from 
> yarn.log-aggregation.TFile.remote-app-log-dir first, and only fall back to 
> the yarn.nodemanager.remote-app-log-dir config if that is not specified.
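
A minimal sketch of that lookup order, mirroring the IFile logic quoted above 
(an illustration of the suggestion, not the committed patch):

{code:java}
// Sketch for LogAggregationTFileController#initInternal: prefer the
// per-format config and only fall back to the NM-wide one. Illustrative
// only; the names follow the snippets quoted above.
String remoteDirKey = String.format(
    YarnConfiguration.LOG_AGGREGATION_REMOTE_APP_LOG_DIR_FMT,
    this.fileControllerName);
String remoteDir = conf.get(remoteDirKey);
if (remoteDir == null || remoteDir.isEmpty()) {
  remoteDir = conf.get(YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
      YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR);
}
this.remoteRootLogDir = new Path(remoteDir);
{code}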



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9546) Add configuration option for yarn services AM classpath

2019-05-12 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838142#comment-16838142
 ] 

Sunil Govindan commented on YARN-9546:
--

Hi [~shuzirra]

As per this patch, you are introducing a new yarn config named 
*yarn.service.classpath* for configuring any classpath elements needed for 
native services. Since it's a new config, it's better to add a description for 
it in yarn-default.xml, such as "accepts comma separated values", and to set 
the default value as empty. Initially I thought it was an ENV variable and 
hence suggested not adding it to yarn-default. However, as per the current 
approach, it makes sense to add it to the xml.
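
For illustration, a minimal yarn-default.xml entry along these lines might look 
as follows (the description wording is an assumption, not committed text):

{code:xml}
<!-- Illustrative sketch only; the description text below is assumed, -->
<!-- not taken from the committed patch. -->
<property>
  <name>yarn.service.classpath</name>
  <value></value>
  <description>
    Comma-separated list of extra classpath elements to append to the
    classpath of YARN service AM containers. Empty by default.
  </description>
</property>
{code}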

Other than this, the patch looks fine. cc [~billie.rinaldi]

> Add configuration option for yarn services AM classpath
> ---
>
> Key: YARN-9546
> URL: https://issues.apache.org/jira/browse/YARN-9546
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-9546.001.patch
>
>
> For regular containers we have the yarn.application.classpath property, which 
> allows users to add extra elements to the container's classpath. 
> However, YARN services deliberately ignores this property to avoid 
> incompatible class collisions. On systems where the configuration 
> files for containers are located somewhere other than the path stored in 
> $HADOOP_CONF_DIR, there is then no way to modify the classpath to include 
> other directories with the actual configuration.
> Suggestion: let's create a new property which allows us to add extra elements 
> to the classpath generated for YARN service AM containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9545) Create healthcheck REST endpoint for ATSv2

2019-05-10 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16837318#comment-16837318
 ] 

Sunil Govindan commented on YARN-9545:
--

Thanks [~Prabhu Joseph] for the thoughts. At a minimum level, the steps you 
have mentioned are absolutely fine. However, [~zsiegl] is also trying to add 
some error codes to the response for cases where HBase is down or similar FATAL 
scenarios. I feel it may be a better idea to add a _*status*_ field to the 
*http://yarn-ats-3:8198/ws/v2/timeline* REST response to handle HBase etc. 
error statuses.

[~zsiegl] what do you think?
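
For illustration, a response carrying such a status field might look like this 
(the field name and values are assumptions for discussion, not a committed 
format):

{code:json}
{
  "About": "Timeline Reader API",
  "status": "CONNECTION_FAILURE"
}
{code}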

> Create healthcheck REST endpoint for ATSv2
> --
>
> Key: YARN-9545
> URL: https://issues.apache.org/jira/browse/YARN-9545
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
>
> RM UI2 and CM need a health check URL for the ATSv2 service.
> Create a /health REST endpoint
>  * must respond with 200 \{health: ok} if all ok
>  * must respond with non 200 if any problem occurs
>  * could check reader/writer connection



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


