[jira] [Commented] (YARN-8538) Fix valgrind leak check on container executor

2018-07-17 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547410#comment-16547410
 ] 

Bibin A Chundatt commented on YARN-8538:


[~billie.rinaldi]

I think this would be good to have in branch-3.1 as well. If applicable, could 
you cherry-pick it to branch-3.1?

> Fix valgrind leak check on container executor
> -
>
> Key: YARN-8538
> URL: https://issues.apache.org/jira/browse/YARN-8538
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8538.1.patch, YARN-8538.2.patch
>
>
> Running valgrind --leak-check=yes ./cetest gives us this:
> {noformat}
> ==14094== LEAK SUMMARY:
> ==14094==    definitely lost: 964,351 bytes in 1,154 blocks
> ==14094==    indirectly lost: 75,506 bytes in 3,777 blocks
> ==14094==  possibly lost: 0 bytes in 0 blocks
> ==14094==    still reachable: 554 bytes in 22 blocks
> ==14094== suppressed: 0 bytes in 0 blocks
> ==14094== Reachable blocks (those to which a pointer was found) are not shown.
> ==14094== To see them, rerun with: --leak-check=full --show-leak-kinds=all
> ==14094==
> ==14094== For counts of detected and suppressed errors, rerun with: -v
> ==14094== ERROR SUMMARY: 373 errors from 373 contexts (suppressed: 0 from 0)
> {noformat}






[jira] [Commented] (YARN-8529) Add timeout to RouterWebServiceUtil#invokeRMWebService

2018-07-17 Thread Giovanni Matteo Fumarola (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547249#comment-16547249
 ] 

Giovanni Matteo Fumarola commented on YARN-8529:


Thanks [~elgoiri] for the comments.

However, toMillis and getTimeDuration return long, so we will need a cast for 
this to work in the current code.
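
For illustration, a minimal sketch of that cast (assuming the configuration keys 
proposed in this patch; not the committed code):
{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimeoutCastSketch {
  // getTimeDuration returns long, so an explicit cast is needed wherever
  // the surrounding code expects an int.
  static int getConnectTimeout(Configuration conf) {
    return (int) conf.getTimeDuration(
        YarnConfiguration.ROUTER_WEBAPP_CONNECT_TIMEOUT_MS,
        YarnConfiguration.DEFAULT_ROUTER_WEBAPP_CONNECT_TIMEOUT_MS,
        TimeUnit.MILLISECONDS);
  }
}
{code}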

> Add timeout to RouterWebServiceUtil#invokeRMWebService
> --
>
> Key: YARN-8529
> URL: https://issues.apache.org/jira/browse/YARN-8529
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Giovanni Matteo Fumarola
>Priority: Major
> Attachments: YARN-8529.v1.patch, YARN-8529.v2.patch
>
>
> {{RouterWebServiceUtil#invokeRMWebService}} currently has a fixed timeout. 
> This should be configurable.






[jira] [Updated] (YARN-8529) Add timeout to RouterWebServiceUtil#invokeRMWebService

2018-07-17 Thread Giovanni Matteo Fumarola (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giovanni Matteo Fumarola updated YARN-8529:
---
Attachment: (was: YARN-8529.v2.patch)

> Add timeout to RouterWebServiceUtil#invokeRMWebService
> --
>
> Key: YARN-8529
> URL: https://issues.apache.org/jira/browse/YARN-8529
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Giovanni Matteo Fumarola
>Priority: Major
> Attachments: YARN-8529.v1.patch, YARN-8529.v2.patch
>
>
> {{RouterWebServiceUtil#invokeRMWebService}} currently has a fixed timeout. 
> This should be configurable.






[jira] [Updated] (YARN-8529) Add timeout to RouterWebServiceUtil#invokeRMWebService

2018-07-17 Thread Giovanni Matteo Fumarola (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giovanni Matteo Fumarola updated YARN-8529:
---
Attachment: YARN-8529.v2.patch

> Add timeout to RouterWebServiceUtil#invokeRMWebService
> --
>
> Key: YARN-8529
> URL: https://issues.apache.org/jira/browse/YARN-8529
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Giovanni Matteo Fumarola
>Priority: Major
> Attachments: YARN-8529.v1.patch, YARN-8529.v2.patch
>
>
> {{RouterWebServiceUtil#invokeRMWebService}} currently has a fixed timeout. 
> This should be configurable.






[jira] [Commented] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart

2018-07-17 Thread Robert Kanter (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547233#comment-16547233
 ] 

Robert Kanter commented on YARN-6966:
-

Looks like http://builds.apache.org/ is down...

> NodeManager metrics may return wrong negative values when NM restart
> 
>
> Key: YARN-6966
> URL: https://issues.apache.org/jira/browse/YARN-6966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-6966.001.patch, YARN-6966.002.patch, 
> YARN-6966.003.patch, YARN-6966.004.patch, YARN-6966.005.patch
>
>
> Just as in YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of the negative values is that metrics do not recover properly 
> when the NM restarts.
> AllocatedContainers, ContainersLaunched, AllocatedGB, AvailableGB, 
> AllocatedVCores and AvailableVCores in metrics also need to be recovered when 
> the NM restarts.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced by the following steps:
> # Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true and 
> YarnConfiguration.NM_RECOVERY_SUPERVISED=true in the NM
> # Submit an application and keep it running
> # Restart the NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
> name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
> modelerType: "NodeManagerMetrics",
> tag.Context: "yarn",
> tag.Hostname: "hadoop.com",
> ContainersLaunched: 0,
> ContainersCompleted: 0,
> ContainersFailed: 2,
> ContainersKilled: 0,
> ContainersIniting: 0,
> ContainersRunning: 0,
> AllocatedGB: 0,
> AllocatedContainers: -2,
> AvailableGB: 160,
> AllocatedVCores: -11,
> AvailableVCores: 3611,
> ContainerLaunchDurationNumOps: 2,
> ContainerLaunchDurationAvgTime: 6,
> BadLocalDirs: 0,
> BadLogDirs: 0,
> GoodLocalDirsDiskUtilizationPerc: 2,
> GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}
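
For context, a minimal, self-contained sketch (illustrative names, not the 
actual patch) of why skipping metric recovery yields the negative counters 
shown above:
{code:java}
// Illustrative sketch, not YARN code: if recovered containers are not
// re-applied to the metrics on NM restart, the release at application
// stop drives the counters below zero.
class NodeMetricsSketch {
  int allocatedContainers;
  int allocatedVCores;

  void allocate(int vcores) { allocatedContainers++; allocatedVCores += vcores; }
  void release(int vcores)  { allocatedContainers--; allocatedVCores -= vcores; }

  // The fix described above: invoke this from container recovery
  // (ContainerManagerImpl#recoverContainer) so a later release() balances out.
  void recover(int vcores)  { allocate(vcores); }

  public static void main(String[] args) {
    NodeMetricsSketch m = new NodeMetricsSketch();
    // NM restarts while a container holding 11 vcores is running; without
    // m.recover(11) here, stopping the app leaves AllocatedVCores at -11.
    m.release(11);
    System.out.println(m.allocatedContainers + " / " + m.allocatedVCores);
  }
}
{code}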






[jira] [Commented] (YARN-8501) Reduce complexity of RMWebServices' getApps method

2018-07-17 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547221#comment-16547221
 ] 

genericqa commented on YARN-8501:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  7m 
57s{color} | {color:red} Docker failed to build yetus/hadoop:abb62dd. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-8501 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12931991/YARN-8501.005.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21279/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Reduce complexity of RMWebServices' getApps method
> --
>
> Key: YARN-8501
> URL: https://issues.apache.org/jira/browse/YARN-8501
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: restapi
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8501.001.patch, YARN-8501.002.patch, 
> YARN-8501.003.patch, YARN-8501.004.patch, YARN-8501.005.patch
>
>







[jira] [Commented] (YARN-8529) Add timeout to RouterWebServiceUtil#invokeRMWebService

2018-07-17 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547211#comment-16547211
 ] 

genericqa commented on YARN-8529:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  0m 
11s{color} | {color:red} Docker failed to build yetus/hadoop:abb62dd. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-8529 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12931990/YARN-8529.v2.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21278/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Add timeout to RouterWebServiceUtil#invokeRMWebService
> --
>
> Key: YARN-8529
> URL: https://issues.apache.org/jira/browse/YARN-8529
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Giovanni Matteo Fumarola
>Priority: Major
> Attachments: YARN-8529.v1.patch, YARN-8529.v2.patch
>
>
> {{RouterWebServiceUtil#invokeRMWebService}} currently has a fixed timeout. 
> This should be configurable.






[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps

2018-07-17 Thread Eric Payne (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547161#comment-16547161
 ] 

Eric Payne commented on YARN-4606:
--

Thank you, [~maniraj...@gmail.com], for the latest patch.

The code changes look good. However, I have a couple of concerns about the tests.

- I have a general concern that these tests are not testing the fix to the 
starvation problem outlined in the description of this JIRA. I'm trying to 
determine if there is a clean way to unit test that use case.
- {{TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps}}: I am 
concerned about new tests that take longer than necessary, because the unit 
tests keep taking longer and longer to run. I think the following things 
can be done to reduce this test's time (in my build environment) from 1 min 
17 sec to 24 sec.
-- In the following code, the sleep(5000) outside of the for loop is not 
necessary.
-- In the following code, the sleep(5000) inside of the for loop could be cut 
down to sleep(500).
{code:title=TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps}
Thread.sleep(5000);

//Triggering this event so that user limit computation can
//happen again
for (int i = 0; i < 10; i++) {
  cs.handle(new NodeUpdateSchedulerEvent(rmNode1));
  Thread.sleep(5000);
}
{code}

- {{TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps1}}: I 
don't think this test is necessary. It takes more than 1:20 to run in my build 
environment, and as far as I can tell, it verifies that the active user 
count is never updated after a move if node heartbeats are not received. 
However, in a running YARN installation, node heartbeats are received every 
second (by default). Unless I'm missing something, this isn't a use case that 
one would encounter in a running Hadoop system.
- {{TestCapacityScheduler#setupQueueConfigurationForActiveUsersChecks}}: The 
parameters to {{conf.setUserLimitFactor(...)}} don't need to be 100.0f. User 
limit factor can be thought of as the multiplier for the amount of a queue that 
one user can consume. So, if the user limit factor is 1.0f, one user can use 
the capacity of the queue. If it is 2.0f, one user can use twice the capacity 
of the queue, and so forth. Since these queues have a capacity of 50%, I would 
set this to 2.0f.
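
To illustrate that arithmetic, a sketch against CapacitySchedulerConfiguration 
(not part of the patch under review):
{code:java}
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration;

public class UserLimitFactorSketch {
  public static void main(String[] args) {
    CapacitySchedulerConfiguration conf = new CapacitySchedulerConfiguration();
    // Effective per-user cap = user-limit-factor * queue capacity.
    // With a 50% queue, 2.0f already lets one user reach the whole
    // cluster (2.0 * 50% = 100%), so 100.0f buys nothing extra.
    conf.setCapacity("root.a", 50.0f);
    conf.setUserLimitFactor("root.a", 2.0f);
  }
}
{code}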


> CapacityScheduler: applications could get starved because computation of 
> #activeUsers considers pending apps 
> -
>
> Key: YARN-4606
> URL: https://issues.apache.org/jira/browse/YARN-4606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Karam Singh
>Assignee: Manikandan R
>Priority: Critical
> Attachments: YARN-4606.001.patch, YARN-4606.002.patch, 
> YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, 
> YARN-4606.006.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, 
> YARN-4606.POC.3.patch, YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue are 
> pending (caused by max-am-percent, etc.), ActiveUsersManager still considers 
> the user an active user. This could lead to starvation of active applications, 
> for example:
> - App1 (belongs to user1) and app2 (belongs to user2) are active; app3 (belongs 
> to user3) and app4 (belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, only two users (user1/user2) are able to allocate new resources, so 
> the computed user-limit-resource could be lower than expected.






[jira] [Commented] (YARN-7300) DiskValidator is not used in LocalDirAllocator

2018-07-17 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547101#comment-16547101
 ] 

Haibo Chen commented on YARN-7300:
--

+1 on the 02 patch pending Jenkins.

> DiskValidator is not used in LocalDirAllocator
> --
>
> Key: YARN-7300
> URL: https://issues.apache.org/jira/browse/YARN-7300
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Haibo Chen
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-7300.001.patch, YARN-7300.002.patch
>
>
> HADOOP-13254 introduced a pluggable disk validator to replace 
> DiskChecker.checkDir(). However, LocalDirAllocator still references the old 
> DiskChecker.checkDir(). It'd be nice to
> use the plugin uniformly so that user configurations take effect in all 
> places.






[jira] [Commented] (YARN-8436) FSParentQueue: Comparison method violates its general contract

2018-07-17 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547096#comment-16547096
 ] 

Haibo Chen commented on YARN-8436:
--

Thanks [~wilfreds] for the patch! I have some minor comments:

1) "TimSort which is used does not handle the mod of objects it sorts" => 
"TimSort, which is used, does not handle the concurrent modification of objects 
it is sorting"

2) Using sleep to synchronize two threads seems flaky. Do you think having the 
Comparator and the thread share a CountDownLatch would be a better alternative? 
That is, when Comparator.compare() is called, the latch is released, indicating 
that the sort has started, and the modification thread can wait on the latch.
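
Roughly what that could look like (illustrative names, not the actual test code):
{code:java}
import java.util.Comparator;
import java.util.concurrent.CountDownLatch;

public class SortSyncSketch {
  static final CountDownLatch SORT_STARTED = new CountDownLatch(1);

  // compare() releases the latch on its first invocation, signalling that
  // Collections.sort() is underway.
  static final Comparator<Object> COMPARATOR = (a, b) -> {
    SORT_STARTED.countDown();
    return Integer.compare(a.hashCode(), b.hashCode());
  };

  // The modifying thread waits on the latch instead of sleeping, which
  // makes the interleaving deterministic rather than timing-dependent.
  static final Runnable MODIFIER = () -> {
    try {
      SORT_STARTED.await();
      // ... mutate a child queue here to provoke the original failure
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  };
}
{code}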

> FSParentQueue: Comparison method violates its general contract
> --
>
> Key: YARN-8436
> URL: https://issues.apache.org/jira/browse/YARN-8436
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
> Attachments: YARN-8436.001.patch, YARN-8436.002.patch
>
>
> The ResourceManager can fail while sorting queues if an update comes in:
> {code:java}
> FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>   at java.util.TimSort.mergeLo(TimSort.java:777)
>   at java.util.TimSort.mergeAt(TimSort.java:514)
> ...
>   at java.util.Collections.sort(Collections.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:223){code}
> The reason it breaks is a change in the sorted objects themselves. 
> This is why it fails:
>  * an update from a node comes in as a heartbeat.
>  * the update triggers a check to see if we can assign a container on the 
> node.
>  * we walk over the queue hierarchy, top down, to find a queue to assign a 
> container to.
>  * for each parent queue we sort the child queues in {{assignContainer}} to 
> decide which queue to descend into.
>  * we lock the parent queue while sorting to prevent changes, but we do not 
> lock the child queues that we are sorting.
> If, during this sorting, a different node update changes a child queue, we 
> allow that. This means that the objects we are trying to sort might now be 
> out of order. That causes the issue with the comparator. The comparator 
> itself is not broken.






[jira] [Commented] (YARN-8534) [GPG] Add max heap config option for Federation GPG

2018-07-17 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547089#comment-16547089
 ] 

Botong Huang commented on YARN-8534:


Closing this Jira as won't fix. See YARN-8536 for discussions. 

> [GPG] Add max heap config option for Federation GPG
> ---
>
> Key: YARN-8534
> URL: https://issues.apache.org/jira/browse/YARN-8534
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
> Attachments: YARN-8534-YARN-7402.v1.patch, 
> YARN-8534-YARN-7402.v2.patch
>
>







[jira] [Commented] (YARN-8536) Add max heap config option for Federation Router

2018-07-17 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547086#comment-16547086
 ] 

Botong Huang commented on YARN-8536:


Ok, nvm then. Thanks [~aw] and [~giovanni.fumarola] for comments! 

> Add max heap config option for Federation Router
> 
>
> Key: YARN-8536
> URL: https://issues.apache.org/jira/browse/YARN-8536
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
> Attachments: YARN-8536.v1.patch
>
>







[jira] [Commented] (YARN-7129) Application Catalog for YARN applications

2018-07-17 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547041#comment-16547041
 ] 

genericqa commented on YARN-7129:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  0m  
8s{color} | {color:red} Docker failed to build yetus/hadoop:abb62dd. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-7129 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12931952/YARN-7129.011.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21277/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Application Catalog for YARN applications
> -
>
> Key: YARN-7129
> URL: https://issues.apache.org/jira/browse/YARN-7129
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: applications
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN Appstore.pdf, YARN-7129.001.patch, 
> YARN-7129.002.patch, YARN-7129.003.patch, YARN-7129.004.patch, 
> YARN-7129.005.patch, YARN-7129.006.patch, YARN-7129.007.patch, 
> YARN-7129.008.patch, YARN-7129.009.patch, YARN-7129.010.patch, 
> YARN-7129.011.patch
>
>
> YARN native services provides a web services API to improve the usability of 
> application deployment on Hadoop using collections of Docker images. It would 
> be nice to have an application catalog system which provides an editorial and 
> search interface for YARN applications. This improves the usability of YARN 
> for managing the life cycle of applications.






[jira] [Commented] (YARN-8529) Add timeout to RouterWebServiceUtil#invokeRMWebService

2018-07-17 Thread JIRA


[ 
https://issues.apache.org/jira/browse/YARN-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547028#comment-16547028
 ] 

Íñigo Goiri commented on YARN-8529:
---

I think we can do:
{code}
public static final int DEFAULT_ROUTER_WEBAPP_CONNECT_TIMEOUT_MS = 
TimeUnit.SECONDS.toMillis(30);
{code}
And:
{code}
conf.getTimeDuration(YarnConfiguration.ROUTER_WEBAPP_CONNECT_TIMEOUT_MS,
    YarnConfiguration.DEFAULT_ROUTER_WEBAPP_CONNECT_TIMEOUT_MS,
    TimeUnit.MILLISECONDS);
{code}

> Add timeout to RouterWebServiceUtil#invokeRMWebService
> --
>
> Key: YARN-8529
> URL: https://issues.apache.org/jira/browse/YARN-8529
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Giovanni Matteo Fumarola
>Priority: Major
> Attachments: YARN-8529.v1.patch, YARN-8529.v2.patch
>
>
> {{RouterWebServiceUtil#invokeRMWebService}} currently has a fixed timeout. 
> This should be configurable.






[jira] [Commented] (YARN-8524) Single parameter Resource / LightWeightResource constructor looks confusing

2018-07-17 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547012#comment-16547012
 ] 

Szilard Nemeth commented on YARN-8524:
--

Thanks [~leftnoteasy]!

> Single parameter Resource / LightWeightResource constructor looks confusing
> ---
>
> Key: YARN-8524
> URL: https://issues.apache.org/jira/browse/YARN-8524
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8524.001.patch, YARN-8524.002.patch, 
> YARN-8524.003.patch
>
>
> The single parameter (long) constructor in Resource / LightWeightResource 
> sets all resource components to the same value.
> Since there are other constructors in these classes with (long, int) 
> parameters where the semantics are different, it could be confusing for the 
> users.
> The perfect place to create such a resource would be in the Resources class, 
> with a method named like "createResourceWithSameValue".
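
A sketch of the suggested helper (the method name comes from the description 
above; the body is illustrative, not the committed patch):
{code:java}
// Assumed to live in org.apache.hadoop.yarn.util.resource.Resources.
public static Resource createResourceWithSameValue(long value) {
  Resource res = Resource.newInstance(value, (int) value);
  // Set every additional registered resource type (beyond memory and
  // vcores, which occupy the first two indices) to the same value.
  int knownTypes = ResourceUtils.getNumberOfKnownResourceTypes();
  for (int i = 2; i < knownTypes; i++) {
    res.setResourceValue(i, value);
  }
  return res;
}
{code}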






[jira] [Commented] (YARN-8501) Reduce complexity of RMWebServices' getApps method

2018-07-17 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547009#comment-16547009
 ] 

Szilard Nemeth commented on YARN-8501:
--

Hi [~eyang]!
Good catch, uploaded a new patch that fixes the name of the test class.
Please check again, thanks!

> Reduce complexity of RMWebServices' getApps method
> --
>
> Key: YARN-8501
> URL: https://issues.apache.org/jira/browse/YARN-8501
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: restapi
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8501.001.patch, YARN-8501.002.patch, 
> YARN-8501.003.patch, YARN-8501.004.patch, YARN-8501.005.patch
>
>







[jira] [Commented] (YARN-7974) Allow updating application tracking url after registration

2018-07-17 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547004#comment-16547004
 ] 

genericqa commented on YARN-7974:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 13m  
3s{color} | {color:red} Docker failed to build yetus/hadoop:abb62dd. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-7974 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12931980/YARN-7974.006.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21273/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Allow updating application tracking url after registration
> --
>
> Key: YARN-7974
> URL: https://issues.apache.org/jira/browse/YARN-7974
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-7974.001.patch, YARN-7974.002.patch, 
> YARN-7974.003.patch, YARN-7974.004.patch, YARN-7974.005.patch, 
> YARN-7974.006.patch
>
>
> Normally an application's tracking url is set on AM registration. We have a 
> use case for updating the tracking url after registration (e.g. the UI is 
> hosted on one of the containers).
> The approach is for the AM to update the tracking url on its heartbeat to the 
> RM, and to add a related API in AMRMClient.
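
A hedged sketch of how such an API might be used (updateTrackingUrl follows the 
proposal in this JIRA and is not an existing released method):
{code:java}
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TrackingUrlSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> client = AMRMClient.createAMRMClient();
    client.init(new YarnConfiguration());
    client.start();
    client.registerApplicationMaster("am-host", 0, "http://initial-url");
    // Later, once the UI container is up: per the approach above, the new
    // URL would be carried to the RM on the next allocate() heartbeat.
    client.updateTrackingUrl("http://ui-container-host:8080/");
    client.stop();
  }
}
{code}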






[jira] [Updated] (YARN-8501) Reduce complexity of RMWebServices' getApps method

2018-07-17 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-8501:
-
Attachment: YARN-8501.005.patch

> Reduce complexity of RMWebServices' getApps method
> --
>
> Key: YARN-8501
> URL: https://issues.apache.org/jira/browse/YARN-8501
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: restapi
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8501.001.patch, YARN-8501.002.patch, 
> YARN-8501.003.patch, YARN-8501.004.patch, YARN-8501.005.patch
>
>







[jira] [Commented] (YARN-8544) [DS] AM registration fails when hadoop authorization is enabled

2018-07-17 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546997#comment-16546997
 ] 

genericqa commented on YARN-8544:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  0m  
8s{color} | {color:red} Docker failed to build yetus/hadoop:abb62dd. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-8544 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12931951/YARN-8544.001.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21276/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> [DS] AM registration fails when hadoop authorization is enabled
> ---
>
> Key: YARN-8544
> URL: https://issues.apache.org/jira/browse/YARN-8544
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8544.001.patch
>
>
> Application master fails to register when hadoop authorization is enabled.
> DistributedSchedulingAMProtocol connection authorization fails at the RM side.
> Issue credits: [~BilwaST]






[jira] [Commented] (YARN-7129) Application Catalog for YARN applications

2018-07-17 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546995#comment-16546995
 ] 

genericqa commented on YARN-7129:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  0m  
8s{color} | {color:red} Docker failed to build yetus/hadoop:abb62dd. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-7129 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12931952/YARN-7129.011.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21274/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Application Catalog for YARN applications
> -
>
> Key: YARN-7129
> URL: https://issues.apache.org/jira/browse/YARN-7129
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: applications
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN Appstore.pdf, YARN-7129.001.patch, 
> YARN-7129.002.patch, YARN-7129.003.patch, YARN-7129.004.patch, 
> YARN-7129.005.patch, YARN-7129.006.patch, YARN-7129.007.patch, 
> YARN-7129.008.patch, YARN-7129.009.patch, YARN-7129.010.patch, 
> YARN-7129.011.patch
>
>
> YARN native services provides a web services API to improve the usability of 
> application deployment on Hadoop using collections of Docker images. It would 
> be nice to have an application catalog system which provides an editorial and 
> search interface for YARN applications. This improves the usability of YARN 
> for managing the life cycle of applications.






[jira] [Commented] (YARN-7300) DiskValidator is not used in LocalDirAllocator

2018-07-17 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546992#comment-16546992
 ] 

genericqa commented on YARN-7300:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  0m  
8s{color} | {color:red} Docker failed to build yetus/hadoop:abb62dd. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-7300 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12931982/YARN-7300.002.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21275/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> DiskValidator is not used in LocalDirAllocator
> --
>
> Key: YARN-7300
> URL: https://issues.apache.org/jira/browse/YARN-7300
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Haibo Chen
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-7300.001.patch, YARN-7300.002.patch
>
>
> HADOOP-13254 introduced a pluggable disk validator to replace 
> DiskChecker.checkDir(). However, LocalDirAllocator still references the old 
> DiskChecker.checkDir(). It'd be nice to
> use the plugin uniformly so that user configurations take effect in all 
> places.






[jira] [Commented] (YARN-8517) getContainer and getContainers ResourceManager REST API methods are not documented

2018-07-17 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546990#comment-16546990
 ] 

Szilard Nemeth commented on YARN-8517:
--

Hi [~bsteinbach]!
Thanks for the patch!
A couple of comments: 
1. Containers for an application attempt API --> Please change to "Containers 
for an Application Attempt API" (use uppercase A letters)
2. Please change "The amount of virtual cores allocated the container" to "The 
amount of virtual cores allocated *for* the container"
3. Please change "The elapsed in ms since the startedTime" to "The elapsed 
*time* in ms since the startedTime"
4. Did you intentionally exclude "diagnosticsInfo" from the field table of 
ContainerInfo? 
5. Please change "Final exit status for the container" to "Final exit status 
*of* the container"
6. The URLs are wrong in the HTTP request section for both introduced API 
methods (both for JSON and XML): "GET 
http://rm-http-address:port/ws/v1/cluster/nodes" is wrong.
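
Presumably the request lines should instead reference the endpoints named in the 
issue description, along the lines of:
{noformat}
GET http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts/{appattemptid}/containers
GET http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}
{noformat}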

> getContainer and getContainers ResourceManager REST API methods are not 
> documented
> --
>
> Key: YARN-8517
> URL: https://issues.apache.org/jira/browse/YARN-8517
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Szilard Nemeth
>Assignee: Antal Bálint Steinbach
>Priority: Major
>  Labels: newbie, newbie++
> Attachments: YARN-8517.001.patch
>
>
> Looking at the documentation here: 
> https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
> I cannot find documentation for 2 RM REST endpoints: 
> - /apps/\{appid\}/appattempts/\{appattemptid\}/containers
> - /apps/\{appid\}/appattempts/\{appattemptid\}/containers/\{containerid\}
> I suppose they are not intentionally undocumented.






[jira] [Commented] (YARN-8529) Add timeout to RouterWebServiceUtil#invokeRMWebService

2018-07-17 Thread Giovanni Matteo Fumarola (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546988#comment-16546988
 ] 

Giovanni Matteo Fumarola commented on YARN-8529:


Attached v2. It fixes the FindBugs warning. 
We can't use Configuration#setTimeDuration since I am keeping the default 
timeout for the unit tests.

> Add timeout to RouterWebServiceUtil#invokeRMWebService
> --
>
> Key: YARN-8529
> URL: https://issues.apache.org/jira/browse/YARN-8529
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Giovanni Matteo Fumarola
>Priority: Major
> Attachments: YARN-8529.v1.patch, YARN-8529.v2.patch
>
>
> {{RouterWebServiceUtil#invokeRMWebService}} currently has a fixed timeout. 
> This should be configurable.






[jira] [Updated] (YARN-8529) Add timeout to RouterWebServiceUtil#invokeRMWebService

2018-07-17 Thread Giovanni Matteo Fumarola (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giovanni Matteo Fumarola updated YARN-8529:
---
Attachment: YARN-8529.v2.patch

> Add timeout to RouterWebServiceUtil#invokeRMWebService
> --
>
> Key: YARN-8529
> URL: https://issues.apache.org/jira/browse/YARN-8529
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Giovanni Matteo Fumarola
>Priority: Major
> Attachments: YARN-8529.v1.patch, YARN-8529.v2.patch
>
>
> {{RouterWebServiceUtil#invokeRMWebService}} currently has a fixed timeout. 
> This should be configurable.






[jira] [Updated] (YARN-8545) YARN native service should return container if launch failed

2018-07-17 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8545:
-
Description: 
In some cases, container launch may fail but the container will not be properly 
returned to the RM. 

This could happen when the AM tries to prepare the container launch context but 
fails without sending it to the NM (once the launch context is sent to the NM, 
the NM will report the failed container to the RM).

Exception like: 
{code:java}
java.io.FileNotFoundException: File does not exist: 
hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at 
org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
at 
org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
at 
org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
at 
org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745){code}
And even after the container launch context preparation failed, the AM still 
tries to monitor the container's readiness:
{code:java}
2018-07-17 18:42:57,518 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 
18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP 
presence", exception="java.io.IOException: primary-worker-0: IP is not 
available yet"

...{code}

  was:
In some cases, container launch may fail but the container will not be properly 
returned to the RM. 

This could happen when the AM tries to prepare the container launch context but 
fails without sending it to the NM (once the launch context is sent to the NM, 
the NM will report the failed container to the RM).

Exception like: 
{code:java}
java.io.FileNotFoundException: File does not exist: 
hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at 
org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
at 
org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
at 
org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
at 
org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745){code}


> YARN native service should return container if launch failed
> 
>
> Key: YARN-8545
> URL: https://issues.apache.org/jira/browse/YARN-8545
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Critical
>
> In some cases, container launch may fail but the container will not be 
> properly returned to the RM. 
> This could happen when the AM tries to prepare the container launch context 
> but fails without sending it to the NM (once the launch context is sent to 
> the NM, the NM will report the failed container to the RM).
> Exception like: 
> {code:java}
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
>   at 
> 

[jira] [Created] (YARN-8545) YARN native service should return container if launch failed

2018-07-17 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8545:


 Summary: YARN native service should return container if launch 
failed
 Key: YARN-8545
 URL: https://issues.apache.org/jira/browse/YARN-8545
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan


In some cases, container launch may fail but the container will not be properly 
returned to the RM. 

This could happen when the AM tries to prepare the container launch context but 
fails without sending it to the NM (once the launch context is sent to the NM, 
the NM will report the failed container to the RM).

Exception like: 
{code:java}
java.io.FileNotFoundException: File does not exist: 
hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at 
org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
at 
org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
at 
org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
at 
org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745){code}
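
A rough sketch of the idea in the summary (assumed names; only 
buildContainerLaunchContext comes from the stack trace above):
{code:java}
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

// Illustrative fragment: if building the launch context throws before
// anything reaches the NM, the AM should hand the container back to the
// RM instead of leaving it allocated but never launched.
class LaunchSketch {
  private NMClient nmClient;
  private AMRMClientAsync<?> amRMClient;

  void launch(Container container) {
    try {
      ContainerLaunchContext ctx = buildLaunchContext(container);
      nmClient.startContainer(container, ctx);
    } catch (Exception e) {
      // The NM never saw this container, so it will not report the
      // failure to the RM; release the container explicitly.
      amRMClient.releaseAssignedContainer(container.getId());
    }
  }

  // Stand-in for AbstractProviderService#buildContainerLaunchContext,
  // which failed with FileNotFoundException in the trace above.
  private ContainerLaunchContext buildLaunchContext(Container c)
      throws Exception {
    throw new java.io.FileNotFoundException("run-PRIMARY_WORKER.sh");
  }
}
{code}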






[jira] [Commented] (YARN-7300) DiskValidator is not used in LocalDirAllocator

2018-07-17 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546962#comment-16546962
 ] 

Szilard Nemeth commented on YARN-7300:
--

Hi [~haibochen]!
Thanks for your review comments.
1. I cannot use YarnRuntimeException, as LocalDirAllocator is part of 
hadoop-common and YARN is not a dependency of that module.
2. Good idea; I introduced a new constructor that takes the disk validator as a 
dependency. LocalDirsHandlerService is the only user of this new constructor, as 
it can read the disk validator config (YarnConfiguration.DISK_VALIDATOR) and 
instantiate an appropriate validator. The rest of the callers of the existing 
constructor remain the same.
3. Fair enough; it's even better to refer to the field.

Please check my new patch!
Thanks!
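
Roughly the shape of point 2 (signatures illustrative, not the actual patch):
{code:java}
import org.apache.hadoop.util.DiskValidator;

// Sketch of LocalDirAllocator in hadoop-common with the injected validator.
public class LocalDirAllocator {
  private final String contextCfgItemName;
  private final DiskValidator diskValidator; // null => legacy DiskChecker path

  // Existing callers are untouched and keep today's behavior.
  public LocalDirAllocator(String contextCfgItemName) {
    this(contextCfgItemName, null);
  }

  // New constructor: LocalDirsHandlerService reads the
  // YarnConfiguration.DISK_VALIDATOR config, instantiates the matching
  // validator, and injects it here.
  public LocalDirAllocator(String contextCfgItemName,
      DiskValidator diskValidator) {
    this.contextCfgItemName = contextCfgItemName;
    this.diskValidator = diskValidator;
  }
}
{code}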

> DiskValidator is not used in LocalDirAllocator
> --
>
> Key: YARN-7300
> URL: https://issues.apache.org/jira/browse/YARN-7300
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Haibo Chen
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-7300.001.patch, YARN-7300.002.patch
>
>
> HADOOP-13254 introduced a pluggable disk validator to replace 
> DiskChecker.checkDir(). However, LocalDirAllocator still references the old 
> DiskChecker.checkDir(). It'd be nice to
> use the plugin uniformly so that user configurations take effect in all 
> places.






[jira] [Updated] (YARN-7300) DiskValidator is not used in LocalDirAllocator

2018-07-17 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-7300:
-
Attachment: YARN-7300.002.patch

> DiskValidator is not used in LocalDirAllocator
> --
>
> Key: YARN-7300
> URL: https://issues.apache.org/jira/browse/YARN-7300
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Haibo Chen
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-7300.001.patch, YARN-7300.002.patch
>
>
> HADOOP-13254 introduced a pluggable disk validator to replace 
> DiskChecker.checkDir(). However, LocalDirAllocator still references the old 
> DiskChecker.checkDir(). It'd be nice to
> use the plugin uniformly so that user configurations take effect in all 
> places.






[jira] [Commented] (YARN-8482) [Router] Add cache service for fast answers to getApps

2018-07-17 Thread Giovanni Matteo Fumarola (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546945#comment-16546945
 ] 

Giovanni Matteo Fumarola commented on YARN-8482:


Thanks [~Dillon.] for your interest. However, we already have a working 
prototype.

I will assign the Jira to myself to avoid additional confusion.

> [Router] Add cache service for fast answers to getApps
> --
>
> Key: YARN-8482
> URL: https://issues.apache.org/jira/browse/YARN-8482
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Priority: Major
>







[jira] [Assigned] (YARN-8482) [Router] Add cache service for fast answers to getApps

2018-07-17 Thread Giovanni Matteo Fumarola (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giovanni Matteo Fumarola reassigned YARN-8482:
--

Assignee: Giovanni Matteo Fumarola

> [Router] Add cache service for fast answers to getApps
> --
>
> Key: YARN-8482
> URL: https://issues.apache.org/jira/browse/YARN-8482
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Major
>







[jira] [Commented] (YARN-7974) Allow updating application tracking url after registration

2018-07-17 Thread Jonathan Hung (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546903#comment-16546903
 ] 

Jonathan Hung commented on YARN-7974:
-

Thanks [~leftnoteasy], attached 006 patch addressing these.

> Allow updating application tracking url after registration
> --
>
> Key: YARN-7974
> URL: https://issues.apache.org/jira/browse/YARN-7974
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-7974.001.patch, YARN-7974.002.patch, 
> YARN-7974.003.patch, YARN-7974.004.patch, YARN-7974.005.patch, 
> YARN-7974.006.patch
>
>
> Normally an application's tracking url is set on AM registration. We have a 
> use case for updating the tracking url after registration (e.g. the UI is 
> hosted on one of the containers).
> The approach is for the AM to update the tracking url on its heartbeat to the 
> RM, and to add a related API in AMRMClient.






[jira] [Updated] (YARN-7974) Allow updating application tracking url after registration

2018-07-17 Thread Jonathan Hung (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-7974:

Attachment: YARN-7974.006.patch

> Allow updating application tracking url after registration
> --
>
> Key: YARN-7974
> URL: https://issues.apache.org/jira/browse/YARN-7974
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-7974.001.patch, YARN-7974.002.patch, 
> YARN-7974.003.patch, YARN-7974.004.patch, YARN-7974.005.patch, 
> YARN-7974.006.patch
>
>
> Normally an application's tracking url is set on AM registration. We have a 
> use case for updating the tracking url after registration (e.g. the UI is 
> hosted on one of the containers).
> The approach is for the AM to update the tracking url on its heartbeat to the 
> RM, and to add a related API in AMRMClient.






[jira] [Updated] (YARN-7974) Allow updating application tracking url after registration

2018-07-17 Thread Jonathan Hung (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-7974:

Attachment: (was: YARN-7974.006.patch)

> Allow updating application tracking url after registration
> --
>
> Key: YARN-7974
> URL: https://issues.apache.org/jira/browse/YARN-7974
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-7974.001.patch, YARN-7974.002.patch, 
> YARN-7974.003.patch, YARN-7974.004.patch, YARN-7974.005.patch
>
>
> Normally an application's tracking url is set on AM registration. We have a 
> use case for updating the tracking url after registration (e.g. the UI is 
> hosted on one of the containers).
> The approach is for the AM to update the tracking url on its heartbeat to the 
> RM, and to add a related API in AMRMClient.






[jira] [Commented] (YARN-8498) Yarn NodeManager OOM Listener Fails Compilation on Ubuntu 18.04

2018-07-17 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546889#comment-16546889
 ] 

Haibo Chen commented on YARN-8498:
--

My apologies for the late response. I'll take a look at this later today.

> Yarn NodeManager OOM Listener Fails Compilation on Ubuntu 18.04
> ---
>
> Key: YARN-8498
> URL: https://issues.apache.org/jira/browse/YARN-8498
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jack Bearden
>Priority: Blocker
>  Labels: trunk
>
> While building this project, I ran into a few compilation errors here. The 
> first one was in this file:
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/oom-listener/impl/oom_listener_main.c
> At the very end, during the compilation of the OOM test, it fails again:
>  
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/oom-listener/test/oom_listener_test_main.cc
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/oom-listener/test/oom_listener_test_main.cc:256:7:
>  error: ‘__WAIT_STATUS’ was not declared in this scope
>  __WAIT_STATUS mem_hog_status = {};
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/oom-listener/test/oom_listener_test_main.cc:257:30:
>  error: ‘mem_hog_status’ was not declared in this scope
>  __pid_t exited0 = wait(mem_hog_status);
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/oom-listener/test/oom_listener_test_main.cc:275:21:
>  error: expected ‘;’ before ‘oom_listener_status’
>  __WAIT_STATUS oom_listener_status = {};
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/oom-listener/test/oom_listener_test_main.cc:276:30:
>  error: ‘oom_listener_status’ was not declared in this scope
>  __pid_t exited1 = wait(oom_listener_status);
>  






[jira] [Updated] (YARN-7974) Allow updating application tracking url after registration

2018-07-17 Thread Jonathan Hung (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-7974:

Attachment: YARN-7974.006.patch

> Allow updating application tracking url after registration
> --
>
> Key: YARN-7974
> URL: https://issues.apache.org/jira/browse/YARN-7974
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-7974.001.patch, YARN-7974.002.patch, 
> YARN-7974.003.patch, YARN-7974.004.patch, YARN-7974.005.patch, 
> YARN-7974.006.patch
>
>
> Normally an application's tracking url is set on AM registration. We have a 
> use case for updating the tracking url after registration (e.g. the UI is 
> hosted on one of the containers).
> The approach is for the AM to update the tracking url on its heartbeat to the 
> RM, and to add a related API in AMRMClient.
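
For illustration, a minimal sketch of how an AM might use such an API (the 
method name is taken from the proposal; the exact signature and surrounding 
setup are assumptions):

{code}
// Sketch only: an AM updating its tracking url after registration, assuming
// the proposed AMRMClient#updateTrackingUrl(String) API from this JIRA.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class TrackingUrlUpdateSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();
    // Register with a placeholder url; the real UI is not up yet.
    rmClient.registerApplicationMaster("am-host", 0, "http://placeholder");

    // ... later, once the UI container is running, repoint the RM; the new
    // url would be carried to the RM on the next allocate() heartbeat.
    rmClient.updateTrackingUrl("http://ui-container-host:8080");
  }
}
{code}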






[jira] [Updated] (YARN-8544) [DS] AM registration fails when hadoop authorization is enabled

2018-07-17 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-8544:
---
Summary: [DS] AM registration fails when hadoop authorization is enabled  
(was: Distributed scheduling AM registration fails when hadoop authorization is 
enabled)

> [DS] AM registration fails when hadoop authorization is enabled
> ---
>
> Key: YARN-8544
> URL: https://issues.apache.org/jira/browse/YARN-8544
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8544.001.patch
>
>
> Application master fails to register when hadoop authorization is enabled.
> DistributedSchedulingAMProtocol connection authorization fails at the RM side.
> Issue credits: [~BilwaST]






[jira] [Commented] (YARN-7590) Improve container-executor validation check

2018-07-17 Thread Eric Badger (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546828#comment-16546828
 ] 

Eric Badger commented on YARN-7590:
---

Is your NM running as root? 

{noformat}
  if (caller_uid != info.st_uid) {
    fprintf(LOGFILE, "Permission mismatch for %s for caller uid: %d, owner uid: %d.\n",
            nm_root, caller_uid, info.st_uid);
    return 1;
  }
{noformat}
Looks like you're running into this error, with caller_uid set to 0. 
caller_uid is the first argument to check_nm_local_dir, which is always called 
with nm_uid as its first argument. So to me it looks like the NM is being run 
as root.
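
The rule being enforced is simply "the caller must own the directory". As a 
language-neutral illustration of that check (the real implementation is the C 
snippet above; the path below is just an example):

{code}
// Illustration only, not YARN code: the "caller must own the directory"
// rule that check_nm_local_dir enforces, sketched with java.nio.
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class OwnerCheckSketch {
  static boolean ownedBy(Path dir, String expectedOwner) throws Exception {
    // Compare the directory's owner against the expected (NM) user.
    return Files.getOwner(dir).getName().equals(expectedOwner);
  }

  public static void main(String[] args) throws Exception {
    Path nmLocal = Paths.get("/hadoop-data/nm-local-dirs"); // example path
    if (!ownedBy(nmLocal, "yarn")) {
      System.err.println("Permission mismatch for " + nmLocal);
    }
  }
}
{code}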

> Improve container-executor validation check
> ---
>
> Key: YARN-7590
> URL: https://issues.apache.org/jira/browse/YARN-7590
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: security, yarn
>Affects Versions: 2.0.1-alpha, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0, 
> 2.8.0, 2.8.1, 3.0.0-beta1
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Fix For: 2.6.6, 3.1.0, 2.10.0, 2.9.1, 3.0.1, 2.8.4, 2.7.6
>
> Attachments: YARN-7590.001.patch, YARN-7590.002.patch, 
> YARN-7590.003.patch, YARN-7590.004.patch, YARN-7590.005.patch, 
> YARN-7590.006.patch, YARN-7590.007.patch, YARN-7590.008.patch, 
> YARN-7590.009.patch, YARN-7590.010.patch, YARN-7590.branch-2.000.patch, 
> YARN-7590.branch-2.6.000.patch, YARN-7590.branch-2.7.000.patch, 
> YARN-7590.branch-2.8.000.patch, YARN-7590.branch-2.9.000.patch
>
>
> There is only minimal checking of the prefix path in container-executor.  If 
> YARN is compromised, an attacker can use container-executor to change the 
> ownership of system files:
> {code}
> /usr/local/hadoop/bin/container-executor spark yarn 0 etc /home/yarn/tokens 
> /home/spark / ls
> {code}
> This will change /etc to be owned by spark user:
> {code}
> # ls -ld /etc
> drwxr-s---. 110 spark hadoop 8192 Nov 21 20:00 /etc
> {code}
> The spark user can then rewrite files under /etc to gain more access.  We can 
> improve this with additional checks in container-executor:
> # Make sure the prefix path is owned by the same user as the caller of 
> container-executor.
> # Make sure the log directory prefix is owned by the same user as the caller.






[jira] [Updated] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration

2018-07-17 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8361:
-
Target Version/s: 3.2.0  (was: 3.2.0, 3.1.1)

> Change App Name Placement Rule to use App Name instead of App Id for 
> configuration
> --
>
> Key: YARN-8361
> URL: https://issues.apache.org/jira/browse/YARN-8361
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8361.001.patch, YARN-8361.002.patch, 
> YARN-8361.003.patch
>
>
> In YARN-8016, we exposed a framework that lets users specify custom placement 
> rules through CS configuration, and also added a new placement rule that maps 
> specific apps to queues. However, the strategy implemented in YARN-8016 used 
> the application id, which makes the config hard to use. In this JIRA, we 
> change the mapping to use the application name. More specifically: 
> 1. AppNamePlacementRule used the app id when specifying queue mapping 
> placement rules; change it to use the app name.
> 2. Update the documentation as well.






[jira] [Commented] (YARN-8544) Distributed scheduling AM registration fails when hadoop authorization is enabled

2018-07-17 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546774#comment-16546774
 ] 

Bibin A Chundatt commented on YARN-8544:


[~subru]

Could you please help to review this?


> Distributed scheduling AM registration fails when hadoop authorization is 
> enabled
> -
>
> Key: YARN-8544
> URL: https://issues.apache.org/jira/browse/YARN-8544
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8544.001.patch
>
>
> Application master fails to register when hadoop authorization is enabled.
> DistributedSchedulingAMProtocol connection authorization fails at the RM side.
> Issue credits: [~BilwaST]






[jira] [Commented] (YARN-8434) Update federation documentation of Nodemanager configurations

2018-07-17 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546773#comment-16546773
 ] 

Bibin A Chundatt commented on YARN-8434:


Some CI issues across the different runs:

* Protobuf version mismatch
* Docker failed to build the image.

Will try again later today.

> Update federation documentation of Nodemanager configurations
> -
>
> Key: YARN-8434
> URL: https://issues.apache.org/jira/browse/YARN-8434
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8434-branch-3.0.001.patch, YARN-8434.001.patch, 
> YARN-8434.002.patch, YARN-8434.003.patch
>
>
> FederationRMFailoverProxyProvider doesn't handle connecting to the active RM.






[jira] [Updated] (YARN-7129) Application Catalog for YARN applications

2018-07-17 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-7129:

Attachment: YARN-7129.011.patch

> Application Catalog for YARN applications
> -
>
> Key: YARN-7129
> URL: https://issues.apache.org/jira/browse/YARN-7129
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: applications
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN Appstore.pdf, YARN-7129.001.patch, 
> YARN-7129.002.patch, YARN-7129.003.patch, YARN-7129.004.patch, 
> YARN-7129.005.patch, YARN-7129.006.patch, YARN-7129.007.patch, 
> YARN-7129.008.patch, YARN-7129.009.patch, YARN-7129.010.patch, 
> YARN-7129.011.patch
>
>
> YARN native services provides a web services API to improve the usability of 
> application deployment on Hadoop using collections of docker images.  It 
> would be nice to have an application catalog system which provides an 
> editorial and search interface for YARN applications.  This would improve the 
> usability of YARN for managing the life cycle of applications.






[jira] [Updated] (YARN-8544) Distributed scheduling AM registration fails when hadoop authorization is enabled

2018-07-17 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-8544:
---
Attachment: YARN-8544.001.patch

> Distributed scheduling AM registration fails when hadoop authorization is 
> enabled
> -
>
> Key: YARN-8544
> URL: https://issues.apache.org/jira/browse/YARN-8544
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8544.001.patch
>
>
> Application master fails to register when hadoop authorization is enabled.
> DistributedSchedulingAMProtocol connection authorization fails at the RM side.
> Issue credits: [~BilwaST]






[jira] [Updated] (YARN-8544) Distributed scheduling AM registration fails when hadoop authorization is enabled

2018-07-17 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-8544:
---
Target Version/s: 3.1.1

> Distributed scheduling AM registration fails when hadoop authorization is 
> enabled
> -
>
> Key: YARN-8544
> URL: https://issues.apache.org/jira/browse/YARN-8544
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>
> Application master fails to register when hadoop authorization is enabled.
> DistributedSchedulingAMProtocol connection authorization fails at the RM side.
> Issue credits: [~BilwaST]






[jira] [Created] (YARN-8544) Distributed scheduling AM registration fails when hadoop authorization is enabled

2018-07-17 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-8544:
--

 Summary: Distributed scheduling AM registration fails when hadoop 
authorization is enabled
 Key: YARN-8544
 URL: https://issues.apache.org/jira/browse/YARN-8544
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


Application master fails to register when hadoop authorization is enabled.

DistributedSchedulingAMProtocol connection authorization fails at the RM side.
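
For context, hadoop service-level authorization only admits RPC protocols that 
are registered with a policy provider, so a protocol missing from that list 
fails connection authorization. A hedged sketch of that mechanism (the 
DistributedSchedulingAMProtocolPB entry and the exact shape of the fix are 
assumptions):

{code}
// Sketch only: service-level authorization maps each RPC protocol to an
// ACL key via a PolicyProvider; a protocol absent from this list is
// rejected during connection authorization.
import org.apache.hadoop.security.authorize.PolicyProvider;
import org.apache.hadoop.security.authorize.Service;
import org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB;

public class ExamplePolicyProvider extends PolicyProvider {
  @Override
  public Service[] getServices() {
    return new Service[] {
        new Service("security.applicationmaster.protocol.acl",
            ApplicationMasterProtocolPB.class)
        // ... a fix would presumably add a similar entry for
        // DistributedSchedulingAMProtocolPB
    };
  }
}
{code}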







[jira] [Updated] (YARN-8544) Distributed scheduling AM registration fails when hadoop authorization is enabled

2018-07-17 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-8544:
---
Description: 
Application master fails to register when hadoop authorization is enabled.

DistributedSchedulingAMProtocol connection authorization fails at the RM side.

Issue credits: [~BilwaST]

  was:
Application master fails to register when hadoop authorization is enabled.

DistributedSchedulingAMProtocol connection authorization fails at the RM side.



> Distributed scheduling AM registration fails when hadoop authorization is 
> enabled
> -
>
> Key: YARN-8544
> URL: https://issues.apache.org/jira/browse/YARN-8544
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>
> Application master fails to register when hadoop authorization is enabled.
> DistributedSchedulingAMProtocol connection authorization fails at the RM side.
> Issue credits: [~BilwaST]






[jira] [Updated] (YARN-8543) Prometheus /metrics http endpoint for monitoring integration

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated YARN-8543:
--
Summary: Prometheus /metrics http endpoint for monitoring integration  
(was: Prometheus /metrics http endpoint for metrics monitoring integration)

> Prometheus /metrics http endpoint for monitoring integration
> 
>
> Key: YARN-8543
> URL: https://issues.apache.org/jira/browse/YARN-8543
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Hari Sekhon
>Priority: Major
>
> Feature Request to add Prometheus /metrics http endpoint for monitoring 
> integration:
> https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cscrape_config%3E
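
For illustration, Prometheus scrapes a plain-text exposition format over HTTP. 
A minimal, self-contained sketch of such an endpoint (not a proposed 
implementation; the port and metric names are invented):

{code}
// Sketch only: serve a /metrics endpoint in the Prometheus text exposition
// format using the JDK's built-in HTTP server.
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MetricsEndpointSketch {
  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(9090), 0);
    server.createContext("/metrics", exchange -> {
      // Invented gauge, just to show the exposition format.
      String body = "# TYPE yarn_allocated_vcores gauge\n"
          + "yarn_allocated_vcores 11\n";
      byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
      exchange.getResponseHeaders()
          .set("Content-Type", "text/plain; version=0.0.4");
      exchange.sendResponseHeaders(200, bytes.length);
      try (OutputStream os = exchange.getResponseBody()) {
        os.write(bytes);
      }
    });
    server.start();
  }
}
{code}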






[jira] [Commented] (YARN-8536) Add max heap config option for Federation Router

2018-07-17 Thread Allen Wittenauer (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546654#comment-16546654
 ] 

Allen Wittenauer commented on YARN-8536:


bq.  friendly to legacy scripts.

Adding a new var doesn't help legacy scripts, since the var didn't exist before 
for them to use.

bq. Shouldn't hurt right?

It does in a variety of ways:

1)  Done properly, every config variable adds at least 10 lines of bash and 5 
lines of DOS batch.  (and that's not counting src/site documentation, assuming 
that contributors even bother).  That makes it a long-term support burden for 
just a bit of syntactic sugar. 

2) There is already _OPTS to tune JVMs.  If _HEAPSIZE is used and _OPTS is 
used, where should the Xmx value come from?  Prior to the work I did in 
HADOOP-9902,  this wasn't implemented consistently nor was it obvious to the 
end user which one took precedence. 

3) This is a slippery slope.  Why should Xmx be the only JVM param with a 
custom variable?

4) Before it gets said, I don't buy the "easier for end users" argument either. 
 In production scenarios, daemons almost always need additional parameters 
above and beyond heap (gc logging, etc).  So _OPTS gets defined anyway.  

Long-term, we'd be better served to remove the _HEAPSIZE variables and to 
standardize on _OPTS.  It would greatly cut back on a lot of excess code and 
make it absolutely clear to users that _OPTS is where all JVM tuning should go.
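
For instance, under the _OPTS-only scheme described above, all Router JVM 
tuning would live in a single variable (the variable name below is assumed for 
illustration; actual names depend on the daemon):

{noformat}
# hadoop-env.sh sketch: heap and GC flags together in one _OPTS variable,
# with no separate _HEAPSIZE knob.
export YARN_ROUTER_OPTS="-Xmx4g -Xms4g -Xloggc:/var/log/hadoop/router-gc.log"
{noformat}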

> Add max heap config option for Federation Router
> 
>
> Key: YARN-8536
> URL: https://issues.apache.org/jira/browse/YARN-8536
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
> Attachments: YARN-8536.v1.patch
>
>







[jira] [Created] (YARN-8543) Prometheus /metrics http endpoint for metrics monitoring integration

2018-07-17 Thread Hari Sekhon (JIRA)
Hari Sekhon created YARN-8543:
-

 Summary: Prometheus /metrics http endpoint for metrics monitoring 
integration
 Key: YARN-8543
 URL: https://issues.apache.org/jira/browse/YARN-8543
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Hari Sekhon


Feature Request to add Prometheus /metrics http endpoint for monitoring 
integration:

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cscrape_config%3E






[jira] [Commented] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart

2018-07-17 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546599#comment-16546599
 ] 

Szilard Nemeth commented on YARN-6966:
--

Looks like we had global build issues, as another patch on one of my jiras 
also hit a docker error.

> NodeManager metrics may return wrong negative values when NM restart
> 
>
> Key: YARN-6966
> URL: https://issues.apache.org/jira/browse/YARN-6966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-6966.001.patch, YARN-6966.002.patch, 
> YARN-6966.003.patch, YARN-6966.004.patch, YARN-6966.005.patch
>
>
> Just as YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of the negative values is that metrics do not recover 
> properly when the NM restarts.
> AllocatedContainers, ContainersLaunched, AllocatedGB, AvailableGB, 
> AllocatedVCores and AvailableVCores in metrics also need to be recovered 
> when the NM restarts.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced by the following steps:
> # Make sure 
> YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true
>  in NM
> # Submit an application and keep running
> # Restart NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
> name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
> modelerType: "NodeManagerMetrics",
> tag.Context: "yarn",
> tag.Hostname: "hadoop.com",
> ContainersLaunched: 0,
> ContainersCompleted: 0,
> ContainersFailed: 2,
> ContainersKilled: 0,
> ContainersIniting: 0,
> ContainersRunning: 0,
> AllocatedGB: 0,
> AllocatedContainers: -2,
> AvailableGB: 160,
> AllocatedVCores: -11,
> AvailableVCores: 3611,
> ContainerLaunchDurationNumOps: 2,
> ContainerLaunchDurationAvgTime: 6,
> BadLocalDirs: 0,
> BadLogDirs: 0,
> GoodLocalDirsDiskUtilizationPerc: 2,
> GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}
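
To make the failure mode concrete, here is a self-contained sketch of the 
gauge arithmetic (simplified counters, not the real NodeManagerMetrics class):

{code}
// Sketch only: why an un-recovered allocation gauge goes negative. The NM
// resets metrics to zero on restart; if recovery does not re-apply live
// containers, the later release drives the gauge below zero.
public class MetricsRecoverySketch {
  static int allocatedContainers = 0;   // stand-in for the real gauge

  public static void main(String[] args) {
    allocatedContainers++;              // container allocated before restart
    allocatedContainers = 0;            // NM restart: metrics reset
    allocatedContainers--;              // app stops -> AllocatedContainers: -1
    System.out.println("without recovery: " + allocatedContainers);

    allocatedContainers = 0;            // NM restart again
    allocatedContainers++;              // recovery re-applies the allocation
    allocatedContainers--;              // app stops -> back to 0
    System.out.println("with recovery: " + allocatedContainers);
  }
}
{code}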






[jira] [Commented] (YARN-6995) Improve use of ResourceNotFoundException in resource types code

2018-07-17 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546592#comment-16546592
 ] 

Szilard Nemeth commented on YARN-6995:
--

Hi [~haibochen]!
Could you please retrigger the build, as I don't have permission to do that?
The docker error suggests we had some environment issues while building.
Thanks!

> Improve use of ResourceNotFoundException in resource types code
> ---
>
> Key: YARN-6995
> URL: https://issues.apache.org/jira/browse/YARN-6995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-6995.005.patch, YARN-6995.006.patch, 
> YARN-6995.007.patch, YARN-6995.008.patch, YARN-6995.YARN-3926.001.patch, 
> YARN-6995.YARN-3926.002.patch, YARN-6995.YARN-3926.003.patch, 
> YARN-6995.YARN-3926.004.patch
>
>
> Now that all the YarnExceptions have been replaced with 
> ResourceNotFoundExceptions, we should make the ResourceNotFoundExceptions as 
> useful as possible.






[jira] [Commented] (YARN-6995) Improve use of ResourceNotFoundException in resource types code

2018-07-17 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546473#comment-16546473
 ] 

genericqa commented on YARN-6995:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 12m 37s{color} | {color:red} Docker failed to build yetus/hadoop:abb62dd. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-6995 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931927/YARN-6995.008.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21272/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Improve use of ResourceNotFoundException in resource types code
> ---
>
> Key: YARN-6995
> URL: https://issues.apache.org/jira/browse/YARN-6995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-6995.005.patch, YARN-6995.006.patch, 
> YARN-6995.007.patch, YARN-6995.008.patch, YARN-6995.YARN-3926.001.patch, 
> YARN-6995.YARN-3926.002.patch, YARN-6995.YARN-3926.003.patch, 
> YARN-6995.YARN-3926.004.patch
>
>
> Now that all the YarnExceptions have been replaced with 
> ResourceNotFoundExceptions, we should make the ResourceNotFoundExceptions as 
> useful as possible.






[jira] [Commented] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart

2018-07-17 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546457#comment-16546457
 ] 

genericqa commented on YARN-6966:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 0m 10s{color} | {color:red} Docker failed to build yetus/hadoop:abb62dd. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-6966 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931926/YARN-6966.005.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21271/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> NodeManager metrics may return wrong negative values when NM restart
> 
>
> Key: YARN-6966
> URL: https://issues.apache.org/jira/browse/YARN-6966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-6966.001.patch, YARN-6966.002.patch, 
> YARN-6966.003.patch, YARN-6966.004.patch, YARN-6966.005.patch
>
>
> Just as YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of the negative values is that metrics do not recover 
> properly when the NM restarts.
> AllocatedContainers, ContainersLaunched, AllocatedGB, AvailableGB, 
> AllocatedVCores and AvailableVCores in metrics also need to be recovered 
> when the NM restarts.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced by the following steps:
> # Make sure 
> YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true
>  in NM
> # Submit an application and keep running
> # Restart NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
> name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
> modelerType: "NodeManagerMetrics",
> tag.Context: "yarn",
> tag.Hostname: "hadoop.com",
> ContainersLaunched: 0,
> ContainersCompleted: 0,
> ContainersFailed: 2,
> ContainersKilled: 0,
> ContainersIniting: 0,
> ContainersRunning: 0,
> AllocatedGB: 0,
> AllocatedContainers: -2,
> AvailableGB: 160,
> AllocatedVCores: -11,
> AvailableVCores: 3611,
> ContainerLaunchDurationNumOps: 2,
> ContainerLaunchDurationAvgTime: 6,
> BadLocalDirs: 0,
> BadLogDirs: 0,
> GoodLocalDirsDiskUtilizationPerc: 2,
> GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}






[jira] [Commented] (YARN-6995) Improve use of ResourceNotFoundException in resource types code

2018-07-17 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546451#comment-16546451
 ] 

Szilard Nemeth commented on YARN-6995:
--

Thanks [~haibochen]!
Uploaded a new patch.
Actually, YARN-8363 upgraded apache commons-lang to 3.7, so I needed to import 
the newer ExceptionUtils class.
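
For reference, the package move behind that import change (commons-lang 2.x 
vs. the commons-lang3 coordinate), with a minimal usage sketch:

{code}
// commons-lang 2.x:  org.apache.commons.lang.exception.ExceptionUtils
// commons-lang3:     org.apache.commons.lang3.exception.ExceptionUtils
import org.apache.commons.lang3.exception.ExceptionUtils;

public class StackTraceSketch {
  public static void main(String[] args) {
    // getStackTrace renders a Throwable's stack trace as a String.
    System.out.println(
        ExceptionUtils.getStackTrace(new RuntimeException("boom")));
  }
}
{code}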

> Improve use of ResourceNotFoundException in resource types code
> ---
>
> Key: YARN-6995
> URL: https://issues.apache.org/jira/browse/YARN-6995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-6995.005.patch, YARN-6995.006.patch, 
> YARN-6995.007.patch, YARN-6995.008.patch, YARN-6995.YARN-3926.001.patch, 
> YARN-6995.YARN-3926.002.patch, YARN-6995.YARN-3926.003.patch, 
> YARN-6995.YARN-3926.004.patch
>
>
> Now that all the YarnExceptions have been replaced with 
> ResourceNotFoundExceptions, we should make the ResourceNotFoundExceptions as 
> useful as possible.






[jira] [Updated] (YARN-6995) Improve use of ResourceNotFoundException in resource types code

2018-07-17 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-6995:
-
Attachment: YARN-6995.008.patch

> Improve use of ResourceNotFoundException in resource types code
> ---
>
> Key: YARN-6995
> URL: https://issues.apache.org/jira/browse/YARN-6995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-6995.005.patch, YARN-6995.006.patch, 
> YARN-6995.007.patch, YARN-6995.008.patch, YARN-6995.YARN-3926.001.patch, 
> YARN-6995.YARN-3926.002.patch, YARN-6995.YARN-3926.003.patch, 
> YARN-6995.YARN-3926.004.patch
>
>
> Now that all the YarnExceptions have been replaced with 
> ResourceNotFoundExceptions, we should make the ResourceNotFoundExceptions as 
> useful as possible.






[jira] [Commented] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart

2018-07-17 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546439#comment-16546439
 ] 

Szilard Nemeth commented on YARN-6966:
--

Hi [~haibochen] and [~rkanter]!
Thanks for taking the time to review; see my latest patch, which fixes the 
issues.
[~haibochen]: Yes, you were right, it wasn't necessary to store the container 
token, as it can be created from the startRequest.
Moved the container recovery logic to the recoverActiveContainer() method.

[~rkanter]: Your first comment no longer applies, as I'm no longer saving the 
container token separately; see my answer to Haibo above.
For the second comment, it is a very good idea to have the negative values in 
the testcase.
I was struggling to reproduce it with a test and ultimately gave up, as it's 
not that straightforward. I hope we can live with this and that you consider 
patch 005 fine even without this kind of testcase. 

> NodeManager metrics may return wrong negative values when NM restart
> 
>
> Key: YARN-6966
> URL: https://issues.apache.org/jira/browse/YARN-6966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-6966.001.patch, YARN-6966.002.patch, 
> YARN-6966.003.patch, YARN-6966.004.patch, YARN-6966.005.patch
>
>
> Just as YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of the negative values is that metrics do not recover 
> properly when the NM restarts.
> AllocatedContainers, ContainersLaunched, AllocatedGB, AvailableGB, 
> AllocatedVCores and AvailableVCores in metrics also need to be recovered 
> when the NM restarts.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced by the following steps:
> # Make sure 
> YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true
>  in NM
> # Submit an application and keep running
> # Restart NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
> name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
> modelerType: "NodeManagerMetrics",
> tag.Context: "yarn",
> tag.Hostname: "hadoop.com",
> ContainersLaunched: 0,
> ContainersCompleted: 0,
> ContainersFailed: 2,
> ContainersKilled: 0,
> ContainersIniting: 0,
> ContainersRunning: 0,
> AllocatedGB: 0,
> AllocatedContainers: -2,
> AvailableGB: 160,
> AllocatedVCores: -11,
> AvailableVCores: 3611,
> ContainerLaunchDurationNumOps: 2,
> ContainerLaunchDurationAvgTime: 6,
> BadLocalDirs: 0,
> BadLogDirs: 0,
> GoodLocalDirsDiskUtilizationPerc: 2,
> GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}






[jira] [Updated] (YARN-8434) Update federation documentation of Nodemanager configurations

2018-07-17 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-8434:
---
Fix Version/s: 3.1.1
   3.2.0

> Update federation documentation of Nodemanager configurations
> -
>
> Key: YARN-8434
> URL: https://issues.apache.org/jira/browse/YARN-8434
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8434-branch-3.0.001.patch, YARN-8434.001.patch, 
> YARN-8434.002.patch, YARN-8434.003.patch
>
>
> FederationRMFailoverProxyProvider doesn't handle connecting to the active RM.






[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart

2018-07-17 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-6966:
-
Attachment: YARN-6966.005.patch

> NodeManager metrics may return wrong negative values when NM restart
> 
>
> Key: YARN-6966
> URL: https://issues.apache.org/jira/browse/YARN-6966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-6966.001.patch, YARN-6966.002.patch, 
> YARN-6966.003.patch, YARN-6966.004.patch, YARN-6966.005.patch
>
>
> Just as YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of the negative values is that metrics do not recover 
> properly when the NM restarts.
> AllocatedContainers, ContainersLaunched, AllocatedGB, AvailableGB, 
> AllocatedVCores and AvailableVCores in metrics also need to be recovered 
> when the NM restarts.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced by the following steps:
> # Make sure 
> YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true
>  in NM
> # Submit an application and keep running
> # Restart NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
> name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
> modelerType: "NodeManagerMetrics",
> tag.Context: "yarn",
> tag.Hostname: "hadoop.com",
> ContainersLaunched: 0,
> ContainersCompleted: 0,
> ContainersFailed: 2,
> ContainersKilled: 0,
> ContainersIniting: 0,
> ContainersRunning: 0,
> AllocatedGB: 0,
> AllocatedContainers: -2,
> AvailableGB: 160,
> AllocatedVCores: -11,
> AvailableVCores: 3611,
> ContainerLaunchDurationNumOps: 2,
> ContainerLaunchDurationAvgTime: 6,
> BadLocalDirs: 0,
> BadLogDirs: 0,
> GoodLocalDirsDiskUtilizationPerc: 2,
> GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}






[jira] [Commented] (YARN-5565) Capacity Scheduler not assigning value correctly.

2018-07-17 Thread gurmukh singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546333#comment-16546333
 ] 

gurmukh singh commented on YARN-5565:
-

Thanks, Zian, for looking into this.

> Capacity Scheduler not assigning value correctly.
> -
>
> Key: YARN-5565
> URL: https://issues.apache.org/jira/browse/YARN-5565
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.7.2
> Environment: hadoop 2.7.2
>Reporter: gurmukh singh
>Assignee: Zian Chen
>Priority: Major
>  Labels: capacity-scheduler, scheduler, yarn
>
> Hi
> I was testing and found out that the value assigned in the scheduler 
> configuration is not consistent with what the ResourceManager assigns.
> I set the configuration as below; I understand that it is a Java float, but 
> the rounding is still not correct.
> capacity-scheduler.xml
> <property>
>   <name>yarn.scheduler.capacity.q1.capacity</name>
>   <value>7.142857142857143</value>
> </property>
> In Java:  System.err.println(7.142857142857143f) ===> 7.142857
> But the ResourceManager is instead assigning 7.1428566.
> Tested this on hadoop 2.7.2.
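
The loss of precision is easy to reproduce in isolation (illustration only):

{code}
// Illustration: 7.142857142857143 is not representable as a float, so the
// stored value differs from the configured one, and different rounding
// paths can surface slightly different decimals.
public class CapacityFloatSketch {
  public static void main(String[] args) {
    float f = 7.142857142857143f;   // value from capacity-scheduler.xml
    System.out.println(f);          // 7.142857 (shortest decimal for the float)
    System.out.println((double) f); // 7.142857074737549 (exact value exposed)
  }
}
{code}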






[jira] [Commented] (YARN-5464) Server-Side NM Graceful Decommissioning with RM HA

2018-07-17 Thread Junping Du (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546329#comment-16546329
 ] 

Junping Du commented on YARN-5464:
--

Sure, unassigning it. Please feel free to take it.

> Server-Side NM Graceful Decommissioning with RM HA
> --
>
> Key: YARN-5464
> URL: https://issues.apache.org/jira/browse/YARN-5464
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful
>Reporter: Robert Kanter
>Assignee: Junping Du
>Priority: Major
> Attachments: YARN-5464.wip.patch
>
>
> Make sure to remove the note added by YARN-7094 about RM HA failover not 
> working right.






[jira] [Assigned] (YARN-5464) Server-Side NM Graceful Decommissioning with RM HA

2018-07-17 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du reassigned YARN-5464:


Assignee: (was: Junping Du)

> Server-Side NM Graceful Decommissioning with RM HA
> --
>
> Key: YARN-5464
> URL: https://issues.apache.org/jira/browse/YARN-5464
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful
>Reporter: Robert Kanter
>Priority: Major
> Attachments: YARN-5464.wip.patch
>
>
> Make sure to remove the note added by YARN-7094 about RM HA failover not 
> working right.






[jira] [Commented] (YARN-5464) Server-Side NM Graceful Decommissioning with RM HA

2018-07-17 Thread JIRA


[ 
https://issues.apache.org/jira/browse/YARN-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546327#comment-16546327
 ] 

Antal Bálint Steinbach commented on YARN-5464:
--

Hi [~djp],

I've got the information from [~rkanter] that this task is important for 
finishing this epic.  If you don't mind, I would like to take this task over 
from you.

> Server-Side NM Graceful Decommissioning with RM HA
> --
>
> Key: YARN-5464
> URL: https://issues.apache.org/jira/browse/YARN-5464
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful
>Reporter: Robert Kanter
>Assignee: Junping Du
>Priority: Major
> Attachments: YARN-5464.wip.patch
>
>
> Make sure to remove the note added by YARN-7094 about RM HA failover not 
> working right.






[jira] [Commented] (YARN-7590) Improve container-executor validation check

2018-07-17 Thread Aljoscha Krettek (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546289#comment-16546289
 ] 

Aljoscha Krettek commented on YARN-7590:


Hi,

I just came across this issue. I have a kerberized YARN cluster setup that used 
to work with Hadoop 2.8.3. Now I'm getting the following error:
{code}
main : run as user is hadoop-user
main : requested yarn user is hadoop-user
Permission mismatch for /hadoop-data/nm-local-dirs for caller uid: 0, owner 
uid: 1001.
Couldn't get userdir directory for hadoop-user.
{code}

{{hadoop-user}} is the user that I want to use to run my application, {{0}} is 
the uid of {{root}}, {{1001}} is the uid of the {{yarn}} user. {{hadoop-user}} 
is only in the group {{hadoop-user}}, {{yarn}} is in the groups ({{yarn}}, 
{{hadoop}}).

{{container-executor}} has these permissions:
{code}
---Sr-s--- 1 root yarn 234175 May  8 02:58 container-executor
{code}

{{container-executor.cfg}} has these permissions:
{code}
-r 1 root yarn   208 Jul 17 08:20 container-executor.cfg
{code}

My directories have these permissions:
{code}
root@slave1:/hadoop-data# ls -lah
total 16K
drwxr-xr-x 1 yarn yarn 4.0K Jul 17 08:33 .
drwxr-xr-x 1 root root 4.0K Jul 17 08:37 ..
drwxr-xr-x 1 yarn yarn 4.0K Jul 17 08:33 nm-local-dirs
drwxr-xr-x 1 yarn yarn 4.0K Jul 17 08:33 nm-log-dirs
{code}

Anyone know what could be causing this? Any help is greatly appreciated.

> Improve container-executor validation check
> ---
>
> Key: YARN-7590
> URL: https://issues.apache.org/jira/browse/YARN-7590
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: security, yarn
>Affects Versions: 2.0.1-alpha, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0, 
> 2.8.0, 2.8.1, 3.0.0-beta1
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Fix For: 2.6.6, 3.1.0, 2.10.0, 2.9.1, 3.0.1, 2.8.4, 2.7.6
>
> Attachments: YARN-7590.001.patch, YARN-7590.002.patch, 
> YARN-7590.003.patch, YARN-7590.004.patch, YARN-7590.005.patch, 
> YARN-7590.006.patch, YARN-7590.007.patch, YARN-7590.008.patch, 
> YARN-7590.009.patch, YARN-7590.010.patch, YARN-7590.branch-2.000.patch, 
> YARN-7590.branch-2.6.000.patch, YARN-7590.branch-2.7.000.patch, 
> YARN-7590.branch-2.8.000.patch, YARN-7590.branch-2.9.000.patch
>
>
> There is only minimal checking of the prefix path in container-executor.  If 
> YARN is compromised, an attacker can use container-executor to change the 
> ownership of system files:
> {code}
> /usr/local/hadoop/bin/container-executor spark yarn 0 etc /home/yarn/tokens 
> /home/spark / ls
> {code}
> This will change /etc to be owned by spark user:
> {code}
> # ls -ld /etc
> drwxr-s---. 110 spark hadoop 8192 Nov 21 20:00 /etc
> {code}
> The spark user can then rewrite files under /etc to gain more access.  We can 
> improve this with additional checks in container-executor:
> # Make sure the prefix path is owned by the same user as the caller of 
> container-executor.
> # Make sure the log directory prefix is owned by the same user as the caller.






[jira] [Commented] (YARN-8482) [Router] Add cache service for fast answers to getApps

2018-07-17 Thread Dillon Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546225#comment-16546225
 ] 

Dillon Zhang commented on YARN-8482:


Now, I'm working on this task.

> [Router] Add cache service for fast answers to getApps
> --
>
> Key: YARN-8482
> URL: https://issues.apache.org/jira/browse/YARN-8482
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Priority: Major
>







[jira] [Commented] (YARN-6482) TestSLSRunner runs but doesn't execute jobs (.json parsing issue)

2018-07-17 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546223#comment-16546223
 ] 

Weiwei Yang commented on YARN-6482:
---

Hi [~yuanbo]

Thanks for working on this. Looks like the UT fails, though; can you please 
take a look?

> TestSLSRunner runs but doesn't execute jobs (.json parsing issue)
> --
>
> Key: YARN-6482
> URL: https://issues.apache.org/jira/browse/YARN-6482
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Carlo Curino
>Assignee: Yuanbo Liu
>Priority: Minor
> Attachments: YARN-6482.001.patch
>
>
> TestSLSRunner runs correctly, bringing up an RM, but the parsing of the rumen 
> trace somehow fails silently, and no nodes or jobs are loaded. 


