[jira] [Commented] (YARN-9394) Use new API of RackResolver to get better performance

2019-04-03 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808423#comment-16808423
 ] 

Hadoop QA commented on YARN-9394:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
27s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 24s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 16s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: The patch generated 1 new + 
38 unchanged - 0 fixed = 39 total (was 38) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 55s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 26m  
1s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
30s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 77m 20s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9394 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12964664/YARN-9394.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 8c164d5bbd4c 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / aaaf856 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/23869/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23869/testReport/ |
| Max. process+thread count | 681 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client |
| Console output | 

[jira] [Commented] (YARN-9435) Add Opportunistic Scheduler metrics in ResourceManager.

2019-04-03 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808453#comment-16808453
 ] 

Hadoop QA commented on YARN-9435:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
35s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
34s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m  
9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 51s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 41s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
36s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 81m 29s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
30s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}149m 23s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService 
|
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9435 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12964662/YARN-9435.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 39875177edb0 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / aaaf856 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| unit | 

[jira] [Commented] (YARN-9281) Add express upgrade button to Appcatalog UI

2019-04-03 Thread Adam Antal (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808557#comment-16808557
 ] 

Adam Antal commented on YARN-9281:
--

Thanks for taking care of the items, LGTM (non-binding).

> Add express upgrade button to Appcatalog UI
> ---
>
> Key: YARN-9281
> URL: https://issues.apache.org/jira/browse/YARN-9281
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9281.001.patch, YARN-9281.002.patch, 
> YARN-9281.003.patch, YARN-9281.004.patch, YARN-9281.005.patch, 
> YARN-9281.006.patch, YARN-9281.007.patch
>
>
> It would be nice to have the ability to upgrade applications deployed by 
> the Application catalog from the Application catalog UI.






[jira] [Assigned] (YARN-9430) Recovering containers does not check available resources on node

2019-04-03 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-9430:


Assignee: (was: Szilard Nemeth)

> Recovering containers does not check available resources on node
> 
>
> Key: YARN-9430
> URL: https://issues.apache.org/jira/browse/YARN-9430
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Priority: Critical
>
> I have a testcase that checks that if some GPU devices go offline and recovery 
> happens, only the containers that fit into the node's resources are 
> recovered. Unfortunately, this is not the case: the RM does not check available 
> resources on the node during recovery.
> *Detailed explanation:*
> *Testcase:* 
>  1. There are 2 nodes running NodeManagers
>  2. nvidia-smi is replaced with a fake bash script that reports 2 GPU devices 
> per node, initially. This means 4 GPU devices in the cluster altogether.
>  3. RM / NM recovery is enabled
>  4. The test starts off a sleep job, requesting 4 containers, 1 GPU device 
> for each (AM does not request GPUs)
>  5. Before restart, the fake bash script is adjusted to report 1 GPU device 
> per node (2 in the cluster) after restart.
>  6. Restart is initiated.
>  
> *Expected behavior:* 
>  After restart, only the AM and 2 normal containers should have been started, 
> as there are only 2 GPU devices in the cluster.
>  
> *Actual behaviour:* 
>  AM + 4 containers are allocated, i.e. all of the containers originally started 
> in step 4.
> App id was: 1553977186701_0001
> *Logs*:
>  
> {code:java}
> 2019-03-30 13:22:30,299 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Processing event for appattempt_1553977186701_0001_01 of type RECOVER
> 2019-03-30 13:22:30,366 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Added Application Attempt appattempt_1553977186701_0001_01 to scheduler 
> from user: systest
>  2019-03-30 13:22:30,366 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> appattempt_1553977186701_0001_01 is recovering. Skipping notifying 
> ATTEMPT_ADDED
>  2019-03-30 13:22:30,367 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1553977186701_0001_01 State change from NEW to LAUNCHED on 
> event = RECOVER
> 2019-03-30 13:22:33,257 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_01, 
> CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: 
> , Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_04, 
> CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: 
> , Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
> Assigned container container_e84_1553977186701_0001_01_04 of capacity 
>  on host 
> snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers,  vCores:2, yarn.io/gpu: 1> used and  available after 
> allocation
> 2019-03-30 13:22:33,276 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_05, 
> CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: 
> , Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
>  2019-03-30 13:22:33,276 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> Processing container_e84_1553977186701_0001_01_05 of type RECOVER
>  2019-03-30 13:22:33,276 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e84_1553977186701_0001_01_05 Container Transitioned from NEW to 
> RUNNING
>  2019-03-30 13:22:33,276 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
> Assigned container container_e84_1553977186701_0001_01_05 of capacity 
>  on host 
> snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers,  vCores:3, yarn.io/gpu: 2> used and  
> available after allocation
> 2019-03-30 13:22:33,279 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_03, 
> CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: 
> , Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
>  2019-03-30 13:22:33,280 DEBUG 
> 

[jira] [Commented] (YARN-9421) Implement SafeMode for ResourceManager by defining a resource threshold

2019-04-03 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808563#comment-16808563
 ] 

Szilard Nemeth commented on YARN-9421:
--

As per our further discussion with [~wilfreds]: 

Let's require a minimum percentage of nodes, e.g. 75% of nodes registered, 
possibly combined with the timeout.
For the percentage, we should check whether the NM whitelist file is always present. 
If we don't have this file or it's empty, we need to drop the percentage 
criterion and use only the timeout value.
This should be as configurable and flexible as possible.

Another corner case: what if the whitelist contains more machines than are really 
available (IP whitelist, etc.)? 
We could also add a number of nodes to wait for as a third kind of threshold, but 
this is optional.
What we need to do with the applications: park them until we reach the threshold 
(see the sketch below).
We need to enforce an upper limit on the timeout value so users can't accidentally 
provide a very high value (e.g. 100 minutes). I would cap the timeout at 1 minute.

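To make the intended startup gating concrete, here is a minimal, hypothetical sketch of the logic described above; the class, its constructor arguments and the 60-second cap are illustrative assumptions, not an existing ResourceManager API:

{code:java}
// Hypothetical sketch of the RM startup gating discussed above; the class name,
// its parameters and the 60-second cap are illustrative assumptions only.
public final class RMSafeModeGate {

  private final long startupTimeMs = System.currentTimeMillis();
  private final long timeoutMs;          // capped, as suggested above
  private final double minRegisteredPct; // e.g. 0.75
  private final int whitelistSize;       // 0 if the NM whitelist file is missing or empty

  public RMSafeModeGate(long timeoutMs, double minRegisteredPct, int whitelistSize) {
    this.timeoutMs = Math.min(timeoutMs, 60_000L); // upper limit on the timeout
    this.minRegisteredPct = minRegisteredPct;
    this.whitelistSize = whitelistSize;
  }

  /** Returns true once "parked" application submissions may proceed. */
  public boolean allowSubmissions(int registeredNodes) {
    boolean timedOut = System.currentTimeMillis() - startupTimeMs >= timeoutMs;
    if (whitelistSize <= 0) {
      // No usable whitelist: drop the percentage criterion, use only the timeout.
      return timedOut;
    }
    boolean enoughNodes = registeredNodes >= Math.ceil(minRegisteredPct * whitelistSize);
    return enoughNodes || timedOut;
  }
}
{code}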

> Implement SafeMode for ResourceManager by defining a resource threshold
> ---
>
> Key: YARN-9421
> URL: https://issues.apache.org/jira/browse/YARN-9421
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Szilard Nemeth
>Priority: Major
> Attachments: client-log.log, nodemanager.log, resourcemanager.log
>
>
> We have a hypothetical testcase in our test suite that tests Resource Types.
>  The test does the following: 
>  1. Sets up a resource named "gpu"
>  2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
>  3. It executes a sleep job with resource requests: 
>  "-Dmapreduce.reduce.resource.gpu=7" and 
> "-Dyarn.app.mapreduce.am.resource.gpu=11"
> Sometimes, we encounter situations when the app submission fails with: 
> {code:java}
> 2019-02-25 06:09:56,795 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission 
> failed in validating AM resource request for application 
> application_1551103768202_0001
>  org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[gpu], Requested 
> resource=, maximum allowed 
> allocation=, please note that maximum allowed 
> allocation is calculated by scheduler based on maximum resource of registered 
> NodeManagers, which might be less than configured maximum 
> allocation={code}
> It's clearly visible that the maximum allowed allocation does not have any 
> "gpu" resources.
>  
> Looking into the logs further, I realized that sometimes the node having the 
> "gpu" resources is registered after the app is submitted.
>  In a real-world situation, and even with this very special test execution, we 
> can't be sure in which order NMs register with the RM.
>  With the advent of resource types, this issue became more likely to surface.
> If we have a cluster with some "rare" resources like GPUs only on some nodes 
> out of a 100, we can quickly run into a situation when the NMs with GPUs are 
> registering later than the normal nodes. While the critical NMs are still 
> registering, we will most likely experience the same 
> InvalidResourceRequestException if we submit jobs requesting GPUs.
> There is a naive solution to this: 
>  1. Give some time for RM to wait for NMs to be able to register themselves 
> and put submitted applications on hold. This could work in some situations 
> but it's not the most flexible solution as different clusters can have 
> different requirements. Of course, we can make this more flexible by making 
> the timeout value configurable.
> *A more flexible alternative would be:*
>  2. We define a threshold of Resource capability: While we haven't reached 
> this threshold, we put submitted jobs on hold. Once we reach the threshold, 
> we let jobs pass through. 
>  This is very similar to an already existing concept, the SafeMode in HDFS 
> ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
>  Back to my GPU example above, the threshold could be: 8 vcores, 16GB, 3 
> GPUs. 
>  Defining a threshold like this, we can ensure most of the submitted jobs 
> won't be lost, just "parked" until NMs are registered.
> The final solution could be the Resource threshold, or the combination of the 
> threshold and timeout value. I'm open to any other suggestions as well.
> *Last but not least, a very easy way to reproduce the issue on a 3 node 
> cluster:* 
>  1. Configure a resource type, named 'testres'.
>  2. Node1 runs RM, Node 2/3 runs NMs
>  3. Node2 has 1 testres
>  4. Node3 has 0 testres
>  5. Stop all nodes
>  

[jira] [Commented] (YARN-9421) Implement SafeMode for ResourceManager by defining a resource threshold

2019-04-03 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808567#comment-16808567
 ] 

Szilard Nemeth commented on YARN-9421:
--

[~adam.antal]: Coming back to your corner case: As [~wilfreds] said: This case 
can happen with any default resources like memory, vcores, etc.
Do you still have concerns?

[~eyang]: Thanks for your comments!
You are right about the concern that the cluster can change frequently. I haven't 
mentioned it, but I meant to: I want to use the safemode mechanism only on startup. 
If we define a low enough timeout value, jobs can't queue up, so we don't use 
much memory. I agree with you that the safemode concept shouldn't be a default 
behavior, and I never intended it to be: this is definitely planned as an 
opt-in feature.

> Implement SafeMode for ResourceManager by defining a resource threshold
> ---
>
> Key: YARN-9421
> URL: https://issues.apache.org/jira/browse/YARN-9421
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Szilard Nemeth
>Priority: Major
> Attachments: client-log.log, nodemanager.log, resourcemanager.log
>
>
> We have a hypothetical testcase in our test suite that tests Resource Types.
>  The test does the following: 
>  1. Sets up a resource named "gpu"
>  2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
>  3. It executes a sleep job with resource requests: 
>  "-Dmapreduce.reduce.resource.gpu=7" and 
> "-Dyarn.app.mapreduce.am.resource.gpu=11"
> Sometimes, we encounter situations when the app submission fails with: 
> {code:java}
> 2019-02-25 06:09:56,795 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission 
> failed in validating AM resource request for application 
> application_1551103768202_0001
>  org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[gpu], Requested 
> resource=, maximum allowed 
> allocation=, please note that maximum allowed 
> allocation is calculated by scheduler based on maximum resource of registered 
> NodeManagers, which might be less than configured maximum 
> allocation={code}
> It's clearly visible that the maximum allowed allocation does not have any 
> "gpu" resources.
>  
> Looking into the logs further, I realized that sometimes the node having the 
> "gpu" resources is registered after the app is submitted.
>  In a real-world situation, and even with this very special test execution, we 
> can't be sure in which order NMs register with the RM.
>  With the advent of resource types, this issue became more likely to surface.
> If we have a cluster with some "rare" resources like GPUs only on some nodes 
> out of a 100, we can quickly run into a situation when the NMs with GPUs are 
> registering later than the normal nodes. While the critical NMs are still 
> registering, we will most likely experience the same 
> InvalidResourceRequestException if we submit jobs requesting GPUs.
> There is a naive solution to this: 
>  1. Give some time for RM to wait for NMs to be able to register themselves 
> and put submitted applications on hold. This could work in some situations 
> but it's not the most flexible solution as different clusters can have 
> different requirements. Of course, we can make this more flexible by making 
> the timeout value configurable.
> *A more flexible alternative would be:*
>  2. We define a threshold of Resource capability: While we haven't reached 
> this threshold, we put submitted jobs on hold. Once we reach the threshold, 
> we let jobs pass through. 
>  This is very similar to an already existing concept, the SafeMode in HDFS 
> ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
>  Back to my GPU example above, the threshold could be: 8 vcores, 16GB, 3 
> GPUs. 
>  Defining a threshold like this, we can ensure most of the submitted jobs 
> won't be lost, just "parked" until NMs are registered.
> The final solution could be the Resource threshold, or the combination of the 
> threshold and timeout value. I'm open to any other suggestions as well.
> *Last but not least, a very easy way to reproduce the issue on a 3 node 
> cluster:* 
>  1. Configure a resource type, named 'testres'.
>  2. Node1 runs RM, Node 2/3 runs NMs
>  3. Node2 has 1 testres
>  4. Node3 has 0 testres
>  5. Stop all nodes
>  6. Start RM on Node1
>  7. Start NM on Node3 (the one without the resource)
>  8. Start a pi job, request 1 testres for the AM
> Here's the command to start the job:
> {code:java}
> MY_HADOOP_VERSION=3.3.0-SNAPSHOT;pushd /opt/hadoop;bin/yarn jar 
> 

[jira] [Comment Edited] (YARN-9421) Implement SafeMode for ResourceManager by defining a resource threshold

2019-04-03 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808567#comment-16808567
 ] 

Szilard Nemeth edited comment on YARN-9421 at 4/3/19 10:05 AM:
---

[~adam.antal]: Coming back to your corner case: As [~wilfreds] said: This case 
can happen with any default resources like memory, vcores, etc.
Do you still have concerns?

[~eyang]: Thanks for your comments!
You are right about the concern that the cluster can change frequently. I haven't 
mentioned it, but I meant to: I want to use the safemode mechanism only on startup. 
If we define a low enough timeout value, jobs can't queue up, so we don't use 
much memory. I agree with you that the safemode concept shouldn't be a default 
behavior, and I never intended it to be: this is definitely planned as an 
opt-in feature.
Does this answer all of your concerns / questions? I didn't really get the SLA 
part, sorry.


was (Author: snemeth):
[~adam.antal]: Coming back to your corner case: As [~wilfreds] said: This case 
can happen with any default resources like memory, vcores, etc.
Do you still have concerns?

[~eyang]: Thanks for your comments!
You are right about the concern that the cluster can change frequently. I haven't 
mentioned it, but I meant to: I want to use the safemode mechanism only on startup. 
If we define a low enough timeout value, jobs can't queue up, so we don't use 
much memory. I agree with you that the safemode concept shouldn't be a default 
behavior, and I never intended it to be: this is definitely planned as an 
opt-in feature.

> Implement SafeMode for ResourceManager by defining a resource threshold
> ---
>
> Key: YARN-9421
> URL: https://issues.apache.org/jira/browse/YARN-9421
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Szilard Nemeth
>Priority: Major
> Attachments: client-log.log, nodemanager.log, resourcemanager.log
>
>
> We have a hypothetical testcase in our test suite that tests Resource Types.
>  The test does the following: 
>  1. Sets up a resource named "gpu"
>  2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
>  3. It executes a sleep job with resource requests: 
>  "-Dmapreduce.reduce.resource.gpu=7" and 
> "-Dyarn.app.mapreduce.am.resource.gpu=11"
> Sometimes, we encounter situations when the app submission fails with: 
> {code:java}
> 2019-02-25 06:09:56,795 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission 
> failed in validating AM resource request for application 
> application_1551103768202_0001
>  org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[gpu], Requested 
> resource=, maximum allowed 
> allocation=, please note that maximum allowed 
> allocation is calculated by scheduler based on maximum resource of registered 
> NodeManagers, which might be less than configured maximum 
> allocation={code}
> It's clearly visible that the maximum allowed allocation does not have any 
> "gpu" resources.
>  
> Looking into the logs further, I realized that sometimes the node having the 
> "gpu" resources is registered after the app is submitted.
>  In a real-world situation, and even with this very special test execution, we 
> can't be sure in which order NMs register with the RM.
>  With the advent of resource types, this issue became more likely to surface.
> If we have a cluster with some "rare" resources like GPUs only on some nodes 
> out of a 100, we can quickly run into a situation when the NMs with GPUs are 
> registering later than the normal nodes. While the critical NMs are still 
> registering, we will most likely experience the same 
> InvalidResourceRequestException if we submit jobs requesting GPUs.
> There is a naive solution to this: 
>  1. Give some time for RM to wait for NMs to be able to register themselves 
> and put submitted applications on hold. This could work in some situations 
> but it's not the most flexible solution as different clusters can have 
> different requirements. Of course, we can make this more flexible by making 
> the timeout value configurable.
> *A more flexible alternative would be:*
>  2. We define a threshold of Resource capability: While we haven't reached 
> this threshold, we put submitted jobs on hold. Once we reach the threshold, 
> we let jobs pass through. 
>  This is very similar to an already existing concept, the SafeMode in HDFS 
> ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
>  Back to my GPU example above, the threshold could be: 8 vcores, 16GB, 3 
> GPUs. 
>  Defining a threshold like this, we can ensure most of the 

[jira] [Commented] (YARN-9430) Recovering containers does not check available resources on node

2019-04-03 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808578#comment-16808578
 ] 

Szilard Nemeth commented on YARN-9430:
--

As per our further discussion with [~wilfreds], we need to check the following 
further:

1. Verify that the test we execute is using work-preserving recovery: this is 
most likely the case (99%).
Why does it matter? 
Because with work-preserving recovery of the NM, we don't kill containers when the 
NM is killed or stopped; we keep them running instead. 
That's why, after restart, the containers are recovered and keep running.
As I simulated the GPU disappearing with the fake nvidia-smi script, 
containers can't detect that the GPU device disappeared.
We need to come up with a mechanism to simulate a "GPU goes offline" event while 
the containers are running. One idea is to kill the GPU binary process 
that the container communicates with, but we definitely need to look into this 
in more detail. The container should crash and finish in this case.

2. We also need to check simple (non-work-preserving) recovery as well. If the 
containers are killed on restart and we come back with fewer GPUs, we should 
still see the issue on the RM side.
In the non-work-preserving case, the RM should not allow the containers to start 
at all, as there are not enough resources for them. The application's AM should 
handle these situations.

*Nevertheless, the testcase pasted in the description should be added to the 
code, and the RM should not allow any resource to go below zero*.
A big fat error log definitely needs to be added to the deduct function 
mentioned in the description; a sketch of such a guard is shown below.

[~adam.antal], [~shuzirra], [~bsteinbach]: Anything to add?

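For illustration only, a minimal sketch of the kind of guard and error log meant above. The class and method names are hypothetical and this is not the actual scheduler code; the utility calls are the existing helpers from org.apache.hadoop.yarn.util.resource.Resources:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical guard around deducting a recovered container's capability from the
// node's remaining resources; the class and method names are illustrative only.
final class RecoveredContainerGuard {
  private static final Logger LOG =
      LoggerFactory.getLogger(RecoveredContainerGuard.class);

  static boolean deductIfFits(Resource available, Resource containerCapability) {
    if (!Resources.fitsIn(containerCapability, available)) {
      // "Big fat error log": recovering this container would push a resource below zero.
      LOG.error("Refusing to recover container: capability {} does not fit into"
          + " remaining node resources {}", containerCapability, available);
      return false;
    }
    Resources.subtractFrom(available, containerCapability);
    return true;
  }
}
{code}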

> Recovering containers does not check available resources on node
> 
>
> Key: YARN-9430
> URL: https://issues.apache.org/jira/browse/YARN-9430
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Priority: Critical
>
> I have a testcase that checks that if some GPU devices go offline and recovery 
> happens, only the containers that fit into the node's resources are 
> recovered. Unfortunately, this is not the case: the RM does not check available 
> resources on the node during recovery.
> *Detailed explanation:*
> *Testcase:* 
>  1. There are 2 nodes running NodeManagers
>  2. nvidia-smi is replaced with a fake bash script that reports 2 GPU devices 
> per node, initially. This means 4 GPU devices in the cluster altogether.
>  3. RM / NM recovery is enabled
>  4. The test starts off a sleep job, requesting 4 containers, 1 GPU device 
> for each (AM does not request GPUs)
>  5. Before restart, the fake bash script is adjusted to report 1 GPU device 
> per node (2 in the cluster) after restart.
>  6. Restart is initiated.
>  
> *Expected behavior:* 
>  After restart, only the AM and 2 normal containers should have been started, 
> as there are only 2 GPU devices in the cluster.
>  
> *Actual behaviour:* 
>  AM + 4 containers are allocated, i.e. all of the containers originally started 
> in step 4.
> App id was: 1553977186701_0001
> *Logs*:
>  
> {code:java}
> 2019-03-30 13:22:30,299 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Processing event for appattempt_1553977186701_0001_01 of type RECOVER
> 2019-03-30 13:22:30,366 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Added Application Attempt appattempt_1553977186701_0001_01 to scheduler 
> from user: systest
>  2019-03-30 13:22:30,366 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> appattempt_1553977186701_0001_01 is recovering. Skipping notifying 
> ATTEMPT_ADDED
>  2019-03-30 13:22:30,367 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1553977186701_0001_01 State change from NEW to LAUNCHED on 
> event = RECOVER
> 2019-03-30 13:22:33,257 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_01, 
> CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: 
> , Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_04, 
> CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: 
> , Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
> Assigned container container_e84_1553977186701_0001_01_04 of capacity 
>  on host 
> 

[jira] [Commented] (YARN-9436) Flaky test testApplicationLifetimeMonitor

2019-04-03 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808805#comment-16808805
 ] 

Peter Bacsko commented on YARN-9436:


Whoah, thanks [~Prabhu Joseph] - yes it's exactly the same. I'm closing this.

> Flaky test testApplicationLifetimeMonitor
> -
>
> Key: YARN-9436
> URL: https://issues.apache.org/jira/browse/YARN-9436
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> In our test environment, we occasionally encounter this failure:
> {noformat}
> 2019-04-03 12:49:32 [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
> 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, 
> Time elapsed: 215.535 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
> 2019-04-03 12:53:08 [ERROR] 
> testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor)
>   Time elapsed: 34.244 s  <<< FAILURE!
> 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before 
> lifetime value
> 2019-04-03 12:53:08   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218)
> 2019-04-03 12:53:08 
> {noformat}
> The root cause is the condition here:
> {noformat}
> Assert.assertTrue("Application killed before lifetime value",
> totalTimeRun > maxLifetime);
> {noformat}
> However, there are two problems with this condition:
>  1. Logically it's not correct. In fact, since the app should be killed after 
> 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to 
> some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up 
> being 31.
> 2. Sometimes the application is killed fast enough and {{totalTimeRun}} is 
> 30, but this is correct, because in {{setUpCSQueue}} we set the queue 
> lifetime:
> {noformat}
> csConf.setMaximumLifetimePerQueue(
> CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime);
> csConf.setDefaultLifetimePerQueue(
> CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime);
> {noformat}
> A more proper condition is:
> {noformat}
> Assert.assertTrue("Application killed before lifetime value",
> totalTimeRun >= maxLifetime);
> {noformat}
> The assertion message in the next line is also misleading:
> {noformat}
> Assert.assertTrue(
> "Application killed before lifetime value " + totalTimeRun,
> totalTimeRun < maxLifetime + 10L);
> {noformat}
> If it is false, it means that the application was killed _after_ 40 seconds, 
> which exceeds both the app's lifetime (40s) and that of the queue (30s).
> {noformat}
> Assert.assertTrue(
> "Application killed after queue/app lifetime value: " + 
> totalTimeRun,
> totalTimeRun < maxLifetime + 10L);
> {noformat}
> We can be even stricter, since we expect a kill almost immediately after 
> 30 seconds:
> {noformat}
> Assert.assertTrue(
> "Application killed too late: " + totalTimeRun,
> totalTimeRun < maxLifetime + 2L);
> {noformat}
> where we allow a 2 second tolerance.






[jira] [Commented] (YARN-5670) Add support for Docker image clean up

2019-04-03 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809338#comment-16809338
 ] 

Eric Yang commented on YARN-5670:
-

In today's YARN Docker meeting, there was consensus on having the Node Manager 
track LRU by digest ID and apply a mark-and-sweep algorithm to prune images seen by 
the node manager (a rough sketch follows).  The open concern is still the corner 
case where locally tagged system admin images can get deleted while the same image 
is used by a job.  Kubernetes tackles the Docker image pruning problem by assuming 
that the [system should not require human operators to work 
reliably|https://thenewstack.io/deletion-garbage-collection-kubernetes-objects/].
  I think this is a safe assumption, and I will wait for [~shaneku...@gmail.com] and 
[~ebadger] to process this information.

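As a rough illustration of the mark-and-sweep idea above (not an existing Node Manager API; the class, the ImageRemover hook and the LRU bookkeeping are assumptions for this sketch), the NM could mark every digest still referenced by a container and sweep the rest in least-recently-used order:

{code:java}
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical mark-and-sweep pruning of locally cached Docker images, tracked
// by digest ID in least-recently-used order; all names here are illustrative.
final class DockerImagePruner {

  // Digest -> last-use timestamp, iterated from least to most recently used.
  private final LinkedHashMap<String, Long> lruByDigest =
      new LinkedHashMap<>(16, 0.75f, true);

  void recordUse(String digest) {
    lruByDigest.put(digest, System.currentTimeMillis());
  }

  /** Sweeps unreferenced images, oldest first, until at most maxImages remain. */
  void markAndSweep(Set<String> digestsInUse, int maxImages, ImageRemover remover) {
    Iterator<Map.Entry<String, Long>> it = lruByDigest.entrySet().iterator();
    while (it.hasNext() && lruByDigest.size() > maxImages) {
      Map.Entry<String, Long> oldest = it.next();
      if (digestsInUse.contains(oldest.getKey())) {
        continue; // marked: still referenced by a container, keep it
      }
      remover.remove(oldest.getKey()); // e.g. could shell out to "docker rmi <digest>"
      it.remove();
    }
  }

  interface ImageRemover {
    void remove(String digest);
  }
}
{code}
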
> Add support for Docker image clean up
> -
>
> Key: YARN-5670
> URL: https://issues.apache.org/jira/browse/YARN-5670
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Zhankun Tang
>Priority: Major
>  Labels: Docker
> Attachments: Localization Support For Docker Images_002.pdf
>
>
> Regarding to Docker image localization, we also need a way to clean up the 
> old/stale Docker image to save storage space. We may extend deletion service 
> to utilize "docker rm" to do this.
> This is related to YARN-3854 and may depend on its implementation. Please 
> refer to YARN-3854 for Docker image localization details.






[jira] [Commented] (YARN-9254) Externalize Solr data storage

2019-04-03 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809154#comment-16809154
 ] 

Hadoop QA commented on YARN-9254:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 19s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
17s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} shellcheck {color} | {color:red}  0m  
0s{color} | {color:red} The patch generated 1 new + 0 unchanged - 0 fixed = 1 
total (was 0) {color} |
| {color:green}+1{color} | {color:green} shelldocs {color} | {color:green}  0m 
19s{color} | {color:green} The patch generated 0 new + 104 unchanged - 132 
fixed = 104 total (was 236) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 10s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
19s{color} | {color:green} hadoop-yarn-applications-catalog-docker in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
31s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 43m 22s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9254 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12964761/YARN-9254.001.patch |
| Optional Tests |  dupname  asflicense  mvnsite  unit  shellcheck  shelldocs  |
| uname | Linux b5df5453a28a 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / d797907 |
| maven | version: Apache Maven 3.3.9 |
| shellcheck | v0.4.6 |
| shellcheck | 
https://builds.apache.org/job/PreCommit-YARN-Build/23871/artifact/out/diff-patch-shellcheck.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23871/testReport/ |
| Max. process+thread count | 413 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-docker
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-docker
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/23871/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Externalize Solr data storage
> -
>
> Key: YARN-9254
> URL: https://issues.apache.org/jira/browse/YARN-9254
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9254.001.patch
>
>
> The Application catalog contains an embedded Solr.  By default, Solr data is 
> stored in the temp space of the Docker container.  For users who would like to 
> persist Solr data on HDFS, it would be nice to have a way to pass 

[jira] [Updated] (YARN-9254) Externalize Solr data storage

2019-04-03 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-9254:

Attachment: YARN-9254.002.patch

> Externalize Solr data storage
> -
>
> Key: YARN-9254
> URL: https://issues.apache.org/jira/browse/YARN-9254
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9254.001.patch, YARN-9254.002.patch
>
>
> The Application catalog contains an embedded Solr.  By default, Solr data is 
> stored in the temp space of the Docker container.  For users who would like to 
> persist Solr data on HDFS, it would be nice to have a way to pass the 
> solr.hdfs.home setting to the embedded Solr to externalize Solr data storage.  
> This also implies passing Kerberos credential settings to the Solr JVM in order 
> to access secure HDFS.






[jira] [Assigned] (YARN-9254) Externalize Solr data storage

2019-04-03 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang reassigned YARN-9254:
---

Assignee: Eric Yang

> Externalize Solr data storage
> -
>
> Key: YARN-9254
> URL: https://issues.apache.org/jira/browse/YARN-9254
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9254.001.patch
>
>
> The Application catalog contains an embedded Solr.  By default, Solr data is 
> stored in the temp space of the Docker container.  For users who would like to 
> persist Solr data on HDFS, it would be nice to have a way to pass the 
> solr.hdfs.home setting to the embedded Solr to externalize Solr data storage.  
> This also implies passing Kerberos credential settings to the Solr JVM in order 
> to access secure HDFS.






[jira] [Updated] (YARN-9254) Externalize Solr data storage

2019-04-03 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-9254:

Attachment: YARN-9254.001.patch

> Externalize Solr data storage
> -
>
> Key: YARN-9254
> URL: https://issues.apache.org/jira/browse/YARN-9254
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Priority: Major
> Attachments: YARN-9254.001.patch
>
>
> The Application catalog contains an embedded Solr.  By default, Solr data is 
> stored in the temp space of the Docker container.  For users who would like to 
> persist Solr data on HDFS, it would be nice to have a way to pass the 
> solr.hdfs.home setting to the embedded Solr to externalize Solr data storage.  
> This also implies passing Kerberos credential settings to the Solr JVM in order 
> to access secure HDFS.






[jira] [Commented] (YARN-9080) Bucket Directories as part of ATS done accumulates

2019-04-03 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808801#comment-16808801
 ] 

Prabhu Joseph commented on YARN-9080:
-

Thanks [~snemeth] and [~pbacsko] for the detailed explanation. Working on it, 
will update you.

> Bucket Directories as part of ATS done accumulates
> --
>
> Key: YARN-9080
> URL: https://issues.apache.org/jira/browse/YARN-9080
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: 0001-YARN-9080.patch, 0002-YARN-9080.patch, 
> 0003-YARN-9080.patch, YARN-9080-004.patch, YARN-9080-005.patch, 
> YARN-9080-006.patch
>
>
> We have observed that older bucket directories (cluster_timestamp, bucket1 and 
> bucket2) accumulate under the ATS done directory. The cleanLogs part of 
> EntityLogCleaner removes only the app directories and not the bucket directories.






[jira] [Commented] (YARN-9421) Implement SafeMode for ResourceManager by defining a resource threshold

2019-04-03 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808945#comment-16808945
 ] 

Eric Yang commented on YARN-9421:
-

[~snemeth] An SLA is a predefined time window in which a program is allowed to run.  
If resources go away and cause jobs to queue up without running (admins set up cron 
jobs to automatically restart YARN when the system is down), applications may miss 
their opportunity to execute because they remain in safe mode for an extended 
period of time.

The proposal is an optional feature that defaults to disabled.  Hence, my concern 
is addressed.  Thank you.

> Implement SafeMode for ResourceManager by defining a resource threshold
> ---
>
> Key: YARN-9421
> URL: https://issues.apache.org/jira/browse/YARN-9421
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Szilard Nemeth
>Priority: Major
> Attachments: client-log.log, nodemanager.log, resourcemanager.log
>
>
> We have a hypothetical testcase in our test suite that tests Resource Types.
>  The test does the following: 
>  1. Sets up a resource named "gpu"
>  2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
>  3. It executes a sleep job with resource requests: 
>  "-Dmapreduce.reduce.resource.gpu=7" and 
> "-Dyarn.app.mapreduce.am.resource.gpu=11"
> Sometimes, we encounter situations when the app submission fails with: 
> {code:java}
> 2019-02-25 06:09:56,795 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission 
> failed in validating AM resource request for application 
> application_1551103768202_0001
>  org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[gpu], Requested 
> resource=, maximum allowed 
> allocation=, please note that maximum allowed 
> allocation is calculated by scheduler based on maximum resource of registered 
> NodeManagers, which might be less than configured maximum 
> allocation={code}
> It's clearly visible that the maximum allowed allocation does not have any 
> "gpu" resources.
>  
> Looking into the logs further, I realized that sometimes the node having the 
> "gpu" resources is registered after the app is submitted.
>  In a real-world situation, and even with this very special test execution, we 
> can't be sure in which order NMs register with the RM.
>  With the advent of resource types, this issue became more likely to surface.
> If we have a cluster with some "rare" resources like GPUs only on some nodes 
> out of a 100, we can quickly run into a situation when the NMs with GPUs are 
> registering later than the normal nodes. While the critical NMs are still 
> registering, we will most likely experience the same 
> InvalidResourceRequestException if we submit jobs requesting GPUs.
> There is a naive solution to this: 
>  1. Give some time for RM to wait for NMs to be able to register themselves 
> and put submitted applications on hold. This could work in some situations 
> but it's not the most flexible solution as different clusters can have 
> different requirements. Of course, we can make this more flexible by making 
> the timeout value configurable.
> *A more flexible alternative would be:*
>  2. We define a threshold of Resource capability: While we haven't reached 
> this threshold, we put submitted jobs on hold. Once we reach the threshold, 
> we let jobs pass through. 
>  This is very similar to an already existing concept, the SafeMode in HDFS 
> ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
>  Back to my GPU example above, the threshold could be: 8 vcores, 16GB, 3 
> GPUs. 
>  Defining a threshold like this, we can ensure most of the submitted jobs 
> won't be lost, just "parked" until NMs are registered.
> The final solution could be the Resource threshold, or the combination of the 
> threshold and timeout value. I'm open to any other suggestions as well.
> *Last but not least, a very easy way to reproduce the issue on a 3 node 
> cluster:* 
>  1. Configure a resource type, named 'testres'.
>  2. Node1 runs RM, Node 2/3 runs NMs
>  3. Node2 has 1 testres
>  4. Node3 has 0 testres
>  5. Stop all nodes
>  6. Start RM on Node1
>  7. Start NM on Node3 (the one without the resource)
>  8. Start a pi job, request 1 testres for the AM
> Here's the command to start the job:
> {code:java}
> MY_HADOOP_VERSION=3.3.0-SNAPSHOT;pushd /opt/hadoop;bin/yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" 
> pi -Dyarn.app.mapreduce.am.resource.testres=1 1 1000;popd{code}
>  
> *Configurations*: 
>  node1: yarn-site.xml of ResourceManager:
> {code:java}
> 
>  

[jira] [Created] (YARN-9436) Flaky test testApplicationLifetimeMonitor

2019-04-03 Thread Peter Bacsko (JIRA)
Peter Bacsko created YARN-9436:
--

 Summary: Flaky test testApplicationLifetimeMonitor
 Key: YARN-9436
 URL: https://issues.apache.org/jira/browse/YARN-9436
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler, test
Reporter: Peter Bacsko
Assignee: Peter Bacsko


In our test environment, we occasionally encounter this failure:
{noformat}
2019-04-03 12:49:32 [INFO] Running 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, 
Time elapsed: 215.535 s <<< FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
2019-04-03 12:53:08 [ERROR] 
testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor)
  Time elapsed: 34.244 s  <<< FAILURE!
2019-04-03 12:53:08 java.lang.AssertionError: Application killed before 
lifetime value
2019-04-03 12:53:08 at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218)
2019-04-03 12:53:08 
{noformat}
The root cause is the condition here:
{noformat}
Assert.assertTrue("Application killed before lifetime value",
totalTimeRun > maxLifetime);
{noformat}
However, there are two problems with this condition:
 1. Logically it's not correct. In fact, since the app should be killed after 
30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to some 
asynchronicity and rounding, most of the time {{totalTimeRun}} ends up being 31.

2. Sometimes the application is killed fast enough and {{totalTimeRun}} is 30, 
but this is correct, because in {{setUpCSQueue}} we set the queue lifetime:
{noformat}
csConf.setMaximumLifetimePerQueue(
CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime);
csConf.setDefaultLifetimePerQueue(
CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime);
{noformat}
A more proper condition is:
{noformat}
Assert.assertTrue("Application killed before lifetime value",
totalTimeRun >= maxLifetime);
{noformat}
The assertion message in the next line is also misleading:
{noformat}
Assert.assertTrue(
"Application killed before lifetime value " + totalTimeRun,
totalTimeRun < maxLifetime + 10L);
{noformat}
If it is false, it means that the application was killed _after_ 40 seconds, which 
exceeds both the app's lifetime (40s) and that of the queue (30s).
{noformat}
Assert.assertTrue(
"Application killed after queue/app lifetime value: " + 
totalTimeRun,
totalTimeRun < maxLifetime + 10L);
{noformat}
We can be even stricter, since we expect a kill almost immediately after 30 
seconds:
{noformat}
Assert.assertTrue(
"Application killed too late: " + totalTimeRun,
totalTimeRun < maxLifetime + 2L);
{noformat}
where we allow a 2 second tolerance.
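
To make the intended check explicit, here is a minimal, self-contained sketch of 
the two bounds together (the class and method names are illustrative, not the 
actual test; maxLifetime is the 30-second queue lifetime set in {{setUpCSQueue}}):
{code:java}
// Illustrative sketch only: combines the corrected lower bound (not killed
// early) with the tightened upper bound (killed promptly, 2 second tolerance).
public class LifetimeBoundsSketch {
  static boolean withinBounds(long totalTimeRun, long maxLifetime) {
    return totalTimeRun >= maxLifetime        // not killed before the lifetime
        && totalTimeRun < maxLifetime + 2L;   // killed soon after it expires
  }

  public static void main(String[] args) {
    long maxLifetime = 30L;
    System.out.println(withinBounds(30L, maxLifetime)); // true  (fast kill)
    System.out.println(withinBounds(31L, maxLifetime)); // true  (typical, rounding)
    System.out.println(withinBounds(29L, maxLifetime)); // false (killed too early)
  }
}
{code}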



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9436) Flaky test testApplicationLifetimeMonitor

2019-04-03 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808803#comment-16808803
 ] 

Prabhu Joseph commented on YARN-9436:
-

[~pbacsko] I think this issue will be fixed by YARN-9404. Could you validate the 
same?

> Flaky test testApplicationLifetimeMonitor
> -
>
> Key: YARN-9436
> URL: https://issues.apache.org/jira/browse/YARN-9436
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> In our test environment, we occasionally encounter this failure:
> {noformat}
> 2019-04-03 12:49:32 [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
> 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, 
> Time elapsed: 215.535 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
> 2019-04-03 12:53:08 [ERROR] 
> testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor)
>   Time elapsed: 34.244 s  <<< FAILURE!
> 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before 
> lifetime value
> 2019-04-03 12:53:08   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218)
> 2019-04-03 12:53:08 
> {noformat}
> The root cause is the condition here:
> {noformat}
> Assert.assertTrue("Application killed before lifetime value",
> totalTimeRun > maxLifetime);
> {noformat}
> However, there are two problems with this condition:
>  1. Logically it's not correct. In fact, since the app should be killed after 
> 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to 
> some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up 
> being 31.
> 2. Sometimes the application is killed fast enough and {{totalTimeRun}} is 
> 30, but this is correct, because in {{setUpCSQueue}} we set the queue 
> lifetime:
> {noformat}
> csConf.setMaximumLifetimePerQueue(
> CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime);
> csConf.setDefaultLifetimePerQueue(
> CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime);
> {noformat}
> A more proper condition is:
> {noformat}
> Assert.assertTrue("Application killed before lifetime value",
> totalTimeRun >= maxLifetime);
> {noformat}
> The assertion message in the next line is also misleading:
> {noformat}
> Assert.assertTrue(
> "Application killed before lifetime value " + totalTimeRun,
> totalTimeRun < maxLifetime + 10L);
> {noformat}
> If it is false, it means that the application was killed _after_ 40 seconds, 
> which exceeds both the app's lifetime (40s) and that of the queue (30s).
> {noformat}
> Assert.assertTrue(
> "Application killed after queue/app lifetime value: " + 
> totalTimeRun,
> totalTimeRun < maxLifetime + 10L);
> {noformat}
> We can be even stricter, since we expect a kill almost immediately after 
> 30 seconds:
> {noformat}
> Assert.assertTrue(
> "Application killed too late: " + totalTimeRun,
> totalTimeRun < maxLifetime + 2L);
> {noformat}
> where we allow a 2 second tolerance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9436) Flaky test testApplicationLifetimeMonitor

2019-04-03 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YARN-9436.

Resolution: Duplicate

> Flaky test testApplicationLifetimeMonitor
> -
>
> Key: YARN-9436
> URL: https://issues.apache.org/jira/browse/YARN-9436
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> In our test environment, we occasionally encounter this failure:
> {noformat}
> 2019-04-03 12:49:32 [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
> 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, 
> Time elapsed: 215.535 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
> 2019-04-03 12:53:08 [ERROR] 
> testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor)
>   Time elapsed: 34.244 s  <<< FAILURE!
> 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before 
> lifetime value
> 2019-04-03 12:53:08   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218)
> 2019-04-03 12:53:08 
> {noformat}
> The root cause is the condition here:
> {noformat}
> Assert.assertTrue("Application killed before lifetime value",
> totalTimeRun > maxLifetime);
> {noformat}
> However, there are two problems with this condition:
>  1. Logically it's not correct. In fact, since the app should be killed after 
> 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to 
> some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up 
> being 31.
> 2. Sometimes the application is killed fast enough and {{totalTimeRun}} is 
> 30, but this is correct, because in {{setUpCSQueue}} we set the queue 
> lifetime:
> {noformat}
> csConf.setMaximumLifetimePerQueue(
> CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime);
> csConf.setDefaultLifetimePerQueue(
> CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime);
> {noformat}
> A more proper condition is:
> {noformat}
> Assert.assertTrue("Application killed before lifetime value",
> totalTimeRun >= maxLifetime);
> {noformat}
> The assertion message in the next line is also misleading:
> {noformat}
> Assert.assertTrue(
> "Application killed before lifetime value " + totalTimeRun,
> totalTimeRun < maxLifetime + 10L);
> {noformat}
> If it is false, it means that the application was killed _after_ 40 seconds, 
> which exceeds both the app's lifetime (40s) and that of the queue (30s).
> {noformat}
> Assert.assertTrue(
> "Application killed after queue/app lifetime value: " + 
> totalTimeRun,
> totalTimeRun < maxLifetime + 10L);
> {noformat}
> We can be even stricter, since we expect a kill almost immediately after 
> 30 seconds:
> {noformat}
> Assert.assertTrue(
> "Application killed too late: " + totalTimeRun,
> totalTimeRun < maxLifetime + 2L);
> {noformat}
> where we allow a 2 second tolerance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9435) Add Opportunistic Scheduler metrics in ResourceManager.

2019-04-03 Thread Abhishek Modi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Modi updated YARN-9435:

Attachment: YARN-9435.003.patch

> Add Opportunistic Scheduler metrics in ResourceManager.
> ---
>
> Key: YARN-9435
> URL: https://issues.apache.org/jira/browse/YARN-9435
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9435.001.patch, YARN-9435.002.patch, 
> YARN-9435.003.patch
>
>
> Right now there are no metrics available for the Opportunistic Scheduler in the 
> ResourceManager. As part of this jira, we will add metrics such as the number of 
> allocated opportunistic containers, released opportunistic containers, node-level 
> allocations, rack-level allocations, etc. for the Opportunistic Scheduler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9435) Add Opportunistic Scheduler metrics in ResourceManager.

2019-04-03 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808951#comment-16808951
 ] 

Hadoop QA commented on YARN-9435:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
23s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
32s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 13s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 28s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
58s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 83m 
30s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
30s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}154m 21s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9435 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12964709/YARN-9435.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux a37adc2c1817 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 002dcc4 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23870/testReport/ |
| Max. process+thread count | 915 (vs. 

[jira] [Commented] (YARN-9303) Username splits won't help timelineservice.app_flow table

2019-04-03 Thread Vrushali C (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809392#comment-16809392
 ] 

Vrushali C commented on YARN-9303:
--

+1 to patch v1. I am reviewing the other patch. But this one is correct, will 
commit shortly 

> Username splits won't help timelineservice.app_flow table
> -
>
> Key: YARN-9303
> URL: https://issues.apache.org/jira/browse/YARN-9303
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: Only_Last_Region_Used.png, YARN-9303-001.patch
>
>
> The timelineservice.app_flow HBase table uses pre-split logic based on username, 
> whereas the row keys start with an inverted timestamp (Long.MAX_VALUE - ts). All 
> data will go to the last region and the remaining regions will never receive 
> any data. We need to choose the right splits or use auto-splitting.
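
To illustrate the mismatch in the description above (the values below are purely 
illustrative, not taken from a real table):
{code:java}
// Illustrative only: the app_flow row key prefix is the inverted timestamp
// (Long.MAX_VALUE - ts), not a username, so username-based pre-split points
// are never matched and, per the description, all writes land in a single
// (the last) region.
public class InvertedTimestampSketch {
  public static void main(String[] args) {
    long ts = System.currentTimeMillis();
    long invertedTs = Long.MAX_VALUE - ts;  // prefix used by the row key
    System.out.println("row key prefix (inverted ts): " + invertedTs);
    // Split points derived from usernames such as "a", "ad", "an", "b", "ca"
    // have nothing to do with this prefix, so they never spread the load.
  }
}
{code}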



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9303) Username splits won't help timelineservice.app_flow table

2019-04-03 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809497#comment-16809497
 ] 

Prabhu Joseph commented on YARN-9303:
-

Thanks [~vrushalic] for reviewing.

> Username splits won't help timelineservice.app_flow table
> -
>
> Key: YARN-9303
> URL: https://issues.apache.org/jira/browse/YARN-9303
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: Only_Last_Region_Used.png, YARN-9303-001.patch
>
>
> The timelineservice.app_flow HBase table uses pre-split logic based on username, 
> whereas the row keys start with an inverted timestamp (Long.MAX_VALUE - ts). All 
> data will go to the last region and the remaining regions will never receive 
> any data. We need to choose the right splits or use auto-splitting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9254) Externalize Solr data storage

2019-04-03 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809341#comment-16809341
 ] 

Hadoop QA commented on YARN-9254:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
17s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
32s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 19s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
19s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} shellcheck {color} | {color:red}  0m  
0s{color} | {color:red} The patch generated 1 new + 0 unchanged - 0 fixed = 1 
total (was 0) {color} |
| {color:green}+1{color} | {color:green} shelldocs {color} | {color:green}  0m 
18s{color} | {color:green} The patch generated 0 new + 104 unchanged - 132 
fixed = 104 total (was 236) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 15s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
18s{color} | {color:green} hadoop-yarn-applications-catalog-docker in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
19s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
30s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 46m 50s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9254 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12964779/YARN-9254.002.patch |
| Optional Tests |  dupname  asflicense  mvnsite  unit  shellcheck  shelldocs  |
| uname | Linux df6fe5d932c3 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 8ff41d6 |
| maven | version: Apache Maven 3.3.9 |
| shellcheck | v0.4.6 |
| shellcheck | 
https://builds.apache.org/job/PreCommit-YARN-Build/23872/artifact/out/diff-patch-shellcheck.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23872/testReport/ |
| Max. process+thread count | 447 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-docker
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site U: 
hadoop-yarn-project/hadoop-yarn |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/23872/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Externalize Solr data storage
> -
>
> Key: YARN-9254
> URL: https://issues.apache.org/jira/browse/YARN-9254
> Project: Hadoop YARN
>

[jira] [Assigned] (YARN-8466) Add Chaos Monkey unit test framework for feature validation in scale

2019-04-03 Thread Yesha Vora (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesha Vora reassigned YARN-8466:


Assignee: Yesha Vora

> Add Chaos Monkey unit test framework for feature validation in scale
> 
>
> Key: YARN-8466
> URL: https://issues.apache.org/jira/browse/YARN-8466
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Yesha Vora
>Priority: Critical
> Attachments: YARN-8466.poc.001.patch
>
>
> Currently we don't have such a framework for testing. 
> We need a framework to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9335) [atsv2] Restrict the number of elements held in timeline collector when backend is unreachable for async calls

2019-04-03 Thread Vrushali C (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809385#comment-16809385
 ] 

Vrushali C commented on YARN-9335:
--

Thanks for the patch v3 Abhishek, lgtm. Will commit shortly 

> [atsv2] Restrict the number of elements held in timeline collector when 
> backend is unreachable for async calls
> --
>
> Key: YARN-9335
> URL: https://issues.apache.org/jira/browse/YARN-9335
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vrushali C
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9335.001.patch, YARN-9335.002.patch, 
> YARN-9335.003.patch
>
>
> For ATSv2 , if the backend is unreachable, the number/size of data held in 
> timeline collector's memory increases significantly. This is not good for the 
> NM memory. 
> Filing jira to set a limit on how many/much should be retained by the 
> timeline collector in memory in case the backend is not reachable.
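
As a generic illustration of the limiting idea above (not the actual patch; the 
class name and capacity are assumptions), a capacity-bounded buffer that drops 
new async entities once it is full could look like this:
{code:java}
import java.util.concurrent.LinkedBlockingQueue;

// Generic sketch: bound what is kept in memory for async writes and drop new
// entities once the bound is reached, instead of growing without limit while
// the backend is unreachable.
public class BoundedAsyncBuffer<T> {
  private final LinkedBlockingQueue<T> pending;

  public BoundedAsyncBuffer(int capacity) {
    this.pending = new LinkedBlockingQueue<>(capacity);
  }

  /** Returns false (entity dropped) when the buffer is already full. */
  public boolean offer(T entity) {
    return pending.offer(entity);
  }

  public T poll() {
    return pending.poll();
  }

  public static void main(String[] args) {
    BoundedAsyncBuffer<String> buf = new BoundedAsyncBuffer<>(2);
    System.out.println(buf.offer("entity-1")); // true
    System.out.println(buf.offer("entity-2")); // true
    System.out.println(buf.offer("entity-3")); // false (capacity reached, dropped)
  }
}
{code}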



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9438) launchTime not written to state store for running applications

2019-04-03 Thread Jonathan Hung (JIRA)
Jonathan Hung created YARN-9438:
---

 Summary: launchTime not written to state store for running 
applications
 Key: YARN-9438
 URL: https://issues.apache.org/jira/browse/YARN-9438
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jonathan Hung
Assignee: Jonathan Hung


launchTime is only saved to state store after application finishes, so if 
restart happens, any running applications will have launchTime set as -1 (since 
this is the default timestamp of the recovery event).
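
A minimal, self-contained illustration of the gap (an in-memory stand-in with 
hypothetical names, not the actual RM state-store API):
{code:java}
import java.util.HashMap;
import java.util.Map;

// Stand-in for the state store: today launchTime is only persisted at finish,
// so a restart while the app is still running recovers the -1 default.
public class LaunchTimeRecoverySketch {
  private final Map<String, Long> persisted = new HashMap<>();

  void persistOnFinish(String appId, long launchTime) {
    persisted.put(appId, launchTime);  // current behavior per the description
  }

  void persistOnLaunch(String appId, long launchTime) {
    persisted.put(appId, launchTime);  // proposed: also write it at launch
  }

  long recover(String appId) {
    return persisted.getOrDefault(appId, -1L);  // -1 default on recovery
  }

  public static void main(String[] args) {
    LaunchTimeRecoverySketch store = new LaunchTimeRecoverySketch();
    // App launched but not yet finished when the RM restarts:
    System.out.println(store.recover("app_1"));  // -1 today
    store.persistOnLaunch("app_1", 1554300000000L);
    System.out.println(store.recover("app_1"));  // launch time now survives
  }
}
{code}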



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9394) Use new API of RackResolver to get better performance

2019-04-03 Thread Lantao Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809476#comment-16809476
 ] 

Lantao Jin commented on YARN-9394:
--

Attached  [^YARN-9394.003.patch] to address the checkstyle issue.

> Use new API of RackResolver to get better performance
> -
>
> Key: YARN-9394
> URL: https://issues.apache.org/jira/browse/YARN-9394
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.0, 3.2.1
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Attachments: YARN-9394.001.patch, YARN-9394.002.patch, 
> YARN-9394.003.patch
>
>
> After adding a new API in RackResolver YARN-9332, some old callers should 
> switch to new API to get better performance. As an example, Spark 
> [YarnAllocator|https://github.com/apache/spark/blob/733f2c0b98208815f8408e36ab669d7c07e3767f/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L361-L363]
>  for Dynamic Allocation invokes 
> [https://github.com/apache/hadoop/blob/6fa229891e06eea62cb9634efde755f40247e816/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/AMRMClientImpl.java#L550]
>  to resolve racks in a loop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9394) Use new API of RackResolver to get better performance

2019-04-03 Thread Lantao Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated YARN-9394:
-
Attachment: YARN-9394.003.patch

> Use new API of RackResolver to get better performance
> -
>
> Key: YARN-9394
> URL: https://issues.apache.org/jira/browse/YARN-9394
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.0, 3.2.1
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Attachments: YARN-9394.001.patch, YARN-9394.002.patch, 
> YARN-9394.003.patch
>
>
> After adding a new API in RackResolver YARN-9332, some old callers should 
> switch to new API to get better performance. As an example, Spark 
> [YarnAllocator|https://github.com/apache/spark/blob/733f2c0b98208815f8408e36ab669d7c07e3767f/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L361-L363]
>  for Dynamic Allocation invokes 
> [https://github.com/apache/hadoop/blob/6fa229891e06eea62cb9634efde755f40247e816/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/AMRMClientImpl.java#L550]
>  to resolve racks in a loop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9382) Publish container killed, paused and resumed events to ATSv2.

2019-04-03 Thread Vrushali C (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809387#comment-16809387
 ] 

Vrushali C commented on YARN-9382:
--

thanks Abhishek, patch v2 looks good. Will commit it shortly

> Publish container killed, paused and resumed events to ATSv2.
> -
>
> Key: YARN-9382
> URL: https://issues.apache.org/jira/browse/YARN-9382
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9382.001.patch, YARN-9382.002.patch
>
>
> There are some events missing in container lifecycle. We need to add support 
> for adding events for when container gets killed, paused and resumed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9373) HBaseTimelineSchemaCreator has to allow user to configure pre-splits

2019-04-03 Thread Vrushali C (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809390#comment-16809390
 ] 

Vrushali C commented on YARN-9373:
--

Thanks Prabhu, overall patch v2 looks good. I want to look at it in a bit more 
detail today. I will either update the jira with comments or commit it. 

> HBaseTimelineSchemaCreator has to allow user to configure pre-splits
> 
>
> Key: YARN-9373
> URL: https://issues.apache.org/jira/browse/YARN-9373
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: Configurable_PreSplits.png, YARN-9373-001.patch, 
> YARN-9373-002.patch
>
>
> Most of the TimelineService HBase tables are set up with username splits, which 
> are based on the lowercase alphabet (a,ad,an,b,ca). This won't help if the rowkey 
> starts with a number or an uppercase letter. We need to allow the user to 
> configure the splits based on their data. For example, if a user has configured 
> yarn.resourcemanager.cluster-id to be ATS or 123, then the splits can be 
> configured as A,B,C,,, or 100,200,300,,,
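
As a rough illustration of the configurable pre-split idea (hypothetical parsing 
code, not the actual HBaseTimelineSchemaCreator changes), a comma-separated list 
such as "A,B,C" or "100,200,300" could be turned into HBase split keys like this:
{code:java}
import java.nio.charset.StandardCharsets;

// Hypothetical helper: turn a user-supplied comma-separated split list into
// the byte[][] split keys that HBase table creation expects.
public class PreSplitSketch {
  static byte[][] parseSplits(String commaSeparated) {
    String[] parts = commaSeparated.split(",");
    byte[][] splits = new byte[parts.length][];
    for (int i = 0; i < parts.length; i++) {
      splits[i] = parts[i].trim().getBytes(StandardCharsets.UTF_8);
    }
    return splits;
  }

  public static void main(String[] args) {
    System.out.println(parseSplits("A,B,C").length);       // 3
    System.out.println(parseSplits("100,200,300").length); // 3
  }
}
{code}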



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9437) RMNodeImpls occupy too much memory and causes RM GC to take a long time

2019-04-03 Thread qiuliang (JIRA)
qiuliang created YARN-9437:
--

 Summary: RMNodeImpls occupy too much memory and causes RM GC to 
take a long time
 Key: YARN-9437
 URL: https://issues.apache.org/jira/browse/YARN-9437
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.9.1
Reporter: qiuliang
 Attachments: 1.png, 2.png, 3.png

We use hadoop-2.9.1 in our production environment with 1600+ nodes. 95.63% of 
RM memory is occupied by RMNodeImpl instances. Analysis of the RM memory found 
that each RMNodeImpl takes approximately 14 MB (on the order of 1600 x 14 MB, 
roughly 22 GB in total). The reason is that there are 130,000+ completed 
containers in each RMNodeImpl that have never been released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9303) Username splits won't help timelineservice.app_flow table

2019-04-03 Thread Vrushali C (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vrushali C updated YARN-9303:
-
Labels: atsv2 atsv2-hbase  (was: atsv2)

> Username splits won't help timelineservice.app_flow table
> -
>
> Key: YARN-9303
> URL: https://issues.apache.org/jira/browse/YARN-9303
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: atsv2, atsv2-hbase
> Attachments: Only_Last_Region_Used.png, YARN-9303-001.patch
>
>
> The timelineservice.app_flow HBase table uses pre-split logic based on username, 
> whereas the row keys start with an inverted timestamp (Long.MAX_VALUE - ts). All 
> data will go to the last region and the remaining regions will never receive 
> any data. We need to choose the right splits or use auto-splitting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9303) Username splits won't help timelineservice.app_flow table

2019-04-03 Thread Vrushali C (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vrushali C updated YARN-9303:
-
Labels: atsv2  (was: )

> Username splits won't help timelineservice.app_flow table
> -
>
> Key: YARN-9303
> URL: https://issues.apache.org/jira/browse/YARN-9303
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: atsv2
> Attachments: Only_Last_Region_Used.png, YARN-9303-001.patch
>
>
> The timelineservice.app_flow HBase table uses pre-split logic based on username, 
> whereas the row keys start with an inverted timestamp (Long.MAX_VALUE - ts). All 
> data will go to the last region and the remaining regions will never receive 
> any data. We need to choose the right splits or use auto-splitting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9382) Publish container killed, paused and resumed events to ATSv2.

2019-04-03 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809536#comment-16809536
 ] 

Abhishek Modi commented on YARN-9382:
-

Thanks Vrushali - let me check at my end.

> Publish container killed, paused and resumed events to ATSv2.
> -
>
> Key: YARN-9382
> URL: https://issues.apache.org/jira/browse/YARN-9382
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9382.001.patch, YARN-9382.002.patch
>
>
> There are some events missing in container lifecycle. We need to add support 
> for adding events for when container gets killed, paused and resumed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9382) Publish container killed, paused and resumed events to ATSv2.

2019-04-03 Thread Vrushali C (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809527#comment-16809527
 ] 

Vrushali C commented on YARN-9382:
--

Hi Abhishek, 
I am somehow not able to apply the patch (with p0 or p1). Can you check?

{code}
[tw-mbp13-channapattan hadoop (trunk)]$ git apply -p0 -v 
~/Downloads/YARN-9382.002.patch
Checking patch 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/metrics/ContainerMetricsConstants.java
 => 
b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/metrics/ContainerMetricsConstants.java...
error: 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/metrics/ContainerMetricsConstants.java:
 No such file or directory
Checking patch 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java
 => 
b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java...
error: 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java:
 No such file or directory
Checking patch 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java
 => 
b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java...
error: 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java:
 No such file or directory
[tw-mbp13-channapattan hadoop (trunk)]$

{code}

{code}

[tw-mbp13-channapattan hadoop (trunk)]$ git apply -p1 -v 
~/Downloads/YARN-9382.002.patch
Checking patch 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/metrics/ContainerMetricsConstants.java...
Checking patch 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java...
Hunk #2 succeeded at 255 (offset -7 lines).
error: while searching for:
case INIT_CONTAINER:
  publishContainerCreatedEvent(event);
  break;

default:
  if (LOG.isDebugEnabled()) {
LOG.debug(event.getType()

error: patch failed: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java:402
error: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java:
 patch does not apply
Checking patch 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java...
error: while searching for:
import org.apache.hadoop.yarn.server.nodemanager.Context;
import 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationContainerFinishedEvent;
import 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container;
import org.apache.hadoop.yarn.util.ResourceCalculatorProcessTree;
import org.junit.Assert;
import org.junit.Test;

error: patch failed: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java:45
error: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java:
 patch does not apply
[tw-mbp13-channapattan hadoop (trunk)]$

{code}


> Publish container killed, paused and resumed events to ATSv2.
> -
>
> Key: YARN-9382
> URL: https://issues.apache.org/jira/browse/YARN-9382
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9382.001.patch, YARN-9382.002.patch
>
>
> There are some events missing in container lifecycle. We need to add support 
> for adding events for when container gets killed, paused and resumed. 



--
This message was sent by Atlassian JIRA

[jira] [Updated] (YARN-9303) Username splits won't help timelineservice.app_flow table

2019-04-03 Thread Vrushali C (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vrushali C updated YARN-9303:
-
Fix Version/s: 3.3.0

> Username splits won't help timelineservice.app_flow table
> -
>
> Key: YARN-9303
> URL: https://issues.apache.org/jira/browse/YARN-9303
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.1.2
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: atsv2, atsv2-hbase
> Fix For: 3.3.0
>
> Attachments: Only_Last_Region_Used.png, YARN-9303-001.patch
>
>
> The timelineservice.app_flow HBase table uses pre-split logic based on username, 
> whereas the row keys start with an inverted timestamp (Long.MAX_VALUE - ts). All 
> data will go to the last region and the remaining regions will never receive 
> any data. We need to choose the right splits or use auto-splitting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9408) @Path("/apps/{appid}/appattempts") error message misleads

2019-04-03 Thread Vrushali C (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809525#comment-16809525
 ] 

Vrushali C commented on YARN-9408:
--

Hmm, so I am trying to understand this error. It looks like it may be thrown at 
this line:
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice-hbase/hadoop-yarn-server-timelineservice-hbase-client/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/reader/AbstractTimelineStorageReader.java#L85

It's because the result set was empty/null. 

Looking at the code, it is trying to look up the flow context for this app id 
and does not find anything. I am wondering if catching all 
NotFoundExceptions is a good idea. Perhaps we can add to the existing exception 
message and enhance it rather than printing out a completely new message. 



> @Path("/apps/{appid}/appattempts") error message misleads
> -
>
> Key: YARN-9408
> URL: https://issues.apache.org/jira/browse/YARN-9408
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
> Attachments: YARN-9408-001.patch, YARN-9408-002.patch
>
>
> The {code} @Path("/apps/{appid}/appattempts") {code} error message is misleading. 
> A NotFoundException "Unable to find the context flow name, and flow run id, and 
> user id" is displayed when app attempts are looked up.
> {code}
> [hbase@yarn-ats-3 ~]$ curl -s 
> "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1553258815132_0004/appattempts?user.name=hbase;
>  | jq .
> {
>   "exception": "NotFoundException",
>   "message": "java.lang.Exception: Unable to find the context flow name, and 
> flow run id, and user id for clusterId=ats, 
> appId=application_1553258815132_0004",
>   "javaClassName": "org.apache.hadoop.yarn.webapp.NotFoundException"
> }
> [hbase@yarn-ats-3 ~]$ curl -s 
> "http://yarn-ats-3:8198/ws/v2/timeline/clusters/ats/apps/application_1553258815132_0005/appattempts?user.name=hbase;
>  | jq .
> {
>   "exception": "NotFoundException",
>   "message": "java.lang.Exception: Unable to find the context flow name, and 
> flow run id, and user id for clusterId=ats, 
> appId=application_1553258815132_0005",
>   "javaClassName": "org.apache.hadoop.yarn.webapp.NotFoundException"
> }
> [hbase@yarn-ats-3 ~]$ curl -s 
> "http://yarn-ats-3:8198/ws/v2/timeline/clusters/ats1/apps/application_1553258815132_0001/containers/container_e14_1553258815132_0001_01_01?user.name=hbase;
>  | jq .
> {
>   "exception": "NotFoundException",
>   "message": "java.lang.Exception: Unable to find the context flow name, and 
> flow run id, and user id for clusterId=ats1, 
> appId=application_1553258815132_0001",
>   "javaClassName": "org.apache.hadoop.yarn.webapp.NotFoundException"
> }
> [hbase@yarn-ats-3 ~]$ curl -s 
> "http://yarn-ats-3:8198/ws/v2/timeline/clusters/ats1/apps/application_1553258815132_0001/appattempts/appattempt_1553258815132_0001_01/containers?user.name=hbase;
>  | jq .
> {
>   "exception": "NotFoundException",
>   "message": "java.lang.Exception: Unable to find the context flow name, and 
> flow run id, and user id for clusterId=ats1, 
> appId=application_1553258815132_0001",
>   "javaClassName": "org.apache.hadoop.yarn.webapp.NotFoundException"
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9403) GET /apps/{appid}/entities/YARN_APPLICATION accesses application table instead of entity table

2019-04-03 Thread Vrushali C (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809526#comment-16809526
 ] 

Vrushali C commented on YARN-9403:
--

I am not sure I understand the issue correctly. YARN_APPLICATION entities are 
written to the application table, no? If so, why do we need to go to the entity 
table? Is there any information missing from the response that was expected? 

> GET /apps/{appid}/entities/YARN_APPLICATION accesses application table 
> instead of entity table
> --
>
> Key: YARN-9403
> URL: https://issues.apache.org/jira/browse/YARN-9403
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9403-001.patch, YARN-9403-002.patch, 
> YARN-9403-003.patch, YARN-9403-004.patch
>
>
> {noformat}"GET /apps/{appid}/entities/YARN_APPLICATION"{noformat} accesses 
> application table instead of entity table. As per the doc, With this API, you 
> can query generic entities identified by cluster ID, application ID and 
> per-framework entity type. But it also provides all the apps when entityType 
> is set to YARN_APPLICATION. It should only access Entity Table through 
> {{GenericEntityReader}}.
> Wrong Output: With YARN_APPLICATION entityType, all applications listed from 
> application tables.
> {code}
> [hbase@yarn-ats-3 centos]$ curl -s 
> "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1553258815132_0002/entities/YARN_APPLICATION?user.name=hbase=hbase=word%20count;
>  | jq .
> [
>   {
> "metrics": [],
> "events": [],
> "createdtime": 1553258922721,
> "idprefix": 0,
> "isrelatedto": {},
> "relatesto": {},
> "info": {
>   "UID": "ats!application_1553258815132_0002",
>   "FROM_ID": "ats!hbase!word 
> count!1553258922721!application_1553258815132_0002"
> },
> "configs": {},
> "type": "YARN_APPLICATION",
> "id": "application_1553258815132_0002"
>   },
>   {
> "metrics": [],
> "events": [],
> "createdtime": 1553258825918,
> "idprefix": 0,
> "isrelatedto": {},
> "relatesto": {},
> "info": {
>   "UID": "ats!application_1553258815132_0001",
>   "FROM_ID": "ats!hbase!word 
> count!1553258825918!application_1553258815132_0001"
> },
> "configs": {},
> "type": "YARN_APPLICATION",
> "id": "application_1553258815132_0001"
>   }
> ]
> {code}
> Right output: with the correct entity type (MAPREDUCE_JOB), it accesses the 
> entity table for the given applicationId and entityType.
> {code}
> [hbase@yarn-ats-3 centos]$ curl -s 
> "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1553258815132_0002/entities/MAPREDUCE_JOB?user.name=hbase=hbase=word%20count;
>  | jq .
> [
>   {
> "metrics": [],
> "events": [],
> "createdtime": 1553258926667,
> "idprefix": 0,
> "isrelatedto": {},
> "relatesto": {},
> "info": {
>   "UID": 
> "ats!application_1553258815132_0002!MAPREDUCE_JOB!0!job_1553258815132_0002",
>   "FROM_ID": "ats!hbase!word 
> count!1553258922721!application_1553258815132_0002!MAPREDUCE_JOB!0!job_1553258815132_0002"
> },
> "configs": {},
> "type": "MAPREDUCE_JOB",
> "id": "job_1553258815132_0002"
>   }
> ]
> {code}
> Flow Activity and Flow Run tables can also be accessed using similar way.
> {code}
> GET /apps/{appid}/entities/YARN_FLOW_ACTIVITY
> GET /apps/{appid}/entities/YARN_FLOW_RUN
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9335) [atsv2] Restrict the number of elements held in timeline collector when backend is unreachable for async calls

2019-04-03 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809539#comment-16809539
 ] 

Abhishek Modi commented on YARN-9335:
-

Thanks [~vrushalic]. I will check at my end. 

Let me also run complete UTs with patch as I am afraid it can cause some other 
UT failures as we have made writes async.

> [atsv2] Restrict the number of elements held in timeline collector when 
> backend is unreachable for async calls
> --
>
> Key: YARN-9335
> URL: https://issues.apache.org/jira/browse/YARN-9335
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vrushali C
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9335.001.patch, YARN-9335.002.patch, 
> YARN-9335.003.patch
>
>
> For ATSv2 , if the backend is unreachable, the number/size of data held in 
> timeline collector's memory increases significantly. This is not good for the 
> NM memory. 
> Filing jira to set a limit on how many/much should be retained by the 
> timeline collector in memory in case the backend is not reachable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3488) AM get timeline service info from RM rather than Application specific configuration.

2019-04-03 Thread Vrushali C (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809538#comment-16809538
 ] 

Vrushali C commented on YARN-3488:
--

Hi Abhishek
Yes I will try to get to this soon. 

> AM get timeline service info from RM rather than Application specific 
> configuration.
> 
>
> Key: YARN-3488
> URL: https://issues.apache.org/jira/browse/YARN-3488
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: applications
>Reporter: Junping Du
>Assignee: Abhishek Modi
>Priority: Major
>  Labels: YARN-5355
> Attachments: YARN-3488.001.patch, YARN-3488.002.patch, 
> YARN-3488.003.patch
>
>
> Since the v1 timeline service, we have had an MR configuration to enable/disable 
> putting history events into the timeline service. For the ongoing v2 timeline 
> service effort, we currently have different methods/structures between v1 and v2 
> for consuming TimelineClient, so an application has to be aware of which 
> timeline service version is used.
> There are basically two options here:
> The first option is, as currently done in DistributedShell or MR, to let the 
> application have a specific configuration that indicates whether ATS is enabled 
> and which version it is, like MRJobConfig.MAPREDUCE_JOB_EMIT_TIMELINE_DATA, etc.
> The other option is to let the application figure out the timeline-related info 
> from YARN/RM; this can be done through registerApplicationMaster() in 
> ApplicationMasterProtocol with a return value for the service of "off", "v1_on", 
> or "v2_on".
> We prefer the latter option because the application owner doesn't have to be 
> aware of RM/YARN infrastructure details. Please note that we should stay 
> compatible (consistent behavior with the same setting) with released 
> configurations.
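
A rough sketch of what the second option could look like on the AM side (the 
values "off", "v1_on" and "v2_on" come from the description above; the class and 
method names are illustrative, not the actual ApplicationMasterProtocol API):
{code:java}
// Illustrative only: the AM branches on what the RM reports at registration
// instead of reading an application-specific configuration key.
public class TimelineVersionSwitchSketch {
  enum TimelineServiceInfo { OFF, V1_ON, V2_ON }

  static String clientFor(TimelineServiceInfo info) {
    switch (info) {
      case V2_ON: return "use TimelineV2Client";
      case V1_ON: return "use TimelineClient (v1)";
      default:    return "do not publish history events";
    }
  }

  public static void main(String[] args) {
    // In the proposal, this value would come from the
    // registerApplicationMaster() response rather than a config key.
    System.out.println(clientFor(TimelineServiceInfo.V2_ON));
  }
}
{code}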



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9335) [atsv2] Restrict the number of elements held in timeline collector when backend is unreachable for async calls

2019-04-03 Thread Vrushali C (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809528#comment-16809528
 ] 

Vrushali C commented on YARN-9335:
--

Hi Abhishek,
Could you check applying this patch as well? It seems to not work for me. Do 
you see anything incorrect in my command below:

{code}
[tw-mbp13-channapattan hadoop (trunk)]$ git apply -p0 
~/Downloads/YARN-9335.003.patch
error: 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java:
 No such file or directory
error: 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml:
 No such file or directory
error: 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollector.java:
 No such file or directory
error: 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestTimelineCollector.java:
 No such file or directory
[tw-mbp13-channapattan hadoop (trunk)]$ git apply -p1 
~/Downloads/YARN-9335.003.patch
error: patch failed: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollector.java:221
error: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollector.java:
 patch does not apply
[tw-mbp13-channapattan hadoop (trunk)]$
{code}


> [atsv2] Restrict the number of elements held in timeline collector when 
> backend is unreachable for async calls
> --
>
> Key: YARN-9335
> URL: https://issues.apache.org/jira/browse/YARN-9335
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vrushali C
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9335.001.patch, YARN-9335.002.patch, 
> YARN-9335.003.patch
>
>
> For ATSv2 , if the backend is unreachable, the number/size of data held in 
> timeline collector's memory increases significantly. This is not good for the 
> NM memory. 
> Filing jira to set a limit on how many/much should be retained by the 
> timeline collector in memory in case the backend is not reachable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4901) QueueMetrics needs to be cleared before MockRM is initialized

2019-04-03 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-4901:
-
Summary: QueueMetrics needs to be cleared before MockRM is initialized  
(was: MockRM should clear the QueueMetrics when it starts)

> QueueMetrics needs to be cleared before MockRM is initialized
> -
>
> Key: YARN-4901
> URL: https://issues.apache.org/jira/browse/YARN-4901
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Daniel Templeton
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-4901-001.patch
>
>
> The {{ResourceManager}} rightly assumes that when it starts, it's starting 
> from naught.  The {{MockRM}}, however, violates that assumption.  For 
> example, in {{TestNMReconnect}}, each test method creates a new {{MockRM}} 
> instance.  The {{QueueMetrics.queueMetrics}} field is static, which means 
> that when multiple {{MockRM}} instances are created, the {{QueueMetrics}} 
> bleed over.  Having the MockRM clear the {{QueueMetrics}} when it starts 
> should resolve the issue.  I haven't looked yet at scope to see how hard or easy 
> that is to do.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4901) QueueMetrics needs to be cleared before MockRM is initialized

2019-04-03 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808733#comment-16808733
 ] 

Hudson commented on YARN-4901:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #16334 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/16334/])
YARN-4901. QueueMetrics needs to be cleared before MockRM is (sunilg: rev 
002dcc4ebf79bbaa5e603565640d8289991d781f)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java


> QueueMetrics needs to be cleared before MockRM is initialized
> -
>
> Key: YARN-4901
> URL: https://issues.apache.org/jira/browse/YARN-4901
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Daniel Templeton
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-4901-001.patch
>
>
> The {{ResourceManager}} rightly assumes that when it starts, it's starting 
> from naught.  The {{MockRM}}, however, violates that assumption.  For 
> example, in {{TestNMReconnect}}, each test method creates a new {{MockRM}} 
> instance.  The {{QueueMetrics.queueMetrics}} field is static, which means 
> that when multiple {{MockRM}} instances are created, the {{QueueMetrics}} 
> bleed over.  Having the MockRM clear the {{QueueMetrics}} when it starts 
> should resolve the issue.  I haven't looked yet at scope to see how hard or 
> easy that is to do.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9080) Bucket Directories as part of ATS done accumulates

2019-04-03 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808747#comment-16808747
 ] 

Peter Bacsko commented on YARN-9080:


I'd like to join Szilard in suggesting that the code be made more readable.

I can imagine something like this:

{code:java}
RemoteIterator<FileStatus> clustertsIter = list(dirpath);
while (clustertsIter.hasNext()) {
  FileStatus clustertsStat = clustertsIter.next();
  MutableBoolean toBeRemoved = new MutableBoolean();
  MutableBoolean isValid = new MutableBoolean();

  if (clustertsStat.isDirectory()) {
    processClusterTsDir(clustertsStat, toBeRemoved, isValid);
  }
  // ...
}

private void processClusterTsDir(FileStatus fs, MutableBoolean toBeRemoved,
    MutableBoolean isValid) throws IOException {
  Path clustertsPath = fs.getPath();
  RemoteIterator<FileStatus> bucket1Iter = list(clustertsPath);

  while (bucket1Iter.hasNext()) {
    FileStatus bucket1Stat = bucket1Iter.next();
    Path bucket1Path = bucket1Stat.getPath();
    if (bucket1Stat.isDirectory()
        && bucket1Path.getName().matches(bucket1Regex)) {
      processBucket1Dir(bucket1Stat, toBeRemoved, isValid);
    }
  }
}

private void processBucket1Dir(FileStatus fs, MutableBoolean toBeRemoved,
    MutableBoolean isValid) throws IOException {
  // walk through the directories, check the condition, and descend into
  // processBucket2Dir if it holds
  // ...
}

private void processBucket2Dir(FileStatus fs, MutableBoolean toBeRemoved,
    MutableBoolean isValid) throws IOException {
  // walk through the directories, check the condition, and descend into
  // processAppDir if it holds
  // ...
}

private void processAppDir(FileStatus fs, MutableBoolean toBeRemoved,
    MutableBoolean isValid) throws IOException {
  // ...
}
{code}

So basically, each time you descend a level in the directory hierarchy, you 
enter a new method and pass along the state you need later (here the two 
MutableBooleans), so changes made deep in the hierarchy are reflected in the 
outermost call.
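
Just to make the innermost step concrete, a rough sketch (the name check, the 
{{delete()}} helper and the meaning I give to the two booleans are 
placeholders, not the actual patch code):

{code:java}
private void processAppDir(FileStatus fs, MutableBoolean toBeRemoved,
    MutableBoolean isValid) throws IOException {
  Path appDirPath = fs.getPath();
  // A directory whose name looks like an applicationId marks this branch as
  // valid; whether it actually gets removed is driven by toBeRemoved, which
  // the enclosing levels set.
  if (fs.isDirectory()
      && appDirPath.getName().startsWith("application_")) {
    isValid.setValue(true);
    if (toBeRemoved.booleanValue()) {
      delete(appDirPath);   // hypothetical helper, not the patch's actual call
    }
  }
}
{code}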

> Bucket Directories as part of ATS done accumulates
> --
>
> Key: YARN-9080
> URL: https://issues.apache.org/jira/browse/YARN-9080
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: 0001-YARN-9080.patch, 0002-YARN-9080.patch, 
> 0003-YARN-9080.patch, YARN-9080-004.patch, YARN-9080-005.patch, 
> YARN-9080-006.patch
>
>
> Have observed older bucket directories cluster_timestamp, bucket1 and bucket2 
> as part of ATS done accumulates. The cleanLogs part of EntityLogCleaner 
> removes only the app directories and not the bucket directories.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9080) Bucket Directories as part of ATS done accumulates

2019-04-03 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808714#comment-16808714
 ] 

Szilard Nemeth commented on YARN-9080:
--

Hi [~Prabhu Joseph]!

Here are my comments: 
1. The depth of the nested while / if statements makes the code very hard to 
read and increases its cyclomatic complexity 
(https://en.wikipedia.org/wiki/Cyclomatic_complexity).
First of all, I would extract the logic into some private methods.

Essentially, the pseudo-code of the algorithm is this (see the sketch after 
the list): 
1. Loop over the list of files under dirPath.
2. If a file is a directory, we should do something with that dir; let's call 
it "dir1".
3. We loop over the files under "dir1" (bucket1Iter).
4. If a file is a directory and it matches bucket1Regex, we iterate over the 
files under it (bucket2Iter).
5. If a file matches bucket2Regex, then we have a valid dir.
6. If there are files under this dir, we loop over those.
7. If we find a directory and it's a valid applicationId, we invoke delete.
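
Schematically, the nesting looks something like this (a sketch only, not the 
patch's actual code; {{list()}}, {{isValidAppId()}}, {{delete()}} and the regex 
fields are placeholders):

{code:java}
private void cleanLogs(Path dirPath) throws IOException {
  RemoteIterator<FileStatus> clusterTsIter = list(dirPath);                // 1.
  while (clusterTsIter.hasNext()) {
    FileStatus clusterTsStat = clusterTsIter.next();
    if (clusterTsStat.isDirectory()) {                                     // 2.
      RemoteIterator<FileStatus> bucket1Iter = list(clusterTsStat.getPath());
      while (bucket1Iter.hasNext()) {                                      // 3.
        FileStatus bucket1Stat = bucket1Iter.next();
        if (bucket1Stat.isDirectory()
            && bucket1Stat.getPath().getName().matches(bucket1Regex)) {    // 4.
          RemoteIterator<FileStatus> bucket2Iter = list(bucket1Stat.getPath());
          while (bucket2Iter.hasNext()) {
            FileStatus bucket2Stat = bucket2Iter.next();
            if (bucket2Stat.getPath().getName().matches(bucket2Regex)) {   // 5.
              RemoteIterator<FileStatus> appIter = list(bucket2Stat.getPath());
              while (appIter.hasNext()) {                                  // 6.
                FileStatus appStat = appIter.next();
                if (appStat.isDirectory()
                    && isValidAppId(appStat.getPath().getName())) {
                  delete(appStat.getPath());                               // 7.
                }
              }
            }
          }
        }
      }
    }
  }
}
{code}

That is four loops and four if-statements deep, which is exactly what makes it 
hard to follow.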

Please try to come up with something more readable and easier to understand. 
I would start by extracting the while-loops into separate methods, then keep 
going until you have reasonably sized chunks of code.


2. I was wondering what the meaning of "clusterts" is and only realized from 
the tests that it is the cluster timestamp. You should either use that name 
(clusterTimeStamp) or clusterTs, but I prefer clusterTimeStamp.

3. Please extract the condition of this if-statement into a method: 
{code:java}
if ((fs.listStatus(bucket2Path).length != 0)
    || (now - bucket2Stat.getModificationTime() <= retainMillis)) {
{code}
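
Something like this would do (just a sketch; pick whatever method name and 
parameters fit the surrounding class best):

{code:java}
// The extracted condition, named after its apparent intent: the bucket still
// has children, or it was modified within the retention window.
private boolean shouldRetainBucket(FileSystem fs, Path bucket2Path,
    FileStatus bucket2Stat, long now, long retainMillis) throws IOException {
  return fs.listStatus(bucket2Path).length != 0
      || now - bucket2Stat.getModificationTime() <= retainMillis;
}
{code}

Then the if-statement shrinks to 
{{if (shouldRetainBucket(fs, bucket2Path, bucket2Stat, now, retainMillis))}}.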

Please let me know if you are ready and I will check again, thanks!




> Bucket Directories as part of ATS done accumulates
> --
>
> Key: YARN-9080
> URL: https://issues.apache.org/jira/browse/YARN-9080
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: 0001-YARN-9080.patch, 0002-YARN-9080.patch, 
> 0003-YARN-9080.patch, YARN-9080-004.patch, YARN-9080-005.patch, 
> YARN-9080-006.patch
>
>
> Have observed older bucket directories cluster_timestamp, bucket1 and bucket2 
> as part of ATS done accumulates. The cleanLogs part of EntityLogCleaner 
> removes only the app directories and not the bucket directories.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org