[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-13 Thread zhuqi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863712#comment-16863712
 ] 

zhuqi commented on YARN-8995:
-

cc  [~Tao Yang]

Thanks [~Tao Yang] for your comment and persuasive test result.

I have now changed my code in the new patch. Since there is no serviceInit 
method, I initialize my configuration in the constructor.
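
For illustration, a rough sketch of the approach (the class, field, method and 
configuration key names here are hypothetical, not the ones in the patch):

{code:java}
// Hypothetical sketch only -- reads the threshold in the constructor (no
// serviceInit available) and logs the pending event types once the queue
// size crosses a multiple of that threshold.
// Assumed imports: java.util.Map, java.util.concurrent.BlockingQueue,
// java.util.stream.Collectors, org.apache.hadoop.conf.Configuration,
// org.apache.hadoop.yarn.event.Event; LOG is an existing logger.
private final int logThreshold;

MyEventQueueChecker(Configuration conf) {
  this.logThreshold = conf.getInt(
      "yarn.dispatcher.print-events-info.threshold", 5000); // hypothetical key/default
}

void checkQueueSize(BlockingQueue<Event> eventQueue) {
  int qSize = eventQueue.size();
  if (qSize > 0 && qSize % logThreshold == 0) {
    // Count pending events per type so the log (and later a metric) shows
    // which event type is piling up.
    Map<String, Long> countsPerType = eventQueue.stream()
        .collect(Collectors.groupingBy(e -> e.getType().toString(),
            Collectors.counting()));
    LOG.info("Event queue size is " + qSize
        + ", pending event types: " + countsPerType);
  }
}
{code}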

 

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it is necessary to 
> log the event types when the event queue size grows too big, add that 
> information to the metrics, and make the queue-size threshold a configurable 
> parameter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-13 Thread zhuqi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-8995:

Attachment: YARN-8995.004.patch

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it is necessary to 
> log the event types when the event queue size grows too big, add that 
> information to the metrics, and make the queue-size threshold a configurable 
> parameter.






[jira] [Updated] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered

2019-06-13 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9623:
---
Summary: Auto adjust max queue length of app activities to make sure 
activities on all nodes can be covered  (was: Auto adjust queue length of app 
activities to make sure activities on all nodes can be covered)

> Auto adjust max queue length of app activities to make sure activities on all 
> nodes can be covered
> --
>
> Key: YARN-9623
> URL: https://issues.apache.org/jira/browse/YARN-9623
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>
> Currently we can use the configuration entry 
> "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to 
> control the max queue length of app activities, but in some scenarios this 
> configuration may need to be updated as the cluster grows. Moreover, it is 
> better for users to be able to ignore that configuration, therefore it should 
> be auto-adjusted internally.
>  There are some differences among the scheduling modes:
>  * multi-node placement disabled
>  ** Heartbeat-driven scheduling: the max queue length of app activities should 
> not be less than the number of nodes; since nodes cannot always be in order, 
> we should leave some room for misordering, for example by guaranteeing that 
> the max queue length is at least 1.2 * numNodes.
>  ** Async scheduling: every async scheduling thread goes through all nodes in 
> order, so in this mode we should guarantee that the max queue length is 
> numThreads * numNodes.
>  * multi-node placement enabled: activities on all nodes can be involved in a 
> single app allocation, therefore there is no need to adjust for this mode.
> To sum up, we can adjust the max queue length of app activities like this:
> {code}
> int configuredMaxQueueLength;
> int maxQueueLength;
> serviceInit() {
>   ...
>   configuredMaxQueueLength = ...; // read configured max queue length
>   maxQueueLength = configuredMaxQueueLength; // take configured value as default
> }
> CleanupThread#run() {
>   ...
>   if (multiNodeDisabled) {
>     if (asyncSchedulingEnabled) {
>       maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * numNodes);
>     } else {
>       maxQueueLength = max(configuredMaxQueueLength, 1.2 * numNodes);
>     }
>   }
> }
> {code}






[jira] [Created] (YARN-9623) Auto adjust queue length of app activities to make sure activities on all nodes can be covered

2019-06-13 Thread Tao Yang (JIRA)
Tao Yang created YARN-9623:
--

 Summary: Auto adjust queue length of app activities to make sure 
activities on all nodes can be covered
 Key: YARN-9623
 URL: https://issues.apache.org/jira/browse/YARN-9623
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tao Yang
Assignee: Tao Yang


Currently we can use the configuration entry 
"yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to 
control the max queue length of app activities, but in some scenarios this 
configuration may need to be updated as the cluster grows. Moreover, it is 
better for users to be able to ignore that configuration, therefore it should 
be auto-adjusted internally.
 There are some differences among the scheduling modes:
 * multi-node placement disabled
 ** Heartbeat-driven scheduling: the max queue length of app activities should 
not be less than the number of nodes; since nodes cannot always be in order, we 
should leave some room for misordering, for example by guaranteeing that the 
max queue length is at least 1.2 * numNodes.
 ** Async scheduling: every async scheduling thread goes through all nodes in 
order, so in this mode we should guarantee that the max queue length is 
numThreads * numNodes.
 * multi-node placement enabled: activities on all nodes can be involved in a 
single app allocation, therefore there is no need to adjust for this mode.

To sum up, we can adjust the max queue length of app activities like this:
{code}
int configuredMaxQueueLength;
int maxQueueLength;
serviceInit() {
  ...
  configuredMaxQueueLength = ...; // read configured max queue length
  maxQueueLength = configuredMaxQueueLength; // take configured value as default
}
CleanupThread#run() {
  ...
  if (multiNodeDisabled) {
    if (asyncSchedulingEnabled) {
      maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * numNodes);
    } else {
      maxQueueLength = max(configuredMaxQueueLength, 1.2 * numNodes);
    }
  }
}
{code}






[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2019-06-13 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863691#comment-16863691
 ] 

Tao Yang commented on YARN-9567:


Thanks [~cheersyang]. Attached v1 patch for review.

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, image-2019-06-04-17-29-29-368.png, 
> image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png, 
> image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to understand why those requests are outstanding if 
> we can show this app's diagnostics next to them.
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering collection, 
> which may bring uncontrollable risks.
> For example:
> (1) At first no diagnostic is shown below the outstanding requests, because 
> app activities have not been triggered and cached yet.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see the 
> diagnostics.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  






[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2019-06-13 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9567:
---
Attachment: YARN-9567.001.patch

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, image-2019-06-04-17-29-29-368.png, 
> image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png, 
> image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to understand why those requests are outstanding if 
> we can show this app's diagnostics next to them.
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering collection, 
> which may bring uncontrollable risks.
> For example:
> (1) At first no diagnostic is shown below the outstanding requests, because 
> app activities have not been triggered and cached yet.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see the 
> diagnostics.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  






[jira] [Comment Edited] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2019-06-13 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855530#comment-16855530
 ] 

Tao Yang edited comment on YARN-9567 at 6/14/19 3:24 AM:
-

Some updates about this issue:
 # Support summarizing app activities on nodes across multiple scheduling 
processes to get comprehensive information for better debugging, based on 
YARN-9578.
 # Support partial refresh on the app attempt page, so that we have two ways to 
get diagnostics:
 ** When the app attempt page is refreshed, query and show activities directly 
from the cache.
 ** When the refresh button is clicked, update activities immediately and show 
them after about 2 seconds.
 # Diagnostics information can be classified into 3 levels (request, app and 
scheduler activities).
 ** Request level !image-2019-06-04-17-29-29-368.png|width=1287,height=90!
 ** App level !image-2019-06-04-17-31-31-820.png|width=648,height=63!
 ** Scheduler activities level (if app diagnostics cannot be found, all nodes in 
the scheduling process are shown from scheduler activities for debugging) 
!image-2019-06-14-11-21-41-066.png|width=891,height=159!

Please feel free to give your suggestions!

I will attach the patch after its dependency YARN-9578 is resolved.


was (Author: tao yang):
Some updates about this issue:
 # Support summarizing app activities on nodes in multiple scheduling processes 
to get the comprehensive information for better debugging, based on YARN-9578.
 # Support partial refresh on app attempt page, so that we have two ways to get 
diagnostics:
 ** When refresh the app attempt page, query and show activities directly from 
cache.
 ** When click the refresh button, update activities immediately and get 
activities and show them after about 2 seconds.
 # Diagnostics information can be classified to 3 levels (request, app and 
scheduler activities).
 ** Request level !image-2019-06-04-17-29-29-368.png|width=1287,height=90!
 ** App level !image-2019-06-04-17-31-31-820.png|width=648,height=63!
 ** Scheduler activities level 
!image-2019-06-04-17-58-11-886.png|width=731,height=121!

Please feel free to give your suggestions! 

I will attach the patch after its dependency issue YARN-9578 resolved.

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: image-2019-06-04-17-29-29-368.png, 
> image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png, 
> image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to understand why those requests are outstanding if 
> we can show this app's diagnostics next to them.
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering collection, 
> which may bring uncontrollable risks.
> For example:
> (1) At first no diagnostic is shown below the outstanding requests, because 
> app activities have not been triggered and cached yet.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see the 
> diagnostics.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  






[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2019-06-13 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9567:
---
Attachment: (was: image-2019-06-14-11-14-31-874.png)

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: image-2019-06-04-17-29-29-368.png, 
> image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png, 
> image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to understand why those requests are outstanding if 
> we can show this app's diagnostics next to them.
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering collection, 
> which may bring uncontrollable risks.
> For example:
> (1) At first no diagnostic is shown below the outstanding requests, because 
> app activities have not been triggered and cached yet.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see the 
> diagnostics.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  






[jira] [Commented] (YARN-9619) Transfer error AM host/ip when launching app using docker container with bridge network

2019-06-13 Thread caozhiqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863616#comment-16863616
 ] 

caozhiqiang commented on YARN-9619:
---

Thanks for your comments, [~eyang]. Launching an application with Docker 
containers allows several kinds of networks. The documentation declares that 
both the host network and the bridge network are allowed: [launch with 
docker|https://hadoop.apache.org/docs/r3.1.2/hadoop-yarn/hadoop-yarn-site/DockerContainers.html]
{code:xml}
<property>
  <name>yarn.nodemanager.runtime.linux.docker.allowed-container-networks</name>
  <value>host,none,bridge</value>
  <description>
    Optional. A comma-separated set of networks allowed when launching
    containers. Valid values are determined by Docker networks available from
    `docker network ls`
  </description>
</property>
{code}
With the host network, an AM running in Docker can work well because the AM's 
IP is the same as the NM's.

With the bridge network, I think that if the AM registers the correct host/IP 
(the real Docker container IP, not the NodeManager IP) to the RM, and all 
Hadoop components run in an overlay network, for example by deploying flannel, 
it should also work well.

In an overlay network, a Docker container can communicate bi-directionally with 
any other container or node. So the RM and NMs can also communicate 
bi-directionally with an AM running in Docker. I have verified this.
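
For illustration, the point where an AM reports its address can be sketched 
with the public AMRMClient API (the MapReduce AM uses its own equivalent path; 
conf, rpcPort and trackingUrl are assumed to exist and exception handling is 
omitted):

{code:java}
// Sketch: an AM registering the address it is actually reachable at.
// With a bridge/overlay network this should be the container's own IP,
// not the NodeManager host the container happens to run on.
String amHost = java.net.InetAddress.getLocalHost().getHostAddress();

AMRMClient<AMRMClient.ContainerRequest> amRMClient = AMRMClient.createAMRMClient();
amRMClient.init(conf);
amRMClient.start();
amRMClient.registerApplicationMaster(amHost, rpcPort, trackingUrl);
{code}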

> Transfer error AM host/ip when launching app using docker container with 
> bridge network
> ---
>
> Key: YARN-9619
> URL: https://issues.apache.org/jira/browse/YARN-9619
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: caozhiqiang
>Priority: Major
>
> When launching an application using a Docker container with the bridge network 
> in an overlay network, the client polls the application progress from the 
> ApplicationMaster using the wrong host/IP: it polls the NodeManager's 
> hostname/IP instead of the Docker container IP where the AM is really running. 
> The error message is below (the server hadoop3-1/192.168.2.105 is the NM's 
> address, not the AM's Docker IP, so it cannot be accessed):
> 2019-05-11 08:28:46,361 INFO ipc.Client: Retrying connect to server: 
> hadoop3-1/192.168.2.105:37963. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
> 2019-05-11 08:28:47,363 INFO ipc.Client: Retrying connect to server: 
> hadoop3-1/192.168.2.105:37963. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
> 2019-05-11 08:28:48,365 INFO ipc.Client: Retrying connect to server: 
> hadoop3-1/192.168.2.105:37963. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
> 2019-05-10 08:34:40,235 INFO mapred.ClientServiceDelegate: Application state 
> is completed. FinalApplicationStatus=FAILED. Redirecting to job history server
> 2019-05-10 08:35:00,408 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:12020. Already tried 8 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2019-05-10 08:35:00,408 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:12020. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> java.io.IOException: java.net.ConnectException: Your endpoint configuration 
> is wrong; For more details see: 
> http://wiki.apache.org/hadoop/UnsetHostnameOrPort
>  at 
> org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:345)
>  at 
> org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:430)
>  at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:871)
>  at org.apache.hadoop.mapreduce.Job$1.run(Job.java:331)
>  at org.apache.hadoop.mapreduce.Job$1.run(Job.java:328)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:328)
>  at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:612)
>  at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1629)
>  at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591)
>  at 
> org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:307)
>  at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:360)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>  at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:368)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

[jira] [Commented] (YARN-9619) Transfer error AM host/ip when launching app using docker container with bridge network

2019-06-13 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863509#comment-16863509
 ] 

Eric Yang commented on YARN-9619:
-

[~caozhiqiang] Sorry, I am not entirely sure that I understand the description 
of this problem.  It seems to indicate that a MapReduce workload doesn't work 
with the bridge network in an overlay network.  The YARN framework requires the 
application master to run in the same flat network as the resource manager and 
node manager.  This ensures bi-directional communication between the 
application master and the YARN framework is not blocked.

An overlay network implies some level of privacy from the host network.  An 
overlay network often allows only outbound network access.  By running the 
application master in an overlay network, the resource manager and node manager 
cannot have bi-directional communication with the application master.

I don't think it is possible to run the AM in Docker in the current 
implementation of YARN.

> Transfer error AM host/ip when launching app using docker container with 
> bridge network
> ---
>
> Key: YARN-9619
> URL: https://issues.apache.org/jira/browse/YARN-9619
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: caozhiqiang
>Priority: Major
>
> When launching an application using a Docker container with the bridge network 
> in an overlay network, the client polls the application progress from the 
> ApplicationMaster using the wrong host/IP: it polls the NodeManager's 
> hostname/IP instead of the Docker container IP where the AM is really running. 
> The error message is below (the server hadoop3-1/192.168.2.105 is the NM's 
> address, not the AM's Docker IP, so it cannot be accessed):
> 2019-05-11 08:28:46,361 INFO ipc.Client: Retrying connect to server: 
> hadoop3-1/192.168.2.105:37963. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
> 2019-05-11 08:28:47,363 INFO ipc.Client: Retrying connect to server: 
> hadoop3-1/192.168.2.105:37963. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
> 2019-05-11 08:28:48,365 INFO ipc.Client: Retrying connect to server: 
> hadoop3-1/192.168.2.105:37963. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
> 2019-05-10 08:34:40,235 INFO mapred.ClientServiceDelegate: Application state 
> is completed. FinalApplicationStatus=FAILED. Redirecting to job history server
> 2019-05-10 08:35:00,408 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:12020. Already tried 8 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2019-05-10 08:35:00,408 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:12020. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> java.io.IOException: java.net.ConnectException: Your endpoint configuration 
> is wrong; For more details see: 
> http://wiki.apache.org/hadoop/UnsetHostnameOrPort
>  at 
> org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:345)
>  at 
> org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:430)
>  at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:871)
>  at org.apache.hadoop.mapreduce.Job$1.run(Job.java:331)
>  at org.apache.hadoop.mapreduce.Job$1.run(Job.java:328)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:328)
>  at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:612)
>  at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1629)
>  at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591)
>  at 
> org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:307)
>  at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:360)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>  at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:368)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
>  at 

[jira] [Commented] (YARN-9621) FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1

2019-06-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863472#comment-16863472
 ] 

Hadoop QA commented on YARN-9621:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 25m 
17s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-3.1 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 
11s{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
26s{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
21s{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
35s{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 17s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
48s{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} branch-3.1 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
23s{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch 
failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
24s{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch 
failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 24s{color} 
| {color:red} hadoop-yarn-applications-distributedshell in the patch failed. 
{color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
24s{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch 
failed. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 35s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
24s{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch 
failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
19s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 25s{color} 
| {color:red} hadoop-yarn-applications-distributedshell in the patch failed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 89m 31s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=18.09.5 Server=18.09.5 Image:yetus/hadoop:080e9d0f9b3 |
| JIRA Issue | YARN-9621 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12971721/YARN-9621-branch-3.1.001.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 5f9d4c9da06f 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 
08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-3.1 / fee1e67 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| mvninstall | 
https://builds.apache.org/job/PreCommit-YARN-Build/24266/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-applications-distributedshell.txt
 |
| compile | 

[jira] [Commented] (YARN-8856) TestTimelineReaderWebServicesHBaseStorage tests failing with NoClassDefFoundError

2019-06-13 Thread JIRA


[ 
https://issues.apache.org/jira/browse/YARN-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863471#comment-16863471
 ] 

Íñigo Goiri commented on YARN-8856:
---

[~Prabhu Joseph], backported to branch-3.2.

> TestTimelineReaderWebServicesHBaseStorage tests failing with 
> NoClassDefFoundError
> -
>
> Key: YARN-8856
> URL: https://issues.apache.org/jira/browse/YARN-8856
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Sushil Ks
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-8856.001.patch
>
>
> TestTimelineReaderWebServicesHBaseStorage has been failing in nightly builds 
> with NoClassDefFoundError in the tests.  Sample error and stacktrace to 
> follow.






[jira] [Commented] (YARN-9621) FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1

2019-06-13 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863409#comment-16863409
 ] 

Prabhu Joseph commented on YARN-9621:
-

{{TestDSWithMultipleNodeManager}} does not have a tearDown method. This causes 
all testcases to use the same {{MiniYarnCluster}} and to conflict with each 
other while calculating the actual containers launched on a node by 
{{NMContainerMonitor}}. This issue is present only in branch-3.1, as YARN-9252 
added the tearDown to branch-3.2 and 3.3.
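
A minimal sketch of what the missing tearDown could look like (assuming the 
test keeps its cluster in a field, here called yarnCluster; the actual field 
name and cleanup in the patch may differ):

{code:java}
// Hypothetical sketch: stop the per-test MiniYARNCluster so the next testcase
// starts from a fresh cluster instead of reusing this one.
// Assumed imports: org.junit.After, org.apache.hadoop.yarn.server.MiniYARNCluster
@After
public void tearDown() {
  if (yarnCluster != null) {
    yarnCluster.stop();
    yarnCluster = null;
  }
}
{code}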



> FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint 
> on branch-3.1
> ---
>
> Key: YARN-9621
> URL: https://issues.apache.org/jira/browse/YARN-9621
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell, test
>Affects Versions: 3.1.2
>Reporter: Peter Bacsko
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9621-branch-3.1.001.patch
>
>
> Testcase 
> {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} 
> seems to constantly fail on branch 3.1. I believe it was introduced by 
> YARN-9253.
> {noformat}
> testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager)
>   Time elapsed: 24.636 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}






[jira] [Updated] (YARN-9621) FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1

2019-06-13 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-9621:

Attachment: YARN-9621-branch-3.1.001.patch

> FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint 
> on branch-3.1
> ---
>
> Key: YARN-9621
> URL: https://issues.apache.org/jira/browse/YARN-9621
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell, test
>Affects Versions: 3.1.2
>Reporter: Peter Bacsko
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9621-branch-3.1.001.patch
>
>
> Testcase 
> {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} 
> seems to constantly fail on branch 3.1. I believe it was introduced by 
> YARN-9253.
> {noformat}
> testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager)
>   Time elapsed: 24.636 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}






[jira] [Updated] (YARN-9621) FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1

2019-06-13 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-9621:

Summary: FIX 
TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on 
branch-3.1  (was: Test failure 
TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on 
branch-3.1)

> FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint 
> on branch-3.1
> ---
>
> Key: YARN-9621
> URL: https://issues.apache.org/jira/browse/YARN-9621
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell, test
>Affects Versions: 3.1.2
>Reporter: Peter Bacsko
>Assignee: Prabhu Joseph
>Priority: Major
>
> Testcase 
> {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} 
> seems to constantly fail on branch 3.1. I believe it was introduced by 
> YARN-9253.
> {noformat}
> testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager)
>   Time elapsed: 24.636 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}






[jira] [Commented] (YARN-9599) TestContainerSchedulerQueuing#testQueueShedding fails intermittently.

2019-06-13 Thread Giovanni Matteo Fumarola (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863345#comment-16863345
 ] 

Giovanni Matteo Fumarola commented on YARN-9599:


Committed to trunk.
Thanks [~elgoiri] for the review and [~abmodi] for the patch.

> TestContainerSchedulerQueuing#testQueueShedding fails intermittently.
> -
>
> Key: YARN-9599
> URL: https://issues.apache.org/jira/browse/YARN-9599
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: YARN-9599.001.patch, YARN-9599.002.patch, 
> YARN-9599.003.patch, YARN-9599.004.patch
>
>
> TestQueueShedding fails intermittently.
> java.lang.AssertionError: expected:<6> but was:<5>
> at org.junit.Assert.fail(Assert.java:88) 
> at org.junit.Assert.failNotEquals(Assert.java:834) 
> at org.junit.Assert.assertEquals(Assert.java:645) 
> at org.junit.Assert.assertEquals(Assert.java:631) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.TestContainerSchedulerQueuing.testQueueShedding(TestContainerSchedulerQueuing.java:775)
>  
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:498) 
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) 
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) 
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) 
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) 
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) 
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363) 
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  
> at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>  
> at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>  
> at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)






[jira] [Updated] (YARN-9599) TestContainerSchedulerQueuing#testQueueShedding fails intermittently.

2019-06-13 Thread Giovanni Matteo Fumarola (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giovanni Matteo Fumarola updated YARN-9599:
---
Fix Version/s: 3.3.0

> TestContainerSchedulerQueuing#testQueueShedding fails intermittently.
> -
>
> Key: YARN-9599
> URL: https://issues.apache.org/jira/browse/YARN-9599
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: YARN-9599.001.patch, YARN-9599.002.patch, 
> YARN-9599.003.patch, YARN-9599.004.patch
>
>
> TestQueueShedding fails intermittently.
> java.lang.AssertionError: expected:<6> but was:<5>
> at org.junit.Assert.fail(Assert.java:88) 
> at org.junit.Assert.failNotEquals(Assert.java:834) 
> at org.junit.Assert.assertEquals(Assert.java:645) 
> at org.junit.Assert.assertEquals(Assert.java:631) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.TestContainerSchedulerQueuing.testQueueShedding(TestContainerSchedulerQueuing.java:775)
>  
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:498) 
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) 
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) 
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) 
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) 
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) 
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363) 
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  
> at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>  
> at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>  
> at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)






[jira] [Commented] (YARN-9599) TestContainerSchedulerQueuing#testQueueShedding fails intermittently.

2019-06-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863343#comment-16863343
 ] 

Hudson commented on YARN-9599:
--

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16739 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/16739/])
YARN-9599. TestContainerSchedulerQueuing#testQueueShedding fails (gifuma: rev 
bcfd22833633e24881891208503971c8ef59d63c)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/scheduler/TestContainerSchedulerQueuing.java


> TestContainerSchedulerQueuing#testQueueShedding fails intermittently.
> -
>
> Key: YARN-9599
> URL: https://issues.apache.org/jira/browse/YARN-9599
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Minor
> Attachments: YARN-9599.001.patch, YARN-9599.002.patch, 
> YARN-9599.003.patch, YARN-9599.004.patch
>
>
> TestQueueShedding fails intermittently.
> java.lang.AssertionError: expected:<6> but was:<5>
> at org.junit.Assert.fail(Assert.java:88) 
> at org.junit.Assert.failNotEquals(Assert.java:834) 
> at org.junit.Assert.assertEquals(Assert.java:645) 
> at org.junit.Assert.assertEquals(Assert.java:631) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.TestContainerSchedulerQueuing.testQueueShedding(TestContainerSchedulerQueuing.java:775)
>  
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:498) 
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) 
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) 
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) 
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) 
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) 
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363) 
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  
> at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>  
> at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>  
> at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)






[jira] [Commented] (YARN-8499) ATS v2 Generic TimelineStorageMonitor

2019-06-13 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863324#comment-16863324
 ] 

Prabhu Joseph commented on YARN-8499:
-

Thanks [~snemeth] for the review. [~eyang], can you review this Jira when you 
get time? It makes the timeline storage monitor generic.
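
For illustration, a rough sketch of what a generic monitoring hook could look 
like (the names here are illustrative only, not the actual API in the patch):

{code:java}
// Hypothetical sketch: each timeline storage backend exposes a cheap health
// probe, so a single generic TimelineStorageMonitor can poll any store
// (HBase, filesystem, ...) without knowing its internals.
public interface TimelineStorageHealthCheck {
  /**
   * Probe the underlying store, e.g. by issuing a lightweight read, and throw
   * an exception if it is unreachable.
   */
  void healthCheck() throws Exception;
}
{code}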

> ATS v2 Generic TimelineStorageMonitor
> -
>
> Key: YARN-8499
> URL: https://issues.apache.org/jira/browse/YARN-8499
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Reporter: Sunil Govindan
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: atsv2
> Attachments: YARN-8499-001.patch, YARN-8499-002.patch, 
> YARN-8499-003.patch, YARN-8499-004.patch, YARN-8499-005.patch, 
> YARN-8499-006.patch, YARN-8499-007.patch, YARN-8499-008.patch, 
> YARN-8499-009.patch, YARN-8499-010.patch, YARN-8499-011.patch, 
> YARN-8499-012.patch
>
>
> Post YARN-8302, HBase connection issues are handled in ATSv2. However, this 
> could be made generic by introducing an API in the storage interface and 
> implementing it in each storage backend according to that store's semantics.
>  
> cc [~rohithsharma] [~vinodkv] [~vrushalic]






[jira] [Assigned] (YARN-9525) IFile format is not working against s3a remote folder

2019-06-13 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned YARN-9525:
--

Assignee: Adam Antal  (was: Peter Bacsko)

> IFile format is not working against s3a remote folder
> -
>
> Key: YARN-9525
> URL: https://issues.apache.org/jira/browse/YARN-9525
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch, 
> YARN-9525.002.patch, YARN-9525.003.patch, YARN-9525.004.patch
>
>
> Using the IndexedFileFormat with {{yarn.nodemanager.remote-app-log-dir}} 
> configured to an s3a URI throws the following exception during log 
> aggregation:
> {noformat}
> Cannot create writer for app application_1556199768861_0001. Skip log upload 
> this time. 
> java.io.IOException: java.io.FileNotFoundException: No such file or 
> directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195)
>   ... 7 more
> {noformat}
> This stack trace points to 
> {{LogAggregationIndexedFileController$initializeWriter}}, where we do the 
> following steps (in a non-rolling log aggregation setup):
> - create an FSDataOutputStream
> - write out a UUID
> - flush
> - immediately after that, call getFileStatus to get the length of the log 
> file (the bytes we just wrote out), and that's where the failure happens: 
> the file is not there yet due to eventual consistency.
> Maybe we can get rid of that, so we can use the IFile format against an s3a target.
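
For reference, the pattern above boils down to something like the following minimal 
sketch (illustrative only - the class name and command-line path handling are made up, 
and this is not the controller's actual code):
{noformat}
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class EventualConsistencySketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical remote log file, e.g. an s3a:// URI passed on the command line.
    Path logFile = new Path(args[0]);
    FileContext fc = FileContext.getFileContext(new Configuration());

    // 1. create the output stream, 2. write a UUID, 3. flush ...
    FSDataOutputStream out = fc.create(logFile,
        EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE));
    out.write(UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8));
    out.flush();

    // 4. ... and immediately ask for the file status. On HDFS this works; on an
    // eventually consistent store the file may not be visible yet, so this call
    // can throw FileNotFoundException, which is what the stack trace shows.
    long len = fc.getFileStatus(logFile).getLen();
    System.out.println("length = " + len);
    out.close();
  }
}
{noformat}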



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8856) TestTimelineReaderWebServicesHBaseStorage tests failing with NoClassDefFoundError

2019-06-13 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863287#comment-16863287
 ] 

Prabhu Joseph commented on YARN-8856:
-

[~Sushil-K-S] [~elgoiri] The testcases are failing in branch-3.2 as well; can 
we backport this patch to branch-3.2? The patch works fine on branch-3.2.

> TestTimelineReaderWebServicesHBaseStorage tests failing with 
> NoClassDefFoundError
> -
>
> Key: YARN-8856
> URL: https://issues.apache.org/jira/browse/YARN-8856
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Sushil Ks
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-8856.001.patch
>
>
> TestTimelineReaderWebServicesHBaseStorage has been failing in nightly builds 
> with NoClassDefFoundError in the tests.  Sample error and stacktrace to 
> follow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2

2019-06-13 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-9622.
-
Resolution: Duplicate

> All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
> -
>
> Key: YARN-9622
> URL: https://issues.apache.org/jira/browse/YARN-9622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver, timelineservice
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Prabhu Joseph
>Priority: Major
>
> When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, 
> the result is the following:
> {noformat}
> [ERROR] Failures: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Bad Request
> [ERROR] Errors: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunApps:1984->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> 

[jira] [Commented] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2

2019-06-13 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863280#comment-16863280
 ] 

Prabhu Joseph commented on YARN-9622:
-

[~pbacsko] This issue is fixed by YARN-8856 in trunk. It also needs to be 
backported to branch-3.2.

> All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
> -
>
> Key: YARN-9622
> URL: https://issues.apache.org/jira/browse/YARN-9622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver, timelineservice
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Prabhu Joseph
>Priority: Major
>
> When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, 
> the result is the following:
> {noformat}
> [ERROR] Failures: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Bad Request
> [ERROR] Errors: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> 

[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder

2019-06-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863230#comment-16863230
 ] 

Hadoop QA commented on YARN-9525:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
38s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 37s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 21s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: The patch generated 2 new + 
9 unchanged - 0 fixed = 11 total (was 9) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 14s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m  
2s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
37s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 59m 31s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=18.09.5 Server=18.09.5 Image:yetus/hadoop:bdbca0e53b4 |
| JIRA Issue | YARN-9525 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12971702/YARN-9525.004.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux a3fa58851c51 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 
08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 940bcf0 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/24265/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24265/testReport/ |
| Max. process+thread count | 309 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common U: 

[jira] [Commented] (YARN-6055) ContainersMonitorImpl need be adjusted when NM resource changed.

2019-06-13 Thread Pradeep Ambati (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863228#comment-16863228
 ] 

Pradeep Ambati commented on YARN-6055:
--

Patch looks good. +1

> ContainersMonitorImpl need be adjusted when NM resource changed.
> 
>
> Key: YARN-6055
> URL: https://issues.apache.org/jira/browse/YARN-6055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, scheduler
>Reporter: Junping Du
>Assignee: Íñigo Goiri
>Priority: Major
> Attachments: YARN-6055.000.patch, YARN-6055.001.patch, 
> YARN-6055.002.patch, YARN-6055.003.patch, YARN-6055.004.patch
>
>
> Per Ravi's comments in YARN-4832, we need to check some limits in 
> containerMonitorImpl to make sure it get updated also when Resource updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder

2019-06-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863207#comment-16863207
 ] 

Peter Bacsko commented on YARN-9525:


That's a nice finding, [~adam.antal]. Looks like I oversimplified it a bit 
in the POC.

> IFile format is not working against s3a remote folder
> -
>
> Key: YARN-9525
> URL: https://issues.apache.org/jira/browse/YARN-9525
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch, 
> YARN-9525.002.patch, YARN-9525.003.patch, YARN-9525.004.patch
>
>
> Using the IndexedFileFormat {{yarn.nodemanager.remote-app-log-dir}} 
> configured to an s3a URI throws the following exception during log 
> aggregation:
> {noformat}
> Cannot create writer for app application_1556199768861_0001. Skip log upload 
> this time. 
> java.io.IOException: java.io.FileNotFoundException: No such file or 
> directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195)
>   ... 7 more
> {noformat}
> This stack trace points to 
> {{LogAggregationIndexedFileController$initializeWriter}}, where we do the 
> following steps (in a non-rolling log aggregation setup):
> - create an FSDataOutputStream
> - write out a UUID
> - flush
> - immediately after that, call getFileStatus to get the length of the log 
> file (the bytes we just wrote out), and that's where the failure happens: 
> the file is not there yet due to eventual consistency.
> Maybe we can get rid of that, so we can use the IFile format against an s3a target.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition

2019-06-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863205#comment-16863205
 ] 

Hadoop QA commented on YARN-9209:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 30s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 27s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 1 new + 11 unchanged - 0 fixed = 12 total (was 11) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 32s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 87m 
51s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}141m 36s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=18.09.5 Server=18.09.5 Image:yetus/hadoop:bdbca0e53b4 |
| JIRA Issue | YARN-9209 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12956277/YARN-9209.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 633fbf6b1513 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 
08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 940bcf0 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/24264/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24264/testReport/ |
| Max. process+thread count | 868 

[jira] [Updated] (YARN-9525) IFile format is not working against s3a remote folder

2019-06-13 Thread Adam Antal (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-9525:
-
Attachment: YARN-9525.004.patch

> IFile format is not working against s3a remote folder
> -
>
> Key: YARN-9525
> URL: https://issues.apache.org/jira/browse/YARN-9525
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch, 
> YARN-9525.002.patch, YARN-9525.003.patch, YARN-9525.004.patch
>
>
> Using the IndexedFileFormat {{yarn.nodemanager.remote-app-log-dir}} 
> configured to an s3a URI throws the following exception during log 
> aggregation:
> {noformat}
> Cannot create writer for app application_1556199768861_0001. Skip log upload 
> this time. 
> java.io.IOException: java.io.FileNotFoundException: No such file or 
> directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195)
>   ... 7 more
> {noformat}
> This stack trace points to 
> {{LogAggregationIndexedFileController$initializeWriter}}, where we do the 
> following steps (in a non-rolling log aggregation setup):
> - create an FSDataOutputStream
> - write out a UUID
> - flush
> - immediately after that, call getFileStatus to get the length of the log 
> file (the bytes we just wrote out), and that's where the failure happens: 
> the file is not there yet due to eventual consistency.
> Maybe we can get rid of that, so we can use the IFile format against an s3a target.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder

2019-06-13 Thread Adam Antal (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863169#comment-16863169
 ] 

Adam Antal commented on YARN-9525:
--

Sorry for the delayed answer; let me recap my current progress.

I ran the integration tests multiple times in every scenario just to get a decent 
understanding of what we're dealing with. The tests were passing against the remote 
folder in S3, so I thought the patch was OK, but I also checked the existing 
behaviour (the HDFS remote app dir case) - as per [~wangda]'s last comment.

Though IFile is reported to succeed in aggregating logs in those scenarios, 
during rolling log aggregation I have problems accessing the logs 
through the logs CLI (reading through the associated file controller). It does 
not display any error, it just returns the wrong parts of the log - in my case, I ran 
a sleep job in the child container and its logs get mixed up with the AM's logs 
when I try to read them.

I compiled some debug messages into the hadoop-yarn-common jar and ran the 
tests again. It seems that the offset was miscalculated (due to the patch, 
obviously), and in the case of the regular HDFS remote dir, when we read back the 
logs we try to read them with a wrong offset in the aggregated file, so the logs 
get messed up. The length was OK, though (it tried to read the correct 
number of bytes, but starting from a bad position).
 The funny thing is that the patch works excellently against s3a, so I had to 
dig a bit further and found the following:

Pre-patch, when:
 - an HDFS path is set as the remote app folder
 - we're in a rolling log aggregation situation
 - there was already a rolling session
 during the next rolling session there is no rollover (if the file is not big 
enough), so no new file is generated. Meanwhile, a new OutputStream 
will be created targeting the existing file in append mode, but this time the 
"cursor" will point to the end of the file. Detecting this (after writing the 
dummyBytes, flushing, and checking the just-written bytes), the currentOffset 
will be set to 0.

After applying the patch: 
 Again, there is no rollover, hence the local boolean variable createdNew will be 
set to false. Thus the currentOffset will be set according to the following 
piece of code:
{noformat}
currentOffSet = fc.getFileStatus(aggregatedLogFile).getLen();
{noformat}
which is wrong - it has to be zero, as before. The "cursor" still points to the 
end of the file, while the code thinks it also has to be pushed/offset by 
the current length of the file.
 That offset will be written to the index part, so when we read the file 
back, we display the wrong bytes, shifted by that many bytes.

The solution is simple: for cloud remote app folders the rollover size will be set 
to 0 (see related jira: YARN-9607), so a new file will always be created. 
(This is unavoidable since append is not available.)
 So we should first check whether createdNew is true, and only touch 
getFileStatus if it's false (see the sketch after this list):
 - if there's no append we're fine, because a new file will always be created, 
thus the boolean will always be true, and the offset will always be zero 
(we start writing from the beginning of the new empty file in every rollover session)
 - if there is append, we fall back to the currently existing behaviour: if 
createdNew is true, then we're good; if it's not, then we default to the 
existing behaviour.
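
To make the extra check concrete, here is a minimal sketch of the idea (illustrative 
only, not the actual patch - createdNew, fc and aggregatedLogFile follow the names 
used above, and existingBehaviourOffset() is a hypothetical stand-in for the 
pre-patch offset detection logic, not a real method in the controller):
{noformat}
// Illustrative sketch only, not the actual patch. existingBehaviourOffset() is a
// hypothetical stand-in for the pre-patch offset detection logic.
long currentOffSet = 0;               // a newly created file always starts at offset 0
if (!createdNew) {
  // Only when appending to an already existing aggregated file do we fall back
  // to the existing behaviour to figure out where the previous session ended.
  currentOffSet = existingBehaviourOffset(fc, aggregatedLogFile);
}
{noformat}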

Uploaded a new patch which addresses the comment above (actually it's just an 
extra if), and I hope this investigation is clear and makes sense.
 Setting the rollover size to zero for non-appendable filesystems will be addressed 
in YARN-9607, but this patch makes sense without that, so the issues do not 
depend on each other.

Reacting to [~ste...@apache.org]'s and [~tmarquardt]'s comments:
{quote}Good point. Would it actually be possible to pull this out into 
something you could actually make a standalone test against a filesystem?{quote}
Well, it seems that it can hardly be modularised that way, so simply 
"extracting a few lines of code" for a test is not really applicable.
I can see a possible solution though: re-reading the code, collecting all 
the prerequisites or implicit assumptions that IFile relies on, and putting them 
into an FSContract-based test. Is that what you were originally thinking?

{quote}getPos does seem a better strategy here. Adam: what do you think?{quote}
It makes sense to change this (use getPos), but I don't know how it would alter 
the existing behaviour (HDFS). I will test that as well, but I was pretty occupied 
figuring out the above.

It seems HDFS is a bit hardwired into this, but at this point my integration 
tests are passing, which is a good sign.

Please review if you can spare some time, and ask any questions you may 
have - I will try to clarify them.

> IFile format is not working against s3a remote folder
> 

[jira] [Assigned] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2

2019-06-13 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-9622:
---

Assignee: Prabhu Joseph

> All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
> -
>
> Key: YARN-9622
> URL: https://issues.apache.org/jira/browse/YARN-9622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver, timelineservice
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Prabhu Joseph
>Priority: Major
>
> When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, 
> the result is the following:
> {noformat}
> [ERROR] Failures: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Bad Request
> [ERROR] Errors: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunApps:1984->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> 

[jira] [Assigned] (YARN-9621) Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1

2019-06-13 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-9621:
---

Assignee: Prabhu Joseph

> Test failure 
> TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on 
> branch-3.1
> 
>
> Key: YARN-9621
> URL: https://issues.apache.org/jira/browse/YARN-9621
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell, test
>Affects Versions: 3.1.2
>Reporter: Peter Bacsko
>Assignee: Prabhu Joseph
>Priority: Major
>
> Testcase 
> {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} 
> seems to constantly fail on branch 3.1. I believe it was introduced by 
> YARN-9253.
> {noformat}
> testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager)
>   Time elapsed: 24.636 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2

2019-06-13 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863128#comment-16863128
 ] 

Prabhu Joseph commented on YARN-9622:
-

Will work on this, assigning to me.

> All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
> -
>
> Key: YARN-9622
> URL: https://issues.apache.org/jira/browse/YARN-9622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver, timelineservice
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Priority: Major
>
> When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, 
> the result is the following:
> {noformat}
> [ERROR] Failures: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Bad Request
> [ERROR] Errors: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunApps:1984->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> 

[jira] [Updated] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2

2019-06-13 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9622:
---
Affects Version/s: 3.2.0

> All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
> -
>
> Key: YARN-9622
> URL: https://issues.apache.org/jira/browse/YARN-9622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver, timelineservice
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Priority: Major
>
> When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, 
> the result is the following:
> {noformat}
> [ERROR] Failures: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Bad Request
> [ERROR] Errors: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunApps:1984->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> 

[jira] [Created] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage

2019-06-13 Thread Peter Bacsko (JIRA)
Peter Bacsko created YARN-9622:
--

 Summary: All testcase fails in 
TestTimelineReaderWebServicesHBaseStorage
 Key: YARN-9622
 URL: https://issues.apache.org/jira/browse/YARN-9622
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver, timelineservice
Reporter: Peter Bacsko


When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, 
the result is the following:

{noformat}
[ERROR] Failures: 
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
 Response from server should have been Not Found
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
 Response from server should have been Not Found
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
 Response from server should have been Bad Request
[ERROR] Errors: 
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunApps:1984->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunAppsNotPresent:2235->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetFlowRuns:488->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   
TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunsMetricsToRetrieve:616->AbstractTimelineReaderHBaseTestBase.getResponse:129
 » IO
[ERROR]   

[jira] [Updated] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2

2019-06-13 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9622:
---
Summary: All testcase fails in TestTimelineReaderWebServicesHBaseStorage on 
branch-3.2  (was: All testcase fails in 
TestTimelineReaderWebServicesHBaseStorage)

> All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
> -
>
> Key: YARN-9622
> URL: https://issues.apache.org/jira/browse/YARN-9622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver, timelineservice
>Reporter: Peter Bacsko
>Priority: Major
>
> When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, 
> the result is the following:
> {noformat}
> [ERROR] Failures: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Not Found
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140
>  Response from server should have been Bad Request
> [ERROR] Errors: 
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129
>  » IO
> [ERROR]   
> 

[jira] [Commented] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition

2019-06-13 Thread Tarun Parimi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863066#comment-16863066
 ] 

Tarun Parimi commented on YARN-9209:


Hi [~cheersyang], [~leftnoteasy],
Is there any way to proceed further towards a proper fix for this pending jira? 
The current patch fixes the issue, but I guess additional checks are needed?
Thanks.

> When nodePartition is not set in Placement Constraints, containers are 
> allocated only in default partition
> --
>
> Key: YARN-9209
> URL: https://issues.apache.org/jira/browse/YARN-9209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9209.001.patch
>
>
> When an application sets a placement constraint without specifying a 
> nodePartition, the default partition is always chosen as the constraint when 
> allocating containers. This can be a problem when an application is 
> submitted to a queue which doesn't have enough capacity available on the 
> default partition.
>  This is a common scenario when node labels are configured for a particular 
> queue. The below sample sleeper service cannot get even a single container 
> allocated when it is submitted to a "labeled_queue", even though enough 
> capacity is available on the label/partition configured for the queue. Only 
> the AM container runs. 
> {code:java}
> {
> "name": "sleeper-service",
> "version": "1.0.0",
> "queue": "labeled_queue",
> "components": [
> {
> "name": "sleeper",
> "number_of_containers": 2,
> "launch_command": "sleep 9",
> "resource": {
> "cpus": 1,
> "memory": "4096"
> },
> "placement_policy": {
> "constraints": [
> {
> "type": "ANTI_AFFINITY",
> "scope": "NODE",
> "target_tags": [
> "sleeper"
> ]
> }
> ]
> }
> }
> ]
> }
> {code}
> It runs fine if I specify the node_partition explicitly in the constraints 
> like below. 
> {code:java}
> {
> "name": "sleeper-service",
> "version": "1.0.0",
> "queue": "labeled_queue",
> "components": [
> {
> "name": "sleeper",
> "number_of_containers": 2,
> "launch_command": "sleep 9",
> "resource": {
> "cpus": 1,
> "memory": "4096"
> },
> "placement_policy": {
> "constraints": [
> {
> "type": "ANTI_AFFINITY",
> "scope": "NODE",
> "target_tags": [
> "sleeper"
> ],
> "node_partitions": [
> "label"
> ]
> }
> ]
> }
> }
> ]
> }
> {code} 
> The problem seems to be that only the default partition "" is considered 
> when the node_partition constraint is not specified, as seen in the RM log below. 
> {code:java}
> 2019-01-17 16:51:59,921 INFO placement.SingleConstraintAppPlacementAllocator 
> (SingleConstraintAppPlacementAllocator.java:validateAndSetSchedulingRequest(367))
>  - Successfully added SchedulingRequest to 
> app=appattempt_1547734161165_0010_01 targetAllocationTags=[sleeper]. 
> nodePartition= 
> {code} 
> However, I think it makes more sense to consider "*", or the 
> {{default-node-label-expression}} of the queue if configured, when no 
> node_partition is specified in the placement constraint. Not specifying 
> any node_partition should ideally mean we don't restrict the placement 
> constraint to any particular node_partition, yet we are currently enforcing 
> the default partition instead.
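A minimal sketch of the fallback proposed above; the class, method and parameter names are illustrative only and this is not the actual SingleConstraintAppPlacementAllocator change:

{code:java}
public final class PartitionFallback {
  private PartitionFallback() { }

  /**
   * If the SchedulingRequest did not specify a node partition, fall back to the
   * queue's default-node-label-expression when it is configured, and otherwise
   * to "*" (any partition).
   */
  static String effectivePartition(String requestedPartition, String queueDefaultLabel) {
    if (requestedPartition != null && !requestedPartition.isEmpty()) {
      return requestedPartition;
    }
    if (queueDefaultLabel != null && !queueDefaultLabel.isEmpty()) {
      return queueDefaultLabel;
    }
    return "*";
  }

  public static void main(String[] args) {
    System.out.println(effectivePartition("", "label"));    // -> label
    System.out.println(effectivePartition("", null));       // -> *
    System.out.println(effectivePartition("gpu", "label")); // -> gpu
  }
}
{code}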



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9621) Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1

2019-06-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863059#comment-16863059
 ] 

Peter Bacsko commented on YARN-9621:


[~Prabhu Joseph] do you know how to fix this? Branch 3.1 is still active, so 
it would be good to handle it.

> Test failure 
> TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on 
> branch-3.1
> 
>
> Key: YARN-9621
> URL: https://issues.apache.org/jira/browse/YARN-9621
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: Peter Bacsko
>Priority: Major
>
> Testcase 
> {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} 
> seems to constantly fail on branch 3.1. I believe it was introduced by 
> YARN-9253.
> {noformat}
> testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager)
>   Time elapsed: 24.636 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9621) Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1

2019-06-13 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9621:
---
Component/s: test
 distributed-shell

> Test failure 
> TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on 
> branch-3.1
> 
>
> Key: YARN-9621
> URL: https://issues.apache.org/jira/browse/YARN-9621
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell, test
>Affects Versions: 3.1.2
>Reporter: Peter Bacsko
>Priority: Major
>
> Testcase 
> {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} 
> seems to constantly fail on branch 3.1. I believe it was introduced by 
> YARN-9253.
> {noformat}
> testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager)
>   Time elapsed: 24.636 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9621) Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1

2019-06-13 Thread Peter Bacsko (JIRA)
Peter Bacsko created YARN-9621:
--

 Summary: Test failure 
TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on 
branch-3.1
 Key: YARN-9621
 URL: https://issues.apache.org/jira/browse/YARN-9621
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.2
Reporter: Peter Bacsko


Testcase 
{{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} 
seems to constantly fail on branch 3.1. I believe it was introduced by 
YARN-9253.

{noformat}
testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager)
  Time elapsed: 24.636 s  <<< FAILURE!
java.lang.AssertionError: expected:<1> but was:<2>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4042) YARN registry should handle the absence of ZK node

2019-06-13 Thread wangxiangchun (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863058#comment-16863058
 ] 

wangxiangchun commented on YARN-4042:
-

Hi, may I ask how you solved the problem? I encountered the same problem; I 
followed the answer and deleted the version-2 file in zkdata, but that didn't 
solve the problem. Could you share your experience?

> YARN registry should handle the absence of ZK node
> --
>
> Key: YARN-4042
> URL: https://issues.apache.org/jira/browse/YARN-4042
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> {noformat}
> 2015-08-10 11:33:46,931 WARN [LlapSchedulerNodeEnabler] 
> rm.LlapTaskSchedulerService: Could not refresh list of active instances
> org.apache.hadoop.fs.PathNotFoundException: 
> `/registry/users/huzheng/services/org-apache-hive/llap0/components/workers/worker-25':
>  No such file or directory: KeeperErrorCode = NoNode for 
> /registry/users/huzheng/services/org-apache-hive/llap0/components/workers/worker-25
>   at 
> org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:377)
>   at 
> org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:360)
>   at 
> org.apache.hadoop.registry.client.impl.zk.CuratorService.zkRead(CuratorService.java:720)
>   at 
> org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.resolve(RegistryOperationsService.java:120)
>   at 
> org.apache.hadoop.registry.client.binding.RegistryUtils.extractServiceRecords(RegistryUtils.java:321)
>   at 
> org.apache.hadoop.registry.client.binding.RegistryUtils.listServiceRecords(RegistryUtils.java:177)
>   at 
> org.apache.hadoop.hive.llap.daemon.registry.impl.LlapYarnRegistryImpl$DynamicServiceInstanceSet.refresh(LlapYarnRegistryImpl.java:278)
>   at 
> org.apache.tez.dag.app.rm.LlapTaskSchedulerService.refreshInstances(LlapTaskSchedulerService.java:584)
>   at 
> org.apache.tez.dag.app.rm.LlapTaskSchedulerService.access$900(LlapTaskSchedulerService.java:79)
>   at 
> org.apache.tez.dag.app.rm.LlapTaskSchedulerService$NodeEnablerCallable.call(LlapTaskSchedulerService.java:887)
>   at 
> org.apache.tez.dag.app.rm.LlapTaskSchedulerService$NodeEnablerCallable.call(LlapTaskSchedulerService.java:855)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for 
> /registry/users/huzheng/services/org-apache-hive/llap0/components/workers/worker-25
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>   at 
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302)
>   at 
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291)
>   at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
>   at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:288)
>   at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279)
>   at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41)
>   at 
> org.apache.hadoop.registry.client.impl.zk.CuratorService.zkRead(CuratorService.java:718)
>   ... 12 more
> {noformat}
> ZK nodes can disappear after listing; for example, an ephemeral node can be 
> cleaned up. The YARN registry should handle that.
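A hedged sketch of tolerating a znode that vanishes between listing and reading; the class and method names are illustrative and not the actual RegistryOperationsService API:

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.KeeperException;

public final class SafeZkRead {
  private SafeZkRead() { }

  /** Returns the znode data, or null if the node disappeared after it was listed. */
  static byte[] readIfPresent(CuratorFramework curator, String path) throws Exception {
    try {
      return curator.getData().forPath(path);
    } catch (KeeperException.NoNodeException e) {
      // The (possibly ephemeral) node was removed between listing and reading; skip it.
      return null;
    }
  }
}
{code}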



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9607) Auto-configuring rollover-size of IFile format for non-appendable filesystems

2019-06-13 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862988#comment-16862988
 ] 

Szilard Nemeth commented on YARN-9607:
--

Oh, one more thing: 
Since you are overriding the config in 
LogAggregationIndexedFileController.getRollOverLogMaxSize, you should at least 
log a statement that the value was overridden, so users aren't confused about 
why the value ended up being zero.
Do you agree [~adam.antal]? 


> Auto-configuring rollover-size of IFile format for non-appendable filesystems
> -
>
> Key: YARN-9607
> URL: https://issues.apache.org/jira/browse/YARN-9607
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9607.001.patch
>
>
> In YARN-9525, we made the IFile format compatible with remote folders with the 
> s3a scheme. In rolling-fashioned log aggregation, IFile still fails with the 
> "append is not supported" error message, which is a known limitation of the 
> format by design. 
> There is a workaround though: by setting the rollover size in the configuration 
> of the IFile format, a new aggregated log file will be created in each rolling 
> cycle, thus eliminating the append from the process. Setting this config 
> globally would cause performance problems in regular log aggregation, so 
> I'm suggesting enforcing this config to zero if the scheme of the URI is 
> s3a (or any other non-appendable filesystem).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9607) Auto-configuring rollover-size of IFile format for non-appendable filesystems

2019-06-13 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862984#comment-16862984
 ] 

Szilard Nemeth commented on YARN-9607:
--

Hi [~adam.antal]!
2 comments: 

1. In the test code you are testing the IFile format, but the error messages are 
wrong, e.g. "TFile controller... ". Those should start with IFile, right?
2. For the code where you check for the non-appendable schemes: I would put the 
non-appendable scheme strings into a Set and simply check whether the string is 
in the set, so the if condition could be more straightforward (see the sketch below).
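A minimal sketch of these two suggestions (a Set-based scheme check, plus a log statement when the rollover size is forced to zero); the class, method and constant names are illustrative and not the actual LogAggregationIndexedFileController code:

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RollOverSizeSketch {

  // s3a is the scheme named in this jira; other non-appendable schemes could be added.
  private static final Set<String> NON_APPENDABLE_SCHEMES =
      new HashSet<>(Arrays.asList("s3a"));

  static long effectiveRollOverMaxSize(String scheme, long configuredSize) {
    if (scheme != null && NON_APPENDABLE_SCHEMES.contains(scheme.toLowerCase())) {
      // In the real controller this would go through the component's logger.
      System.out.println("Scheme " + scheme + " does not support append; overriding "
          + "rollover log max size from " + configuredSize + " to 0");
      return 0L;
    }
    return configuredSize;
  }

  public static void main(String[] args) {
    System.out.println(effectiveRollOverMaxSize("s3a", 10L));  // overridden to 0
    System.out.println(effectiveRollOverMaxSize("hdfs", 10L)); // stays 10
  }
}
{code}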

> Auto-configuring rollover-size of IFile format for non-appendable filesystems
> -
>
> Key: YARN-9607
> URL: https://issues.apache.org/jira/browse/YARN-9607
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9607.001.patch
>
>
> In YARN-9525, we made the IFile format compatible with remote folders with the 
> s3a scheme. In rolling-fashioned log aggregation, IFile still fails with the 
> "append is not supported" error message, which is a known limitation of the 
> format by design. 
> There is a workaround though: by setting the rollover size in the configuration 
> of the IFile format, a new aggregated log file will be created in each rolling 
> cycle, thus eliminating the append from the process. Setting this config 
> globally would cause performance problems in regular log aggregation, so 
> I'm suggesting enforcing this config to zero if the scheme of the URI is 
> s3a (or any other non-appendable filesystem).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-13 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8995:
---
Attachment: TestStreamPerf.java

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event type of the too big event queue size, add the information 
> to the metrics, and make the threshold of the queue size a parameter which can 
> be changed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-13 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8995:
---
Attachment: TestStreamPerf.java

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-8995.001.patch, YARN-8995.002.patch, 
> YARN-8995.003.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event type of the too big event queue size, add the information 
> to the metrics, and make the threshold of the queue size a parameter which can 
> be changed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-13 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8995:
---
Attachment: (was: TestStreamPerf.java)

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-8995.001.patch, YARN-8995.002.patch, 
> YARN-8995.003.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event type of the too big event queue size, add the information 
> to the metrics, and make the threshold of the queue size a parameter which can 
> be changed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-13 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862866#comment-16862866
 ] 

Tao Yang edited comment on YARN-8995 at 6/13/19 9:29 AM:
-

I did a simple test (details in TestStreamPerf.java) comparing the performance 
of a sequential stream and a parallel stream in a similar scenario: counting a 
blocking queue with 100 distinct keys and 1w/10w/100w/200w total length. It 
seems that the parallel stream indeed leads to more overhead than the sequential 
stream. The results of this test are as follows (suffix "_S" refers to the 
sequential stream and suffix "_PS" refers to the parallel stream):
{noformat}
TestStreamPerf.test_100_100w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
 round: 0.03 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 1, GC.time: 0.01, time.total: 0.64, time.warmup: 0.31, time.bench: 
0.32
TestStreamPerf.test_100_100w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
 round: 0.02 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.37, time.warmup: 0.15, time.bench: 
0.22
TestStreamPerf.test_100_10w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.08, time.warmup: 0.05, time.bench: 
0.04
TestStreamPerf.test_100_10w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.04, time.warmup: 0.01, time.bench: 
0.03
TestStreamPerf.test_100_1w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 
0.01
TestStreamPerf.test_100_1w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
 round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 
0.00
TestStreamPerf.test_100_200w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
 round: 0.07 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 1.03, time.warmup: 0.37, time.bench: 
0.66
TestStreamPerf.test_100_200w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
 round: 0.04 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.70, time.warmup: 0.25, time.bench: 
0.45
{noformat}


was (Author: tao yang):
I did a simple test on performance comparison between sequential stream and 
parallel stream in a similar scenario: count a blocking queue with 100 distinct 
keys and 1w/10w/100w/200w total length, it seems that parallel stream indeed 
lead to more overhead than sequential stream, results of this test are as 
follows (suffix "_S" refers to sequential stream and suffix "_PS" refers to 
parallel stream):
{noformat}
TestStreamPerf.test_100_1w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 
0.00
TestStreamPerf.test_100_1w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 
0.01
TestStreamPerf.test_100_10w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.04, time.warmup: 0.01, time.bench: 
0.03
TestStreamPerf.test_100_10w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.14, time.warmup: 0.09, time.bench: 
0.05
TestStreamPerf.test_100_100w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.03 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.43, time.warmup: 0.17, time.bench: 
0.26
TestStreamPerf.test_100_100w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.04 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.56, time.warmup: 0.20, time.bench: 
0.36
TestStreamPerf.test_100_200w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.05 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.75, time.warmup: 0.25, time.bench: 
0.50
TestStreamPerf.test_100_200w_PS: [measured 10 out of 

[jira] [Commented] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.

2019-06-13 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862867#comment-16862867
 ] 

Abhishek Modi commented on YARN-9608:
-

Thanks [~tangzhankun] for reviewing it.

> DecommissioningNodesWatcher should get lists of running applications on node 
> from RMNode.
> -
>
> Key: YARN-9608
> URL: https://issues.apache.org/jira/browse/YARN-9608
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9608.001.patch, YARN-9608.002.patch
>
>
> At present, DecommissioningNodesWatcher tracks the list of running applications 
> and triggers decommission of nodes when all the applications that ran on the 
> node complete. This Jira proposes to solve the following problems:
>  # DecommissioningNodesWatcher skips tracking application containers on a 
> particular node before the node is in DECOMMISSIONING state. It only tracks 
> containers once the node is in DECOMMISSIONING state. This can lead to 
> shuffle data loss for apps whose containers ran on this node before it was 
> moved to the decommissioning state.
>  # It keeps its own track of running apps. We can leverage this directly from 
> RMNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-13 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862866#comment-16862866
 ] 

Tao Yang commented on YARN-8995:


I did a simple test comparing the performance of a sequential stream and a 
parallel stream in a similar scenario: counting a blocking queue with 100 
distinct keys and 1w/10w/100w/200w total length. It seems that the parallel 
stream indeed leads to more overhead than the sequential stream. The results of 
this test are as follows (suffix "_S" refers to the sequential stream and 
suffix "_PS" refers to the parallel stream):
{noformat}
TestStreamPerf.test_100_1w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 
0.00
TestStreamPerf.test_100_1w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 
0.01
TestStreamPerf.test_100_10w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.04, time.warmup: 0.01, time.bench: 
0.03
TestStreamPerf.test_100_10w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.14, time.warmup: 0.09, time.bench: 
0.05
TestStreamPerf.test_100_100w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.03 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.43, time.warmup: 0.17, time.bench: 
0.26
TestStreamPerf.test_100_100w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.04 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.56, time.warmup: 0.20, time.bench: 
0.36
TestStreamPerf.test_100_200w_S: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.05 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.75, time.warmup: 0.25, time.bench: 
0.50
TestStreamPerf.test_100_200w_PS: [measured 10 out of 15 rounds, threads: 1 
(sequential)]
round: 0.07 [+- 0.01], round.block: 0.01 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 1.06, time.warmup: 0.35, time.bench: 
0.71
{noformat}
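For readers without the attached TestStreamPerf.java, a rough, self-contained sketch of this kind of comparison might look like the following; the class name, key count and queue length are illustrative and this is not the attached test:

{code:java}
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;

// Counts a blocking queue with 100 distinct keys, once with a sequential stream
// and once with a parallel stream, and prints the elapsed time of each pass.
public class StreamCountComparison {
  public static void main(String[] args) {
    BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
    for (int i = 0; i < 2_000_000; i++) {          // roughly the "200w" case
      queue.add(ThreadLocalRandom.current().nextInt(100));
    }

    long t0 = System.nanoTime();
    Map<Integer, Long> sequential = queue.stream()
        .collect(Collectors.groupingBy(k -> k, Collectors.counting()));
    long t1 = System.nanoTime();
    Map<Integer, Long> parallel = queue.parallelStream()
        .collect(Collectors.groupingBy(k -> k, Collectors.counting()));
    long t2 = System.nanoTime();

    System.out.printf("sequential: %.3fs, parallel: %.3fs, keys: %d/%d%n",
        (t1 - t0) / 1e9, (t2 - t1) / 1e9, sequential.size(), parallel.size());
  }
}
{code}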

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-8995.001.patch, YARN-8995.002.patch, 
> YARN-8995.003.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event type of the too big event queue size, add the information 
> to the metrics, and make the threshold of the queue size a parameter which can 
> be changed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.

2019-06-13 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862826#comment-16862826
 ] 

Zhankun Tang commented on YARN-9608:


[~abmodi], Yeah. Thanks for the explanation! +1 from me. I can help to commit 
this if no one opposes.

> DecommissioningNodesWatcher should get lists of running applications on node 
> from RMNode.
> -
>
> Key: YARN-9608
> URL: https://issues.apache.org/jira/browse/YARN-9608
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9608.001.patch, YARN-9608.002.patch
>
>
> At present, DecommissioningNodesWatcher tracks the list of running applications 
> and triggers decommission of nodes when all the applications that ran on the 
> node complete. This Jira proposes to solve the following problems:
>  # DecommissioningNodesWatcher skips tracking application containers on a 
> particular node before the node is in DECOMMISSIONING state. It only tracks 
> containers once the node is in DECOMMISSIONING state. This can lead to 
> shuffle data loss for apps whose containers ran on this node before it was 
> moved to the decommissioning state.
>  # It keeps its own track of running apps. We can leverage this directly from 
> RMNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-13 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862821#comment-16862821
 ] 

Tao Yang edited comment on YARN-8995 at 6/13/19 8:27 AM:
-

Thanks [~zhuqi] for updating the patch.
Comments for the new patch:
* Sorry, I made a mistake in my last comment: serviceInit is a more proper 
place to initialize the conf; you can then remove the initial value of the 
detailsInterval field.
* There's no need to separate the name with a double "\_" as in 
"...EVENTS__INFO..."; "...EVENTS_INFO..." is ok. The annotation "The interval 
thousands of queue size" can be replaced with "The interval of queue size (in 
thousands)".
* For parallelStream, overhead is involved in splitting the work among several 
threads and joining or merging the results; I prefer using a sequential stream 
in this scenario, which has no I/O operations and only needs to count event 
types. Moreover, we can use the groupingBy API like this: 
{{eventQueue.stream().collect(Collectors.groupingBy(e -> e.getType(), 
Collectors.counting()))}}, instead of calling Collectors#toConcurrentMap or 
Collectors#toMap.


was (Author: tao yang):
Thanks [~zhuqi] for updating the patch.
Comments for the new patch:
* Sorry to have made a mistake in my last comment, serviceInit is a more proper 
place to initialize conf, then you can remove the initial value for 
detailsInterval field.
* There's no need to separate name with double "_" for "...EVENTS__INFO...", 
"...EVENTS_INFO..." is ok. The annotation "The interval thousands of ..." can 
be replaced as "The interval of ... (in thousands)".
* For parallelStream, overhead is involved in splitting the work among several 
threads and joining or merging the results, I prefer using sequential stream in 
this scenario which has no I/O operations and only need to count for event 
types. Moreover, we can use groupingBy API like this: 
{{eventQueue.stream().collect(Collectors.groupingBy(e -> e.getType(), 
Collectors.counting()))}}, instead of calling Collectors#toConcurrentMap or 
Collectors#toMap.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-8995.001.patch, YARN-8995.002.patch, 
> YARN-8995.003.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event type of the too big event queue size, add the information 
> to the metrics, and make the threshold of the queue size a parameter which can 
> be changed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-13 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862821#comment-16862821
 ] 

Tao Yang commented on YARN-8995:


Thanks [~zhuqi] for updating the patch.
Comments for the new patch:
* Sorry, I made a mistake in my last comment: serviceInit is a more proper 
place to initialize the conf; you can then remove the initial value of the 
detailsInterval field.
* There's no need to separate the name with a double "_" as in 
"...EVENTS__INFO..."; "...EVENTS_INFO..." is ok. The annotation "The interval 
thousands of ..." can be replaced with "The interval of ... (in thousands)".
* For parallelStream, overhead is involved in splitting the work among several 
threads and joining or merging the results; I prefer using a sequential stream 
in this scenario, which has no I/O operations and only needs to count event 
types. Moreover, we can use the groupingBy API like this: 
{{eventQueue.stream().collect(Collectors.groupingBy(e -> e.getType(), 
Collectors.counting()))}}, instead of calling Collectors#toConcurrentMap or 
Collectors#toMap. A minimal runnable sketch of this counting follows below.
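A hedged, self-contained sketch of that suggestion; the classes below stand in for the real AsyncDispatcher event types and their names are illustrative only:

{code:java}
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

public class EventQueueBreakdown {
  // Stand-ins for YARN's Event/event-type classes.
  enum FakeType { NODE_UPDATE, APP_ADDED, CONTAINER_FINISHED }

  static class FakeEvent {
    private final FakeType type;
    FakeEvent(FakeType type) { this.type = type; }
    FakeType getType() { return type; }
  }

  public static void main(String[] args) {
    BlockingQueue<FakeEvent> eventQueue = new LinkedBlockingQueue<>();
    eventQueue.add(new FakeEvent(FakeType.NODE_UPDATE));
    eventQueue.add(new FakeEvent(FakeType.NODE_UPDATE));
    eventQueue.add(new FakeEvent(FakeType.CONTAINER_FINISHED));

    // Sequential stream + groupingBy/counting, as suggested above; this map could
    // be logged when the queue size crosses the configurable threshold.
    Map<FakeType, Long> counts = eventQueue.stream()
        .collect(Collectors.groupingBy(FakeEvent::getType, Collectors.counting()));
    System.out.println("Event queue breakdown by type: " + counts);
  }
}
{code}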

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-8995.001.patch, YARN-8995.002.patch, 
> YARN-8995.003.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event type of the too big event queue size, add the information 
> to the metrics, and make the threshold of the queue size a parameter which can 
> be changed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.

2019-06-13 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862793#comment-16862793
 ] 

Abhishek Modi commented on YARN-9608:
-

Thanks [~tangzhankun] for going through the patch:
{quote} # If there's a long-running Spark shell application A in YARN cluster 
mode, is the timeout the only thing that can cause the decommissioning node 1 
(app A's container ran on it previously, but A's AM is running on node 2) to 
shut down?{quote}
Yes, in this case only a timeout or application finish can cause the 
decommissioning to complete. This behavior would be similar to the behavior if 
this node had been put into the decommissioning state while a container for app 
A was running on it.
{quote} And if node 1 is shut down due to the timeout, when node 1 is 
re-registered in the future, will node 1 still be considered to belong to the 
running application A?
{quote}
No, if the node was shut down when no container was running on it, it won't be 
considered to belong to app A. But if work-preserving NodeManager recovery was 
enabled and a container was recovered on that node for app A, the node will be 
considered to be running app A.

> DecommissioningNodesWatcher should get lists of running applications on node 
> from RMNode.
> -
>
> Key: YARN-9608
> URL: https://issues.apache.org/jira/browse/YARN-9608
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9608.001.patch, YARN-9608.002.patch
>
>
> At present, DecommissioningNodesWatcher tracks the list of running applications 
> and triggers decommission of nodes when all the applications that ran on the 
> node complete. This Jira proposes to solve the following problems:
>  # DecommissioningNodesWatcher skips tracking application containers on a 
> particular node before the node is in DECOMMISSIONING state. It only tracks 
> containers once the node is in DECOMMISSIONING state. This can lead to 
> shuffle data loss for apps whose containers ran on this node before it was 
> moved to the decommissioning state.
>  # It keeps its own track of running apps. We can leverage this directly from 
> RMNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.

2019-06-13 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862789#comment-16862789
 ] 

Zhankun Tang commented on YARN-9608:


[~abmodi], Thanks. I just read through the whole patch. Two questions:

1. If there's a long-running Spark shell application A in YARN cluster mode, is 
the timeout the only thing that can cause the decommissioning node 1 (app A's 
container ran on it previously, but A's AM is running on node 2) to shut down?

2. And if node 1 is shut down due to the timeout, when node 1 is re-registered 
in the future, will node 1 still be considered to belong to the running 
application A?

> DecommissioningNodesWatcher should get lists of running applications on node 
> from RMNode.
> -
>
> Key: YARN-9608
> URL: https://issues.apache.org/jira/browse/YARN-9608
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9608.001.patch, YARN-9608.002.patch
>
>
> At present, DecommissioningNodesWatcher tracks the list of running applications 
> and triggers decommission of nodes when all the applications that ran on the 
> node complete. This Jira proposes to solve the following problems:
>  # DecommissioningNodesWatcher skips tracking application containers on a 
> particular node before the node is in DECOMMISSIONING state. It only tracks 
> containers once the node is in DECOMMISSIONING state. This can lead to 
> shuffle data loss for apps whose containers ran on this node before it was 
> moved to the decommissioning state.
>  # It keeps its own track of running apps. We can leverage this directly from 
> RMNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org