[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863712#comment-16863712 ]

zhuqi commented on YARN-8995:
-----------------------------

cc [~Tao Yang] Thanks [~Tao Yang] for your comment and the persuasive test result. I have now changed my code in the new patch; since there is no serviceInit method, I initialize my configuration in the constructor.

> Log the event type of the too big AsyncDispatcher event queue size, and add
> the information to the metrics.
> ---------------------------------------------------------------------------
>
>                 Key: YARN-8995
>                 URL: https://issues.apache.org/jira/browse/YARN-8995
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: metrics, nodemanager, resourcemanager
>    Affects Versions: 3.2.0
>            Reporter: zhuqi
>            Assignee: zhuqi
>            Priority: Major
>         Attachments: TestStreamPerf.java, YARN-8995.001.patch,
>                      YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event
> queues to degrade the performance of the cluster, such as the bug in
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to
> log the event types when the event queue grows too big, add the information
> to the metrics, and make the queue-size threshold a configurable parameter.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
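The idea proposed in YARN-8995 can be sketched with a minimal, self-contained class. This is illustrative only, not the actual patch: the class name `EventQueueMonitor` and its methods are hypothetical, and the real AsyncDispatcher would drive this from its own enqueue/dequeue path with the threshold read from configuration.

```java
// Hypothetical sketch (not the YARN-8995 patch): track queued events per type
// so the event-type breakdown can be logged once the queue size passes a
// configurable threshold.
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class EventQueueMonitor {
    private final int threshold; // configurable queue-size threshold
    private final Map<String, Integer> countsByType = new HashMap<>();
    private int totalQueued = 0;

    public EventQueueMonitor(int threshold) {
        this.threshold = threshold;
    }

    public synchronized void onEnqueue(String eventType) {
        countsByType.merge(eventType, 1, Integer::sum);
        totalQueued++;
        if (totalQueued > threshold) {
            // In YARN this would go to the service log and a metrics sink.
            System.out.println("Event queue size " + totalQueued
                + " exceeds threshold " + threshold
                + "; queued events by type: " + new TreeMap<>(countsByType));
        }
    }

    public synchronized void onDequeue(String eventType) {
        countsByType.merge(eventType, -1, Integer::sum);
        totalQueued--;
    }

    public synchronized int count(String eventType) {
        return countsByType.getOrDefault(eventType, 0);
    }

    public static void main(String[] args) {
        EventQueueMonitor monitor = new EventQueueMonitor(2);
        monitor.onEnqueue("NODE_UPDATE");
        monitor.onEnqueue("NODE_UPDATE");
        monitor.onEnqueue("STATUS_UPDATE"); // crosses the threshold, logs a summary
    }
}
```

The per-type counts are what make the log line actionable: a blocked queue dominated by one event type points directly at the misbehaving handler.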
[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhuqi updated YARN-8995:
------------------------
    Attachment: YARN-8995.004.patch

> Log the event type of the too big AsyncDispatcher event queue size, and add
> the information to the metrics.
> ---------------------------------------------------------------------------
>
>                 Key: YARN-8995
>                 URL: https://issues.apache.org/jira/browse/YARN-8995
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: metrics, nodemanager, resourcemanager
>    Affects Versions: 3.2.0
>            Reporter: zhuqi
>            Assignee: zhuqi
>            Priority: Major
>         Attachments: TestStreamPerf.java, YARN-8995.001.patch,
>                      YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event
> queues to degrade the performance of the cluster, such as the bug in
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to
> log the event types when the event queue grows too big, add the information
> to the metrics, and make the queue-size threshold a configurable parameter.
[jira] [Updated] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Yang updated YARN-9623:
---------------------------
    Summary: Auto adjust max queue length of app activities to make sure activities on all nodes can be covered  (was: Auto adjust queue length of app activities to make sure activities on all nodes can be covered)

> Auto adjust max queue length of app activities to make sure activities on all
> nodes can be covered
> -----------------------------------------------------------------------------
>
>                 Key: YARN-9623
>                 URL: https://issues.apache.org/jira/browse/YARN-9623
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>
> Currently we can use the configuration entry
> "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to
> control the max queue length of app activities, but in some scenarios this
> configuration may need to be updated as a cluster grows. Moreover, it's
> better for users to be able to ignore that conf, therefore it should be
> auto-adjusted internally.
> There are some differences among the scheduling modes:
> * multi-node placement disabled
> ** Heartbeat-driven scheduling: the max queue length of app activities should
> not be less than the number of nodes; since nodes cannot always be visited in
> order, we should leave some room for out-of-order heartbeats, for example by
> guaranteeing that the max queue length is not less than 1.2 * numNodes.
> ** Async scheduling: every async scheduling thread goes through all nodes in
> order, so in this mode we should guarantee that the max queue length is
> numThreads * numNodes.
> * multi-node placement enabled: activities on all nodes can be involved in a
> single app allocation, therefore there's no need to adjust for this mode.
> To sum up, we can adjust the max queue length of app activities like this:
> {code}
> int configuredMaxQueueLength;
> int maxQueueLength;
>
> serviceInit() {
>   ...
>   configuredMaxQueueLength = ...; // read configured max queue length
>   maxQueueLength = configuredMaxQueueLength; // take configured value as default
> }
>
> CleanupThread#run() {
>   ...
>   if (multiNodeDisabled) {
>     if (asyncSchedulingEnabled) {
>       maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * numNodes);
>     } else {
>       maxQueueLength = max(configuredMaxQueueLength, (int) (1.2 * numNodes));
>     }
>   }
> }
> {code}
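The adjustment rule sketched in the issue can be turned into a small runnable example. The constants and branches (numThreads * numNodes for async scheduling, a 1.2 headroom factor for heartbeat-driven scheduling, no change with multi-node placement) come straight from the description above; the class and method names here are illustrative, not the eventual patch.

```java
// Runnable sketch of the max-queue-length adjustment proposed in YARN-9623.
public class AppActivitiesQueueLength {

    static int adjustedMaxQueueLength(int configuredMax, boolean multiNodeEnabled,
                                      boolean asyncScheduling,
                                      int numSchedulingThreads, int numNodes) {
        if (multiNodeEnabled) {
            // Activities on all nodes can appear in a single app allocation:
            // no adjustment needed, keep the configured value.
            return configuredMax;
        }
        if (asyncScheduling) {
            // Every async scheduling thread walks all nodes in order.
            return Math.max(configuredMax, numSchedulingThreads * numNodes);
        }
        // Heartbeat-driven: leave ~20% headroom for out-of-order heartbeats.
        return Math.max(configuredMax, (int) (1.2 * numNodes));
    }

    public static void main(String[] args) {
        System.out.println(adjustedMaxQueueLength(100, false, false, 0, 1000)); // heartbeat-driven
        System.out.println(adjustedMaxQueueLength(100, false, true, 4, 1000));  // async scheduling
        System.out.println(adjustedMaxQueueLength(100, true, false, 0, 1000));  // multi-node placement
    }
}
```

Taking the max against the configured value means an operator who deliberately set a large queue length is never downgraded by the auto-adjustment.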
[jira] [Created] (YARN-9623) Auto adjust queue length of app activities to make sure activities on all nodes can be covered
Tao Yang created YARN-9623:
------------------------------

             Summary: Auto adjust queue length of app activities to make sure activities on all nodes can be covered
                 Key: YARN-9623
                 URL: https://issues.apache.org/jira/browse/YARN-9623
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Tao Yang
            Assignee: Tao Yang


Currently we can use the configuration entry "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to control the max queue length of app activities, but in some scenarios this configuration may need to be updated as a cluster grows. Moreover, it's better for users to be able to ignore that conf, therefore it should be auto-adjusted internally.
There are some differences among the scheduling modes:
* multi-node placement disabled
** Heartbeat-driven scheduling: the max queue length of app activities should not be less than the number of nodes; since nodes cannot always be visited in order, we should leave some room for out-of-order heartbeats, for example by guaranteeing that the max queue length is not less than 1.2 * numNodes.
** Async scheduling: every async scheduling thread goes through all nodes in order, so in this mode we should guarantee that the max queue length is numThreads * numNodes.
* multi-node placement enabled: activities on all nodes can be involved in a single app allocation, therefore there's no need to adjust for this mode.
To sum up, we can adjust the max queue length of app activities like this:
{code}
int configuredMaxQueueLength;
int maxQueueLength;

serviceInit() {
  ...
  configuredMaxQueueLength = ...; // read configured max queue length
  maxQueueLength = configuredMaxQueueLength; // take configured value as default
}

CleanupThread#run() {
  ...
  if (multiNodeDisabled) {
    if (asyncSchedulingEnabled) {
      maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * numNodes);
    } else {
      maxQueueLength = max(configuredMaxQueueLength, (int) (1.2 * numNodes));
    }
  }
}
{code}
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863691#comment-16863691 ]

Tao Yang commented on YARN-9567:
--------------------------------

Thanks [~cheersyang]. Attached v1 patch for review.

> Add diagnostics for outstanding resource requests on app attempts page
> ----------------------------------------------------------------------
>
>                 Key: YARN-9567
>                 URL: https://issues.apache.org/jira/browse/YARN-9567
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-9567.001.patch, image-2019-06-04-17-29-29-368.png,
>                      image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png,
>                      image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png,
>                      show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests;
> it would be helpful for users to know why they are outstanding if we showed
> this app's diagnostics alongside them.
> Discussed with [~cheersyang]: we can passively load diagnostics from the
> cache of completed app activities instead of actively triggering them, which
> may bring uncontrollable risks.
> For example:
> (1) At first, no diagnostics from the cache are shown below the outstanding
> requests if app activities haven't been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see
> diagnostics.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Yang updated YARN-9567:
---------------------------
    Attachment: YARN-9567.001.patch

> Add diagnostics for outstanding resource requests on app attempts page
> ----------------------------------------------------------------------
>
>                 Key: YARN-9567
>                 URL: https://issues.apache.org/jira/browse/YARN-9567
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-9567.001.patch, image-2019-06-04-17-29-29-368.png,
>                      image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png,
>                      image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png,
>                      show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests;
> it would be helpful for users to know why they are outstanding if we showed
> this app's diagnostics alongside them.
> Discussed with [~cheersyang]: we can passively load diagnostics from the
> cache of completed app activities instead of actively triggering them, which
> may bring uncontrollable risks.
> For example:
> (1) At first, no diagnostics from the cache are shown below the outstanding
> requests if app activities haven't been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see
> diagnostics.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
[jira] [Comment Edited] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855530#comment-16855530 ]

Tao Yang edited comment on YARN-9567 at 6/14/19 3:24 AM:
---------------------------------------------------------

Some updates about this issue:
# Support summarizing app activities on nodes across multiple scheduling processes to get comprehensive information for better debugging, based on YARN-9578.
# Support partial refresh on the app attempt page, giving two ways to get diagnostics:
** When refreshing the app attempt page, activities are queried and shown directly from the cache.
** When clicking the refresh button, activities are updated immediately and shown after about 2 seconds.
# Diagnostics can be classified into 3 levels (request, app and scheduler activities).
** Request level !image-2019-06-04-17-29-29-368.png|width=1287,height=90!
** App level !image-2019-06-04-17-31-31-820.png|width=648,height=63!
** Scheduler activities level (if app diagnostics can't be found, all nodes in the scheduling process are shown from scheduler activities for debugging) !image-2019-06-14-11-21-41-066.png|width=891,height=159!
Please feel free to give your suggestions! I will attach the patch after its dependency, YARN-9578, is resolved.

was (Author: tao yang):
Some updates about this issue:
# Support summarizing app activities on nodes across multiple scheduling processes to get comprehensive information for better debugging, based on YARN-9578.
# Support partial refresh on the app attempt page, giving two ways to get diagnostics:
** When refreshing the app attempt page, activities are queried and shown directly from the cache.
** When clicking the refresh button, activities are updated immediately and shown after about 2 seconds.
# Diagnostics can be classified into 3 levels (request, app and scheduler activities).
** Request level !image-2019-06-04-17-29-29-368.png|width=1287,height=90!
** App level !image-2019-06-04-17-31-31-820.png|width=648,height=63!
** Scheduler activities level !image-2019-06-04-17-58-11-886.png|width=731,height=121!
Please feel free to give your suggestions! I will attach the patch after its dependency, YARN-9578, is resolved.

> Add diagnostics for outstanding resource requests on app attempts page
> ----------------------------------------------------------------------
>
>                 Key: YARN-9567
>                 URL: https://issues.apache.org/jira/browse/YARN-9567
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: image-2019-06-04-17-29-29-368.png,
>                      image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png,
>                      image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png,
>                      show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests;
> it would be helpful for users to know why they are outstanding if we showed
> this app's diagnostics alongside them.
> Discussed with [~cheersyang]: we can passively load diagnostics from the
> cache of completed app activities instead of actively triggering them, which
> may bring uncontrollable risks.
> For example:
> (1) At first, no diagnostics from the cache are shown below the outstanding
> requests if app activities haven't been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see
> diagnostics.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Yang updated YARN-9567:
---------------------------
    Attachment: (was: image-2019-06-14-11-14-31-874.png)

> Add diagnostics for outstanding resource requests on app attempts page
> ----------------------------------------------------------------------
>
>                 Key: YARN-9567
>                 URL: https://issues.apache.org/jira/browse/YARN-9567
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: image-2019-06-04-17-29-29-368.png,
>                      image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png,
>                      image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png,
>                      show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests;
> it would be helpful for users to know why they are outstanding if we showed
> this app's diagnostics alongside them.
> Discussed with [~cheersyang]: we can passively load diagnostics from the
> cache of completed app activities instead of actively triggering them, which
> may bring uncontrollable risks.
> For example:
> (1) At first, no diagnostics from the cache are shown below the outstanding
> requests if app activities haven't been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see
> diagnostics.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
[jira] [Commented] (YARN-9619) Transfer error AM host/ip when launching app using docker container with bridge network
[ https://issues.apache.org/jira/browse/YARN-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863616#comment-16863616 ]

caozhiqiang commented on YARN-9619:
-----------------------------------

Thanks for your comments, [~eyang]. Launching an application with a docker container allows several kinds of networks. The documentation declares that both the host network and the bridge network are allowed: [launch with docker|https://hadoop.apache.org/docs/r3.1.2/hadoop-yarn/hadoop-yarn-site/DockerContainers.html]
{code:java}
// yarn.nodemanager.runtime.linux.docker.allowed-container-networks
host,none,bridge
Optional. A comma-separated set of networks allowed when launching containers. Valid values are determined by Docker networks available from `docker network ls`
{code}
With the host network, an AM running in docker works well because the AM's IP is the same as the NM's. With the bridge network, I think that if the AM registers the correct host/IP (the real docker container IP, not the nodemanager IP) with the RM, and all hadoop components run in an overlay network (for example one deployed with flannel), it should also work well. In an overlay network, a docker container can communicate bi-directionally with any other docker container or node, so the RM and NMs can also communicate bi-directionally with an AM running in docker. I have verified this.

> Transfer error AM host/ip when launching app using docker container with
> bridge network
> ------------------------------------------------------------------------
>
>                 Key: YARN-9619
>                 URL: https://issues.apache.org/jira/browse/YARN-9619
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.3.0
>            Reporter: caozhiqiang
>            Priority: Major
>
> When launching an application using a docker container with the bridge
> network in an overlay network, the client polls the application progress
> from the ApplicationMaster using the wrong host/IP: it polls the
> nodemanager's hostname/IP, not the docker container's IP in which the AM
> is actually running.
The error message is below(the server hadoop3-1/192.168.2.105 is NM's, > not AM's docker IP, so it can't be accessed): > 2019-05-11 08:28:46,361 INFO ipc.Client: Retrying connect to server: > hadoop3-1/192.168.2.105:37963. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS) > 2019-05-11 08:28:47,363 INFO ipc.Client: Retrying connect to server: > hadoop3-1/192.168.2.105:37963. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS) > 2019-05-11 08:28:48,365 INFO ipc.Client: Retrying connect to server: > hadoop3-1/192.168.2.105:37963. Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS) > 2019-05-10 08:34:40,235 INFO mapred.ClientServiceDelegate: Application state > is completed. FinalApplicationStatus=FAILED. Redirecting to job history server > 2019-05-10 08:35:00,408 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:12020. Already tried 8 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2019-05-10 08:35:00,408 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:12020. 
Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > java.io.IOException: java.net.ConnectException: Your endpoint configuration > is wrong; For more details see: > http://wiki.apache.org/hadoop/UnsetHostnameOrPort > at > org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:345) > at > org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:430) > at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:871) > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:331) > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:328) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:328) > at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:612) > at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1629) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) > at > org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:307) > at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:360) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:368) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[jira] [Commented] (YARN-9619) Transfer error AM host/ip when launching app using docker container with bridge network
[ https://issues.apache.org/jira/browse/YARN-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863509#comment-16863509 ]

Eric Yang commented on YARN-9619:
---------------------------------

[~caozhiqiang] Sorry, I am not entirely sure that I understand the description of this problem. It seems to indicate that a mapreduce workload doesn't work with the bridge network in an overlay network. The YARN framework requires the application master to run in the same flat network as the resource manager and node manager. This ensures that bi-directional communication between the application master and the YARN framework is not blocked. An overlay network implies some level of privacy from the host network and often allows only outbound network access. With the application master running in an overlay network, the resource manager and node manager cannot have bi-directional communication with it. I don't think it is possible to run the AM in docker in the current implementation of YARN.

> Transfer error AM host/ip when launching app using docker container with
> bridge network
> ------------------------------------------------------------------------
>
>                 Key: YARN-9619
>                 URL: https://issues.apache.org/jira/browse/YARN-9619
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.3.0
>            Reporter: caozhiqiang
>            Priority: Major
>
> When launching an application using a docker container with the bridge
> network in an overlay network, the client polls the application progress
> from the ApplicationMaster using the wrong host/IP: it polls the
> nodemanager's hostname/IP, not the docker container's IP in which the AM
> is actually running. The error message is below (the server
> hadoop3-1/192.168.2.105 is the NM's, not the AM's docker IP, so it can't
> be accessed):
> 2019-05-11 08:28:46,361 INFO ipc.Client: Retrying connect to server:
> hadoop3-1/192.168.2.105:37963.
Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS) > 2019-05-11 08:28:47,363 INFO ipc.Client: Retrying connect to server: > hadoop3-1/192.168.2.105:37963. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS) > 2019-05-11 08:28:48,365 INFO ipc.Client: Retrying connect to server: > hadoop3-1/192.168.2.105:37963. Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS) > 2019-05-10 08:34:40,235 INFO mapred.ClientServiceDelegate: Application state > is completed. FinalApplicationStatus=FAILED. Redirecting to job history server > 2019-05-10 08:35:00,408 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:12020. Already tried 8 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2019-05-10 08:35:00,408 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:12020. 
Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > java.io.IOException: java.net.ConnectException: Your endpoint configuration > is wrong; For more details see: > http://wiki.apache.org/hadoop/UnsetHostnameOrPort > at > org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:345) > at > org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:430) > at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:871) > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:331) > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:328) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:328) > at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:612) > at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1629) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) > at > org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:307) > at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:360) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:368) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71) > at
[jira] [Commented] (YARN-9621) FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863472#comment-16863472 ] Hadoop QA commented on YARN-9621: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 25m 17s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} branch-3.1 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 11s{color} | {color:green} branch-3.1 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s{color} | {color:green} branch-3.1 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s{color} | {color:green} branch-3.1 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green} branch-3.1 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 17s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 48s{color} | {color:green} branch-3.1 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} branch-3.1 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 23s{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 24s{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch failed. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 24s{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch failed. {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 24s{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch failed. {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 35s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 24s{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch failed. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 25s{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 89m 31s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=18.09.5 Server=18.09.5 Image:yetus/hadoop:080e9d0f9b3 | | JIRA Issue | YARN-9621 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12971721/YARN-9621-branch-3.1.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 5f9d4c9da06f 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | branch-3.1 / fee1e67 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_212 | | findbugs | v3.1.0-RC1 | | mvninstall | https://builds.apache.org/job/PreCommit-YARN-Build/24266/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-applications-distributedshell.txt | | compile |
[jira] [Commented] (YARN-8856) TestTimelineReaderWebServicesHBaseStorage tests failing with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863471#comment-16863471 ]

Íñigo Goiri commented on YARN-8856:
-----------------------------------

[~Prabhu Joseph], backported to branch-3.2.

> TestTimelineReaderWebServicesHBaseStorage tests failing with
> NoClassDefFoundError
> ------------------------------------------------------------
>
>                 Key: YARN-8856
>                 URL: https://issues.apache.org/jira/browse/YARN-8856
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jason Lowe
>            Assignee: Sushil Ks
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: YARN-8856.001.patch
>
>
> TestTimelineReaderWebServicesHBaseStorage has been failing in nightly builds
> with NoClassDefFoundError in the tests. Sample error and stacktrace to
> follow.
[jira] [Commented] (YARN-9621) FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863409#comment-16863409 ] Prabhu Joseph commented on YARN-9621:
-
{{TestDSWithMultipleNodeManager}} does not have a tearDown method. This causes all testcases to use the same {{MiniYarnCluster}}, leading to conflicts while calculating the actual containers launched on a node by {{NMContainerMonitor}}. This issue is present only in branch-3.1, as YARN-9252 added the tearDown to branch-3.2 and 3.3.

> FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint
> on branch-3.1
> ---
>
> Key: YARN-9621
> URL: https://issues.apache.org/jira/browse/YARN-9621
> Project: Hadoop YARN
> Issue Type: Bug
> Components: distributed-shell, test
> Affects Versions: 3.1.2
> Reporter: Peter Bacsko
> Assignee: Prabhu Joseph
> Priority: Major
> Attachments: YARN-9621-branch-3.1.001.patch
>
>
> Testcase {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}}
> seems to constantly fail on branch 3.1. I believe it was introduced by YARN-9253.
> {noformat}
> testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager)  Time elapsed: 24.636 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.junit.Assert.failNotEquals(Assert.java:743)
> 	at org.junit.Assert.assertEquals(Assert.java:118)
> 	at org.junit.Assert.assertEquals(Assert.java:555)
> 	at org.junit.Assert.assertEquals(Assert.java:542)
> 	at org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}
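The effect of the missing tearDown can be illustrated with a minimal, self-contained Java sketch. The class, field, and method names below are hypothetical stand-ins, not the actual test classes; the point is only that state shared across testcases (like a {{MiniYarnCluster}} that is never torn down) makes later tests count containers launched by earlier ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for state held by a shared cluster: without a
// tearDown, it leaks from one testcase into the next.
public class TearDownSketch {
    static final List<String> launchedContainers = new ArrayList<>();

    // A "test" that launches one container and counts launches on the node.
    static int runTestWithoutTearDown() {
        launchedContainers.add("container");
        return launchedContainers.size(); // also sees launches from earlier tests
    }

    // Same "test", followed by the cleanup a tearDown method would perform.
    static int runTestWithTearDown() {
        launchedContainers.add("container");
        int observed = launchedContainers.size();
        launchedContainers.clear(); // tearDown: reset shared cluster state
        return observed;
    }

    public static void main(String[] args) {
        int first = runTestWithoutTearDown();
        int second = runTestWithoutTearDown(); // leftover state: counts 2, not 1
        launchedContainers.clear();
        int third = runTestWithTearDown();
        int fourth = runTestWithTearDown();    // isolated: still counts 1
        System.out.println(first + " " + second + " " + third + " " + fourth);
    }
}
```

This mirrors the observed assertion failure (`expected:<1> but was:<2>`): the second test counts a container launched by the first.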
[jira] [Updated] (YARN-9621) FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-9621:
Attachment: YARN-9621-branch-3.1.001.patch
[jira] [Updated] (YARN-9621) FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-9621:
Summary: FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1  (was: Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1)
[jira] [Commented] (YARN-9599) TestContainerSchedulerQueuing#testQueueShedding fails intermittently.
[ https://issues.apache.org/jira/browse/YARN-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863345#comment-16863345 ] Giovanni Matteo Fumarola commented on YARN-9599:
Committed to trunk. Thanks [~elgoiri] for the review and [~abmodi] for the patch.

> TestContainerSchedulerQueuing#testQueueShedding fails intermittently.
> -
>
> Key: YARN-9599
> URL: https://issues.apache.org/jira/browse/YARN-9599
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Abhishek Modi
> Assignee: Abhishek Modi
> Priority: Minor
> Fix For: 3.3.0
>
> Attachments: YARN-9599.001.patch, YARN-9599.002.patch,
> YARN-9599.003.patch, YARN-9599.004.patch
>
>
> TestQueueShedding fails intermittently.
> java.lang.AssertionError: expected:<6> but was:<5>
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.junit.Assert.failNotEquals(Assert.java:834)
> 	at org.junit.Assert.assertEquals(Assert.java:645)
> 	at org.junit.Assert.assertEquals(Assert.java:631)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.TestContainerSchedulerQueuing.testQueueShedding(TestContainerSchedulerQueuing.java:775)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
[jira] [Updated] (YARN-9599) TestContainerSchedulerQueuing#testQueueShedding fails intermittently.
[ https://issues.apache.org/jira/browse/YARN-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola updated YARN-9599:
---
Fix Version/s: 3.3.0
[jira] [Commented] (YARN-9599) TestContainerSchedulerQueuing#testQueueShedding fails intermittently.
[ https://issues.apache.org/jira/browse/YARN-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863343#comment-16863343 ] Hudson commented on YARN-9599:
--
FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16739 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16739/])
YARN-9599. TestContainerSchedulerQueuing#testQueueShedding fails (gifuma: rev bcfd22833633e24881891208503971c8ef59d63c)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/scheduler/TestContainerSchedulerQueuing.java
[jira] [Commented] (YARN-8499) ATS v2 Generic TimelineStorageMonitor
[ https://issues.apache.org/jira/browse/YARN-8499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863324#comment-16863324 ] Prabhu Joseph commented on YARN-8499:
-
Thanks [~snemeth] for the review. [~eyang] Can you review this Jira when you get time? It makes the timeline storage monitor generic.

> ATS v2 Generic TimelineStorageMonitor
> -
>
> Key: YARN-8499
> URL: https://issues.apache.org/jira/browse/YARN-8499
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: ATSv2
> Reporter: Sunil Govindan
> Assignee: Prabhu Joseph
> Priority: Major
> Labels: atsv2
> Attachments: YARN-8499-001.patch, YARN-8499-002.patch, YARN-8499-003.patch,
> YARN-8499-004.patch, YARN-8499-005.patch, YARN-8499-006.patch,
> YARN-8499-007.patch, YARN-8499-008.patch, YARN-8499-009.patch,
> YARN-8499-010.patch, YARN-8499-011.patch, YARN-8499-012.patch
>
>
> Post YARN-8302, HBase connection issues are handled in ATSv2. However, this
> could be made generic by introducing an API in the storage interface and
> implementing it in each of the storages as per the store semantics.
>
> cc [~rohithsharma] [~vinodkv] [~vrushalic]
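The proposal, an API on the storage interface that each backend implements per its own semantics, can be sketched roughly as follows. The interface, method, and class names here are illustrative assumptions, not the actual YARN-8499 patch:

```java
// Illustrative sketch: a generic monitor that depends only on a
// health-check API declared on the storage interface, so each backend
// (HBase, filesystem, ...) supplies its own semantics for the probe.
public class StorageMonitorSketch {
    // Hypothetical storage-interface method (the name is an assumption).
    interface TimelineStorage {
        void healthCheck() throws Exception;
    }

    static class HealthyStore implements TimelineStorage {
        public void healthCheck() { /* e.g. a connection probe that succeeds */ }
    }

    static class DownStore implements TimelineStorage {
        public void healthCheck() throws Exception {
            throw new Exception("storage unreachable");
        }
    }

    // Generic monitor logic: identical for every storage implementation.
    static String monitor(TimelineStorage storage) {
        try {
            storage.healthCheck();
            return "UP";
        } catch (Exception e) {
            return "DOWN";
        }
    }

    public static void main(String[] args) {
        System.out.println(monitor(new HealthyStore()) + " " + monitor(new DownStore()));
    }
}
```

The design point is that the monitor itself has no HBase-specific code; only the store knows how to check its own connection.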
[jira] [Assigned] (YARN-9525) IFile format is not working against s3a remote folder
[ https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reassigned YARN-9525:
--
Assignee: Adam Antal  (was: Peter Bacsko)

> IFile format is not working against s3a remote folder
> -
>
> Key: YARN-9525
> URL: https://issues.apache.org/jira/browse/YARN-9525
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: log-aggregation
> Affects Versions: 3.1.2
> Reporter: Adam Antal
> Assignee: Adam Antal
> Priority: Major
> Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch,
> YARN-9525.002.patch, YARN-9525.003.patch, YARN-9525.004.patch
>
>
> Using the IndexedFileFormat with {{yarn.nodemanager.remote-app-log-dir}}
> configured to an s3a URI throws the following exception during log
> aggregation:
> {noformat}
> Cannot create writer for app application_1556199768861_0001. Skip log upload this time.
> java.io.IOException: java.io.FileNotFoundException: No such file or directory: s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
> 	at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory: s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321)
> 	at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
> 	at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244)
> 	at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240)
> 	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
> 	at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246)
> 	at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> 	at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195)
> 	... 7 more
> {noformat}
> This stack trace points to
> {{LogAggregationIndexedFileController$initializeWriter}}, where we do the
> following steps (in a non-rolling log aggregation setup):
> - create FSDataOutputStream
> - writing out a UUID
> - flushing
> - immediately after that we call a GetFileStatus to get the length of the log
> file (the bytes we just wrote out), and that's where the failure happens:
> the file is not there yet due to eventual consistency.
> Maybe we can get rid of that, so we can use the IFile format against an s3a target.
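The write-then-immediately-stat failure described above can be modeled with a toy eventually consistent store. This is only an illustration of the consistency gap; the class and method names are assumptions, not the real s3a client:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of an eventually consistent object store: writes land in a
// backing map, but metadata lookups read a lagging "visible" view until
// propagation catches up.
public class EventualConsistencySketch {
    static class EventuallyConsistentStore {
        private final Map<String, Integer> backing = new HashMap<>();
        private final Map<String, Integer> visible = new HashMap<>();

        void write(String path, int length) {
            backing.put(path, length);
        }

        // Like getFileStatus: may not yet see a just-written object (null).
        Integer getFileStatus(String path) {
            return visible.get(path);
        }

        // Simulates the store becoming consistent some time later.
        void propagate() {
            visible.putAll(backing);
        }
    }

    public static void main(String[] args) {
        EventuallyConsistentStore store = new EventuallyConsistentStore();
        String log = "s3a://bucket/logs/app_0001/node_8041"; // illustrative path
        store.write(log, 36);                         // write the UUID and flush
        Integer immediate = store.getFileStatus(log); // null: not visible yet
        store.propagate();
        Integer later = store.getFileStatus(log);     // visible after the lag
        System.out.println(immediate + " " + later);
    }
}
```

Tracking the written length locally instead of stat-ing the file right after the flush would sidestep the gap, which is the direction the description suggests.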
[jira] [Commented] (YARN-8856) TestTimelineReaderWebServicesHBaseStorage tests failing with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863287#comment-16863287 ] Prabhu Joseph commented on YARN-8856:
-
[~Sushil-K-S] [~elgoiri] The testcases are failing in branch-3.2 as well; can we backport this patch to branch-3.2? The patch works fine on branch-3.2.
[jira] [Resolved] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
[ https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-9622.
-
Resolution: Duplicate

> All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
> -
>
> Key: YARN-9622
> URL: https://issues.apache.org/jira/browse/YARN-9622
> Project: Hadoop YARN
> Issue Type: Bug
> Components: timelineserver, timelineservice
> Affects Versions: 3.2.0
> Reporter: Peter Bacsko
> Assignee: Prabhu Joseph
> Priority: Major
>
> When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage,
> the result is the following:
> {noformat}
> [ERROR] Failures:
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Not Found
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Not Found
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Bad Request
> [ERROR] Errors:
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunApps:1984->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO
> [ERROR] >
[jira] [Commented] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
[ https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863280#comment-16863280 ] Prabhu Joseph commented on YARN-9622:
-
[~pbacsko] This issue is fixed by YARN-8856 in trunk. It also needs to be backported to branch-3.2.
[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder
[ https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863230#comment-16863230 ] Hadoop QA commented on YARN-9525: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 38s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 20s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 37s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 45s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 21s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: The patch generated 2 new + 9 unchanged - 0 fixed = 11 total (was 9) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 14s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 2s{color} | {color:green} hadoop-yarn-common in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 37s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 59m 31s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=18.09.5 Server=18.09.5 Image:yetus/hadoop:bdbca0e53b4 | | JIRA Issue | YARN-9525 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12971702/YARN-9525.004.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux a3fa58851c51 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 940bcf0 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_212 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/24265/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24265/testReport/ | | Max. process+thread count | 309 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common U:
[jira] [Commented] (YARN-6055) ContainersMonitorImpl need be adjusted when NM resource changed.
[ https://issues.apache.org/jira/browse/YARN-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863228#comment-16863228 ] Pradeep Ambati commented on YARN-6055: -- Patch looks good. +1 > ContainersMonitorImpl need be adjusted when NM resource changed. > > > Key: YARN-6055 > URL: https://issues.apache.org/jira/browse/YARN-6055 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, scheduler >Reporter: Junping Du >Assignee: Íñigo Goiri >Priority: Major > Attachments: YARN-6055.000.patch, YARN-6055.001.patch, > YARN-6055.002.patch, YARN-6055.003.patch, YARN-6055.004.patch > > > Per Ravi's comments in YARN-4832, we need to check some limits in > containerMonitorImpl to make sure they also get updated when the NM resource is updated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder
[ https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863207#comment-16863207 ] Peter Bacsko commented on YARN-9525: That's a nice finding [~adam.antal]. Looks like I oversimplified it a bit with the POC. > IFile format is not working against s3a remote folder > - > > Key: YARN-9525 > URL: https://issues.apache.org/jira/browse/YARN-9525 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation >Affects Versions: 3.1.2 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch, > YARN-9525.002.patch, YARN-9525.003.patch, YARN-9525.004.patch > > > Using the IndexedFileFormat with {{yarn.nodemanager.remote-app-log-dir}} > configured to an s3a URI throws the following exception during log > aggregation: > {noformat} > Cannot create writer for app application_1556199768861_0001. Skip log upload > this time. > java.io.IOException: java.io.FileNotFoundException: No such file or > directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: No such file or directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321) > at > org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195) > ... 7 more > {noformat} > This stack trace points to > {{LogAggregationIndexedFileController$initializeWriter}}, where we do the > following steps (in a non-rolling log aggregation setup): > - create an FSDataOutputStream > - write out a UUID > - flush > - immediately after that, call getFileStatus to get the length of the log > file (the bytes we just wrote out), and that's where the failure happens: > the file is not there yet due to eventual consistency. 
> Maybe we can get rid of that, so we can use the IFile format against an s3a target.
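As a rough illustration of the failure mode in this issue, the create-write-flush-then-stat sequence can be replayed against a toy model of an eventually consistent store. The classes and paths below are made up for illustration (they are not the Hadoop FileContext or S3A APIs); the sketch only models why an immediate getFileStatus call can miss a freshly written object:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of an eventually consistent object store: writes land in a
// pending buffer and only become visible to metadata queries after a
// separate "propagation" step, mimicking the visibility delay on s3a.
class EventuallyConsistentStore {
    private final Map<String, byte[]> visible = new HashMap<>();
    private final Map<String, byte[]> pending = new HashMap<>();

    void writeAndFlush(String path, byte[] data) { pending.put(path, data); }

    void propagate() { visible.putAll(pending); pending.clear(); }

    // Stands in for getFileStatus(): fails while the object is still pending.
    long getLen(String path) {
        byte[] data = visible.get(path);
        if (data == null) {
            throw new IllegalStateException("No such file or directory: " + path);
        }
        return data.length;
    }
}

public class EventualConsistencySketch {
    public static void main(String[] args) {
        EventuallyConsistentStore store = new EventuallyConsistentStore();
        String logFile = "s3a://bucket/logs/application_0001/node_8041";

        store.writeAndFlush(logFile, new byte[36]);     // write the UUID, flush
        try {
            store.getLen(logFile);                      // immediate stat, as initializeWriter does
        } catch (IllegalStateException e) {
            System.out.println("stat failed: " + e.getMessage());
        }

        store.propagate();                              // later, the object becomes visible
        System.out.println("length: " + store.getLen(logFile));
    }
}
```

Avoiding the immediate stat altogether, for example by tracking the stream position locally, sidesteps the window in which the object is not yet visible.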
[jira] [Commented] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition
[ https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863205#comment-16863205 ] Hadoop QA commented on YARN-9209: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 34s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 30s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 27s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 11 unchanged - 0 fixed = 12 total (was 11) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 32s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 87m 51s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}141m 36s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=18.09.5 Server=18.09.5 Image:yetus/hadoop:bdbca0e53b4 | | JIRA Issue | YARN-9209 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12956277/YARN-9209.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 633fbf6b1513 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 940bcf0 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_212 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/24264/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24264/testReport/ | | Max. process+thread count | 868
[jira] [Updated] (YARN-9525) IFile format is not working against s3a remote folder
[ https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Antal updated YARN-9525: - Attachment: YARN-9525.004.patch > IFile format is not working against s3a remote folder > - > > Key: YARN-9525 > URL: https://issues.apache.org/jira/browse/YARN-9525 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation >Affects Versions: 3.1.2 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch, > YARN-9525.002.patch, YARN-9525.003.patch, YARN-9525.004.patch
[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder
[ https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863169#comment-16863169 ] Adam Antal commented on YARN-9525: -- Sorry for the delayed answer, let me recap my current progress.

I ran the integration tests multiple times in every scenario just to have a decent knowledge of what we're dealing with. The tests were passing against a remote folder in s3, so I thought the patch was OK, but I checked the existing behaviour (the HDFS remote app dir case) as well, per [~wangda]'s last comment. Though IFile is reported to succeed in aggregating logs in those scenarios, during rolling log aggregation I have problems trying to access the logs through the logs CLI (reading through the associated file controller). It does not display any error, it just returns bad parts of the log: in my case, I ran a sleep job in the child container and its logs are mixed up with the AM's logs when I try to read them.

I compiled some debug messages into the hadoop-yarn-common jar and ran the tests again. It turns out the offset was miscalculated (due to the patch, obviously): in the case of the regular HDFS remote dir, when we read back the logs, we read with the wrong offset into the aggregated file, so the logs get messed up. The length was OK, though - it tried to read the correct number of bytes, but starting from a bad position. The funny thing is that the patch works excellently against s3a, so I had to dig a bit further and found the following.

Pre-patch, when:
- an HDFS path is set as the remote app folder,
- we're in a rolling log aggregation situation, and
- there was already a rolling session,

then during the next rolling session there is no rollover (if the file is not big enough), so no new file is generated. A new OutputStream is still created targeting the existing file in append mode, but this time the "cursor" points to the end of the file. Detecting this (after writing the dummyBytes, flushing, and checking the just-written bytes), the currentOffset is set to 0.

After applying the patch: again, there is no rollover, hence the local boolean variable createdNew is set to false, and the currentOffset is set according to the following piece of code:
{noformat}
currentOffSet = fc.getFileStatus(aggregatedLogFile).getLen();
{noformat}
which is wrong - it has to be zero, as before. The "cursor" still points to the end of the file, while the code thinks it also has to be pushed/offset by the current length of the file. That information is written to the index part, so when we read the file back, we display bad bytes, shifted by that many bytes.

The solution is simple: for cloud remote app folders the rollover will be set to 0 (see the related jira: YARN-9607), so a new file will always be created. (This is unavoidable, as append is not available.) So we should first check whether createdNew is true and only touch getFileStatus if it's false:
- if there's no append, we're fine, because a new file will always be created, so the boolean will always be true and the offset will always be zero (we start writing from the beginning of the new empty file every rollover session);
- if there is append, we fall back to the currently existing behaviour: if createdNew is true, then we're good; if it's not, then we default to the existing behaviour.

Uploaded a new patch which addresses the comment above (actually it's just an extra if), and I hope this investigation is clear and makes sense. Setting the rollover to zero for non-appendable filesystems will be addressed in YARN-9607, but this patch makes sense without that, so the issues do not depend on each other.

Reacting to [~ste...@apache.org]'s and [~tmarquardt]'s comments:
{quote}Good point. Would it actually be possible to pull this out into something you could actually make a standalone test against a filesystem?{quote}
Well, it seems that it can hardly be modularised that way - so a simple "extract a few lines of code" approach to testing is not really applicable. I can see a possible solution though: re-reading the code, collecting all the prerequisites or implicit things that IFile is using, and putting them into an FSContract-based test. Is that what you were originally thinking?
{quote}getPos does seem a better strategy here. Adam: what do you think?{quote}
It makes sense to change this (use getPos), but I don't know how the existing behaviour (HDFS) would change. I will test that as well, but I was pretty occupied figuring out the above. It seems HDFS is a bit hardwired into this, but at this point my integration tests are passing, which is a good sign. Please review if you can spare some time, and ask any questions that you may have - I will make an attempt to clarify. > IFile format is not working against s3a remote folder >
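The "extra if" described in the comment above can be sketched as a standalone helper. The names createdNew and currentOffSet come from that comment; everything else here (the method shape, the probe interface) is a simplification for illustration, not the actual LogAggregationIndexedFileController code:

```java
// Sketch of the offset decision described in the comment above. On a freshly
// created file the write starts at position 0, so the index offset must be 0
// and getFileStatus() must not be consulted at all; only append-capable
// filesystems (e.g. HDFS reusing a file across rollover sessions) fall back
// to the pre-existing detection logic.
public class OffsetFixSketch {

    /** Hypothetical stand-in for the pre-patch append-mode offset detection. */
    interface AppendOffsetProbe {
        long detect();
    }

    static long currentOffSet(boolean createdNew, AppendOffsetProbe probe) {
        if (createdNew) {
            // Always the case on s3a once rollover is 0 (YARN-9607): a new
            // file every session, so the write begins at offset 0.
            return 0L;
        }
        // Append mode: defer to the existing behaviour, whatever it yields.
        return probe.detect();
    }

    public static void main(String[] args) {
        // New file each rollover session (non-appendable store): offset 0,
        // and the probe (which might call getFileStatus) is never invoked.
        System.out.println(currentOffSet(true, () -> {
            throw new AssertionError("must not stat a brand-new file");
        }));
        // Append mode on HDFS: the pre-patch detection decides the offset.
        System.out.println(currentOffSet(false, () -> 0L));
    }
}
```

The point of the guard is exactly what the comment argues: on a brand-new file the length-based offset is both unnecessary and, on an eventually consistent store, unsafe to query.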
[jira] [Assigned] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
[ https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph reassigned YARN-9622: --- Assignee: Prabhu Joseph > All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2 > - > > Key: YARN-9622 > URL: https://issues.apache.org/jira/browse/YARN-9622 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver, timelineservice >Affects Versions: 3.2.0 >Reporter: Peter Bacsko >Assignee: Prabhu Joseph >Priority: Major
[jira] [Assigned] (YARN-9621) Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph reassigned YARN-9621: --- Assignee: Prabhu Joseph > Test failure > TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on > branch-3.1 > > > Key: YARN-9621 > URL: https://issues.apache.org/jira/browse/YARN-9621 > Project: Hadoop YARN > Issue Type: Bug > Components: distributed-shell, test >Affects Versions: 3.1.2 >Reporter: Peter Bacsko >Assignee: Prabhu Joseph >Priority: Major > > Testcase > {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} > seems to constantly fail on branch 3.1. I believe it was introduced by > YARN-9253. > {noformat} > testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager) > Time elapsed: 24.636 s <<< FAILURE! > java.lang.AssertionError: expected:<1> but was:<2> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
[ https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863128#comment-16863128 ] Prabhu Joseph commented on YARN-9622: - Will work on this, assigning to me. > All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2 > - > > Key: YARN-9622 > URL: https://issues.apache.org/jira/browse/YARN-9622 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver, timelineservice >Affects Versions: 3.2.0 >Reporter: Peter Bacsko >Priority: Major > > When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, > the result is the following: > {noformat} > [ERROR] Failures: > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 > Response from server should have been Not Found > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 > Response from server should have been Not Found > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 > Response from server should have been Bad Request > [ERROR] Errors: > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > 
TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > 
TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunApps:1984->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] >
[jira] [Updated] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
[ https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9622: --- Affects Version/s: 3.2.0 > All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2 > - > > Key: YARN-9622 > URL: https://issues.apache.org/jira/browse/YARN-9622 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver, timelineservice >Affects Versions: 3.2.0 >Reporter: Peter Bacsko >Priority: Major > > When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, > the result is the following: > {noformat} > [ERROR] Failures: > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 > Response from server should have been Not Found > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 > Response from server should have been Not Found > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 > Response from server should have been Bad Request > [ERROR] Errors: > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > 
TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > 
TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunApps:1984->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] >
[jira] [Created] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage
Peter Bacsko created YARN-9622: -- Summary: All testcase fails in TestTimelineReaderWebServicesHBaseStorage Key: YARN-9622 URL: https://issues.apache.org/jira/browse/YARN-9622 Project: Hadoop YARN Issue Type: Bug Components: timelineserver, timelineservice Reporter: Peter Bacsko When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, the result is the following: {noformat} [ERROR] Failures: [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Not Found [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Not Found [ERROR] TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Bad Request [ERROR] Errors: [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] 
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] 
TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunApps:1984->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunAppsNotPresent:2235->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRuns:488->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunsMetricsToRetrieve:616->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR]
[jira] [Updated] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2
[ https://issues.apache.org/jira/browse/YARN-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9622: --- Summary: All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2 (was: All testcase fails in TestTimelineReaderWebServicesHBaseStorage) > All testcase fails in TestTimelineReaderWebServicesHBaseStorage on branch-3.2 > - > > Key: YARN-9622 > URL: https://issues.apache.org/jira/browse/YARN-9622 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver, timelineservice >Reporter: Peter Bacsko >Priority: Major > > When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, > the result is the following: > {noformat} > [ERROR] Failures: > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 > Response from server should have been Not Found > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 > Response from server should have been Not Found > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 > Response from server should have been Bad Request > [ERROR] Errors: > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > 
TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > 
TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] > TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129 > » IO > [ERROR] >
[jira] [Commented] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition
[ https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863066#comment-16863066 ] Tarun Parimi commented on YARN-9209: Hi [~cheersyang], [~leftnoteasy], is there a way to proceed further towards a proper fix for this pending jira? The current patch fixes the issue, but I guess additional checks are needed. Thanks. > When nodePartition is not set in Placement Constraints, containers are > allocated only in default partition > -- > > Key: YARN-9209 > URL: https://issues.apache.org/jira/browse/YARN-9209 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-9209.001.patch > > > When an application sets a placement constraint without specifying a > nodePartition, the default partition is always chosen as the constraint when > allocating containers. This can be a problem when an application is > submitted to a queue which doesn't have enough capacity available on the > default partition. > This is a common scenario when node labels are configured for a particular > queue. The below sample sleeper service cannot get even a single container > allocated when it is submitted to a "labeled_queue", even though enough > capacity is available on the label/partition configured for the queue. Only > the AM container runs. > {code:java}{ > "name": "sleeper-service", > "version": "1.0.0", > "queue": "labeled_queue", > "components": [ > { > "name": "sleeper", > "number_of_containers": 2, > "launch_command": "sleep 9", > "resource": { > "cpus": 1, > "memory": "4096" > }, > "placement_policy": { > "constraints": [ > { > "type": "ANTI_AFFINITY", > "scope": "NODE", > "target_tags": [ > "sleeper" > ] > } > ] > } > } > ] > } > {code} > It runs fine if I specify the node_partition explicitly in the constraints > like below. 
> {code:java} > { > "name": "sleeper-service", > "version": "1.0.0", > "queue": "labeled_queue", > "components": [ > { > "name": "sleeper", > "number_of_containers": 2, > "launch_command": "sleep 9", > "resource": { > "cpus": 1, > "memory": "4096" > }, > "placement_policy": { > "constraints": [ > { > "type": "ANTI_AFFINITY", > "scope": "NODE", > "target_tags": [ > "sleeper" > ], > "node_partitions": [ > "label" > ] > } > ] > } > } > ] > } > {code} > The problem seems to be because only the default partition "" is considered > when node_partition constraint is not specified as seen in below RM log. > {code:java} > 2019-01-17 16:51:59,921 INFO placement.SingleConstraintAppPlacementAllocator > (SingleConstraintAppPlacementAllocator.java:validateAndSetSchedulingRequest(367)) > - Successfully added SchedulingRequest to > app=appattempt_1547734161165_0010_01 targetAllocationTags=[sleeper]. > nodePartition= > {code} > However, I think it makes more sense to consider "*" or the > {{default-node-label-expression}} of the queue if configured, when no > node_partition is specified in the placement constraint. Since not specifying > any node_partition should ideally mean we don't enforce placement constraints > on any node_partition. However we are enforcing the default partition instead > now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
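The fallback the reporter proposes — use the queue's {{default-node-label-expression}} when no node_partition is given, and "*" when neither is configured — can be sketched as a small resolution helper. This is a minimal illustration of the proposed behavior, not the actual YARN code path; the class and method names here (NodePartitionResolver, resolveNodePartition) are hypothetical and do not exist in Hadoop, where the current logic in SingleConstraintAppPlacementAllocator instead falls through to the default partition "".

```java
import java.util.Set;

// Hypothetical sketch of the fallback order suggested in the comment above:
// explicit node_partitions -> queue's default-node-label-expression -> "*" (any partition).
public class NodePartitionResolver {

    public static String resolveNodePartition(Set<String> requestedPartitions,
                                              String queueDefaultExpression) {
        if (requestedPartitions != null && !requestedPartitions.isEmpty()) {
            // The application named a partition explicitly; honor it.
            return requestedPartitions.iterator().next();
        }
        if (queueDefaultExpression != null && !queueDefaultExpression.isEmpty()) {
            // No explicit partition: fall back to the queue's configured default label.
            return queueDefaultExpression;
        }
        // Nothing configured anywhere: do not restrict placement to any partition.
        return "*";
    }

    public static void main(String[] args) {
        System.out.println(resolveNodePartition(Set.of("label"), null)); // explicit wins
        System.out.println(resolveNodePartition(Set.of(), "gpu"));       // queue default
        System.out.println(resolveNodePartition(null, null));            // unrestricted
    }
}
```

Under this scheme, the sleeper service above would land on the queue's labeled partition without the application having to repeat "node_partitions" in every constraint.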
[jira] [Commented] (YARN-9621) Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863059#comment-16863059 ] Peter Bacsko commented on YARN-9621: [~Prabhu Joseph] do you know how to fix this? Branch 3.1 is still active, so would be good to handle it. > Test failure > TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on > branch-3.1 > > > Key: YARN-9621 > URL: https://issues.apache.org/jira/browse/YARN-9621 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: Peter Bacsko >Priority: Major > > Testcase > {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} > seems to constantly fail on branch 3.1. I believe it was introduced by > YARN-9253. > {noformat} > testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager) > Time elapsed: 24.636 s <<< FAILURE! > java.lang.AssertionError: expected:<1> but was:<2> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9621) Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9621: --- Component/s: test distributed-shell > Test failure > TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on > branch-3.1 > > > Key: YARN-9621 > URL: https://issues.apache.org/jira/browse/YARN-9621 > Project: Hadoop YARN > Issue Type: Bug > Components: distributed-shell, test >Affects Versions: 3.1.2 >Reporter: Peter Bacsko >Priority: Major > > Testcase > {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} > seems to constantly fail on branch 3.1. I believe it was introduced by > YARN-9253. > {noformat} > testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager) > Time elapsed: 24.636 s <<< FAILURE! > java.lang.AssertionError: expected:<1> but was:<2> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9621) Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1
Peter Bacsko created YARN-9621: -- Summary: Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1 Key: YARN-9621 URL: https://issues.apache.org/jira/browse/YARN-9621 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.2 Reporter: Peter Bacsko Testcase {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} seems to constantly fail on branch 3.1. I believe it was introduced by YARN-9253. {noformat} testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager) Time elapsed: 24.636 s <<< FAILURE! java.lang.AssertionError: expected:<1> but was:<2> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, 
e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4042) YARN registry should handle the absence of ZK node
[ https://issues.apache.org/jira/browse/YARN-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863058#comment-16863058 ] wangxiangchun commented on YARN-4042: - Hi, may I ask how you solved this problem? I ran into the same issue and followed the suggested answer of deleting the version-2 file in zkdata, but that did not solve it. Could you share your experience? > YARN registry should handle the absence of ZK node > -- > > Key: YARN-4042 > URL: https://issues.apache.org/jira/browse/YARN-4042 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sergey Shelukhin >Priority: Major > > {noformat} > 2015-08-10 11:33:46,931 WARN [LlapSchedulerNodeEnabler] > rm.LlapTaskSchedulerService: Could not refresh list of active instances > org.apache.hadoop.fs.PathNotFoundException: > `/registry/users/huzheng/services/org-apache-hive/llap0/components/workers/worker-25': > No such file or directory: KeeperErrorCode = NoNode for > /registry/users/huzheng/services/org-apache-hive/llap0/components/workers/worker-25 > at > org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:377) > at > org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:360) > at > org.apache.hadoop.registry.client.impl.zk.CuratorService.zkRead(CuratorService.java:720) > at > org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.resolve(RegistryOperationsService.java:120) > at > org.apache.hadoop.registry.client.binding.RegistryUtils.extractServiceRecords(RegistryUtils.java:321) > at > org.apache.hadoop.registry.client.binding.RegistryUtils.listServiceRecords(RegistryUtils.java:177) > at > org.apache.hadoop.hive.llap.daemon.registry.impl.LlapYarnRegistryImpl$DynamicServiceInstanceSet.refresh(LlapYarnRegistryImpl.java:278) > at > org.apache.tez.dag.app.rm.LlapTaskSchedulerService.refreshInstances(LlapTaskSchedulerService.java:584) > at > 
org.apache.tez.dag.app.rm.LlapTaskSchedulerService.access$900(LlapTaskSchedulerService.java:79) > at > org.apache.tez.dag.app.rm.LlapTaskSchedulerService$NodeEnablerCallable.call(LlapTaskSchedulerService.java:887) > at > org.apache.tez.dag.app.rm.LlapTaskSchedulerService$NodeEnablerCallable.call(LlapTaskSchedulerService.java:855) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.zookeeper.KeeperException$NoNodeException: > KeeperErrorCode = NoNode for > /registry/users/huzheng/services/org-apache-hive/llap0/components/workers/worker-25 > at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) > at > org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302) > at > org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291) > at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) > at > org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:288) > at > org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279) > at > org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41) > at > org.apache.hadoop.registry.client.impl.zk.CuratorService.zkRead(CuratorService.java:718) > ... 12 more > {noformat} > ZK nodes can disappear after listing, for example ephemeral node can be > cleaned up. YARN registry should handle that. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9607) Auto-configuring rollover-size of IFile format for non-appendable filesystems
[ https://issues.apache.org/jira/browse/YARN-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862988#comment-16862988 ] Szilard Nemeth commented on YARN-9607: -- Oh, one more thing: since you are overriding the config in LogAggregationIndexedFileController.getRollOverLogMaxSize, you should at least log a statement that the value was overridden, so users are not confused about why it ended up being zero. Do you agree [~adam.antal]? > Auto-configuring rollover-size of IFile format for non-appendable filesystems > - > > Key: YARN-9607 > URL: https://issues.apache.org/jira/browse/YARN-9607 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Attachments: YARN-9607.001.patch > > > In YARN-9525, we made IFile format compatible with remote folders with s3a > scheme. In rolling fashioned log-aggregation IFile still fails with the > "append is not supported" error message, which is a known limitation of the > format by design. > There is a workaround though: setting the rollover size in the configuration > of the IFile format, in each rolling cycle a new aggregated log file will be > created, thus we eliminated the append from the process. Setting this config > globally would cause performance problems in the regular log-aggregation, so > I'm suggesting to enforcing this config to zero, if the scheme of the URI is > s3a (or any other non-appendable filesystem). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9607) Auto-configuring rollover-size of IFile format for non-appendable filesystems
[ https://issues.apache.org/jira/browse/YARN-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862984#comment-16862984 ] Szilard Nemeth commented on YARN-9607: -- Hi [~adam.antal]! 2 comments: 1. In the test code, you are testing IFile format but the error messages are wrong, like: "TFile controller... " Those should start with IFile, right? 2. With the code where you are checking for the non-appendable schemes: I would put the non-appendable scheme strings into a Set and simply check if the string is in the set or not, so the if condition could be more straightforward. > Auto-configuring rollover-size of IFile format for non-appendable filesystems > - > > Key: YARN-9607 > URL: https://issues.apache.org/jira/browse/YARN-9607 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Attachments: YARN-9607.001.patch > > > In YARN-9525, we made IFile format compatible with remote folders with s3a > scheme. In rolling fashioned log-aggregation IFile still fails with the > "append is not supported" error message, which is a known limitation of the > format by design. > There is a workaround though: setting the rollover size in the configuration > of the IFile format, in each rolling cycle a new aggregated log file will be > created, thus we eliminated the append from the process. Setting this config > globally would cause performance problems in the regular log-aggregation, so > I'm suggesting to enforcing this config to zero, if the scheme of the URI is > s3a (or any other non-appendable filesystem). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
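The Set-based scheme check suggested in the review above could look roughly like the sketch below. The class name, method shape, and scheme list are illustrative assumptions for this comment thread, not the actual YARN-9607 patch.

```java
import java.util.Set;

// Hypothetical sketch of the reviewer's suggestion: keep the schemes of
// non-appendable filesystems in a Set and force the IFile rollover size
// to zero when the remote log directory uses one of them.
public class RolloverSizeHelper {
    // Assumed set of schemes whose filesystems do not support append;
    // the real list would come from the patch, not from this sketch.
    private static final Set<String> NON_APPENDABLE_SCHEMES =
        Set.of("s3a", "wasb", "abfs");

    /**
     * Returns 0 (forcing a new aggregated log file each rolling cycle,
     * which avoids append) for non-appendable schemes, otherwise the
     * configured rollover size.
     */
    static long getRollOverLogMaxSize(String scheme, long configuredSize) {
        if (NON_APPENDABLE_SCHEMES.contains(scheme)) {
            // Per the review comment: log the override so users are not
            // surprised that the effective value is zero.
            System.out.println(
                "Overriding rollover size to 0 for non-appendable scheme: "
                    + scheme);
            return 0L;
        }
        return configuredSize;
    }

    public static void main(String[] args) {
        System.out.println(getRollOverLogMaxSize("s3a", 1024L));
        System.out.println(getRollOverLogMaxSize("hdfs", 1024L));
    }
}
```

With the membership test in one place, the if condition stays a single readable check even as more non-appendable schemes are added.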
[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-8995: --- Attachment: TestStreamPerf.java > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-8995: --- Attachment: TestStreamPerf.java > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-8995.001.patch, YARN-8995.002.patch, > YARN-8995.003.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-8995: --- Attachment: (was: TestStreamPerf.java) > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-8995.001.patch, YARN-8995.002.patch, > YARN-8995.003.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862866#comment-16862866 ] Tao Yang edited comment on YARN-8995 at 6/13/19 9:29 AM: - I did a simple test (details in TestStreamPerf.java) on performance comparison between sequential stream and parallel stream in a similar scenario: count a blocking queue with 100 distinct keys and 1w/10w/100w/200w total length, it seems that parallel stream indeed lead to more overhead than sequential stream, results of this test are as follows (suffix "_S" refers to sequential stream and suffix "_PS" refers to parallel stream): {noformat} TestStreamPerf.test_100_100w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.03 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 1, GC.time: 0.01, time.total: 0.64, time.warmup: 0.31, time.bench: 0.32 TestStreamPerf.test_100_100w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.02 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.37, time.warmup: 0.15, time.bench: 0.22 TestStreamPerf.test_100_10w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.08, time.warmup: 0.05, time.bench: 0.04 TestStreamPerf.test_100_10w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.04, time.warmup: 0.01, time.bench: 0.03 TestStreamPerf.test_100_1w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.01 TestStreamPerf.test_100_1w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], 
round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.00 TestStreamPerf.test_100_200w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.07 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 1.03, time.warmup: 0.37, time.bench: 0.66 TestStreamPerf.test_100_200w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.04 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.70, time.warmup: 0.25, time.bench: 0.45 {noformat} was (Author: tao yang): I did a simple test on performance comparison between sequential stream and parallel stream in a similar scenario: count a blocking queue with 100 distinct keys and 1w/10w/100w/200w total length, it seems that parallel stream indeed lead to more overhead than sequential stream, results of this test are as follows (suffix "_S" refers to sequential stream and suffix "_PS" refers to parallel stream): {noformat} TestStreamPerf.test_100_1w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00 TestStreamPerf.test_100_1w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.01 TestStreamPerf.test_100_10w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.04, time.warmup: 0.01, time.bench: 0.03 TestStreamPerf.test_100_10w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, 
GC.time: 0.00, time.total: 0.14, time.warmup: 0.09, time.bench: 0.05 TestStreamPerf.test_100_100w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.03 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.43, time.warmup: 0.17, time.bench: 0.26 TestStreamPerf.test_100_100w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.04 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.56, time.warmup: 0.20, time.bench: 0.36 TestStreamPerf.test_100_200w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.05 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.75, time.warmup: 0.25, time.bench: 0.50 TestStreamPerf.test_100_200w_PS: [measured 10 out of
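The scenario measured above can be sketched as follows. This is a hedged stand-in for the attached TestStreamPerf.java (which is not reproduced here): it counts occurrences of 100 distinct keys in a queue once with a sequential and once with a parallel stream, illustrating where the parallel variant's fork/join split-and-merge overhead comes from.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;

// Illustrative sketch, not the attached benchmark: compare sequential
// and parallel stream counting over a blocking queue of keys.
public class StreamCountSketch {

    // Count how often each key occurs, sequentially or in parallel.
    static Map<Integer, Long> countKeys(Collection<Integer> keys,
                                        boolean parallel) {
        return (parallel ? keys.parallelStream() : keys.stream())
            .collect(Collectors.groupingBy(k -> k, Collectors.counting()));
    }

    public static void main(String[] args) {
        // 100 distinct keys, 1,000,000 entries, as in the test above.
        LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 1_000_000; i++) {
            queue.add(ThreadLocalRandom.current().nextInt(100));
        }

        long t0 = System.nanoTime();
        Map<Integer, Long> seq = countKeys(queue, false);
        long seqMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        Map<Integer, Long> par = countKeys(queue, true);
        long parMs = (System.nanoTime() - t0) / 1_000_000;

        // Both produce identical counts; for a CPU-light counting task
        // the parallel run tends to pay extra split/merge cost.
        System.out.println("counts equal: " + seq.equals(par)
            + ", sequential: " + seqMs + " ms, parallel: " + parMs + " ms");
    }
}
```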
[jira] [Commented] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.
[ https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862867#comment-16862867 ] Abhishek Modi commented on YARN-9608: - Thanks [~tangzhankun] for reviewing it. > DecommissioningNodesWatcher should get lists of running applications on node > from RMNode. > - > > Key: YARN-9608 > URL: https://issues.apache.org/jira/browse/YARN-9608 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9608.001.patch, YARN-9608.002.patch > > > At present, DecommissioningNodesWatcher tracks list of running applications > and triggers decommission of nodes when all the applications that ran on the > node completes. This Jira proposes to solve following problem: > # DecommissioningNodesWatcher skips tracking application containers on a > particular node before the node is in DECOMMISSIONING state. It only tracks > containers once the node is in DECOMMISSIONING state. This can lead to > shuffle data loss of apps whose containers ran on this node before it was > moved to decommissioning state. > # It is keeping track of running apps. We can leverage this directly from > RMNode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862866#comment-16862866 ] Tao Yang commented on YARN-8995: I did a simple test on performance comparison between sequential stream and parallel stream in a similar scenario: count a blocking queue with 100 distinct keys and 1w/10w/100w/200w total length, it seems that parallel stream indeed lead to more overhead than sequential stream, results of this test are as follows (suffix "_S" refers to sequential stream and suffix "_PS" refers to parallel stream): {noformat} TestStreamPerf.test_100_1w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00 TestStreamPerf.test_100_1w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.01 TestStreamPerf.test_100_10w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.04, time.warmup: 0.01, time.bench: 0.03 TestStreamPerf.test_100_10w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.00 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.14, time.warmup: 0.09, time.bench: 0.05 TestStreamPerf.test_100_100w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.03 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.43, time.warmup: 0.17, time.bench: 0.26 TestStreamPerf.test_100_100w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.04 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], 
GC.calls: 0, GC.time: 0.00, time.total: 0.56, time.warmup: 0.20, time.bench: 0.36 TestStreamPerf.test_100_200w_S: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.05 [+- 0.00], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.75, time.warmup: 0.25, time.bench: 0.50 TestStreamPerf.test_100_200w_PS: [measured 10 out of 15 rounds, threads: 1 (sequential)] round: 0.07 [+- 0.01], round.block: 0.01 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 1.06, time.warmup: 0.35, time.bench: 0.71 {noformat} > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-8995.001.patch, YARN-8995.002.patch, > YARN-8995.003.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.
[ https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862826#comment-16862826 ] Zhankun Tang commented on YARN-9608: [~abmodi], Yeah. Thanks for the explanation! +1 from me. I can help to commit this if no one opposes. > DecommissioningNodesWatcher should get lists of running applications on node > from RMNode. > - > > Key: YARN-9608 > URL: https://issues.apache.org/jira/browse/YARN-9608 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9608.001.patch, YARN-9608.002.patch > > > At present, DecommissioningNodesWatcher tracks list of running applications > and triggers decommission of nodes when all the applications that ran on the > node completes. This Jira proposes to solve following problem: > # DecommissioningNodesWatcher skips tracking application containers on a > particular node before the node is in DECOMMISSIONING state. It only tracks > containers once the node is in DECOMMISSIONING state. This can lead to > shuffle data loss of apps whose containers ran on this node before it was > moved to decommissioning state. > # It is keeping track of running apps. We can leverage this directly from > RMNode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862821#comment-16862821 ] Tao Yang edited comment on YARN-8995 at 6/13/19 8:27 AM: - Thanks [~zhuqi] for updating the patch. Comments for the new patch: * Sorry to have made a mistake in my last comment, serviceInit is a more proper place to initialize conf, then you can remove the initial value for detailsInterval field. * There's no need to separate name with double "\_" for "...EVENTS__INFO...", "...EVENTS_INFO..." is ok. The annotation "The interval thousands of queue size" can be replaced as "The interval of queue size (in thousands)". * For parallelStream, overhead is involved in splitting the work among several threads and joining or merging the results, I prefer using sequential stream in this scenario which has no I/O operations and only need to count for event types. Moreover, we can use groupingBy API like this: {{eventQueue.stream().collect(Collectors.groupingBy(e -> e.getType(), Collectors.counting()))}}, instead of calling Collectors#toConcurrentMap or Collectors#toMap. was (Author: tao yang): Thanks [~zhuqi] for updating the patch. Comments for the new patch: * Sorry to have made a mistake in my last comment, serviceInit is a more proper place to initialize conf, then you can remove the initial value for detailsInterval field. * There's no need to separate name with double "_" for "...EVENTS__INFO...", "...EVENTS_INFO..." is ok. The annotation "The interval thousands of ..." can be replaced as "The interval of ... (in thousands)". * For parallelStream, overhead is involved in splitting the work among several threads and joining or merging the results, I prefer using sequential stream in this scenario which has no I/O operations and only need to count for event types. 
Moreover, we can use groupingBy API like this: {{eventQueue.stream().collect(Collectors.groupingBy(e -> e.getType(), Collectors.counting()))}}, instead of calling Collectors#toConcurrentMap or Collectors#toMap. > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-8995.001.patch, YARN-8995.002.patch, > YARN-8995.003.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862821#comment-16862821 ] Tao Yang commented on YARN-8995: Thanks [~zhuqi] for updating the patch. Comments for the new patch: * Sorry to have made a mistake in my last comment, serviceInit is a more proper place to initialize conf, then you can remove the initial value for detailsInterval field. * There's no need to separate name with double "_" for "...EVENTS__INFO...", "...EVENTS_INFO..." is ok. The annotation "The interval thousands of ..." can be replaced as "The interval of ... (in thousands)". * For parallelStream, overhead is involved in splitting the work among several threads and joining or merging the results, I prefer using sequential stream in this scenario which has no I/O operations and only need to count for event types. Moreover, we can use groupingBy API like this: {{eventQueue.stream().collect(Collectors.groupingBy(e -> e.getType(), Collectors.counting()))}}, instead of calling Collectors#toConcurrentMap or Collectors#toMap. > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-8995.001.patch, YARN-8995.002.patch, > YARN-8995.003.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
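The groupingBy suggestion in the comment above can be sketched as below. Event and EventType here are simplified stand-ins for YARN's dispatcher types, not the real org.apache.hadoop.yarn.event classes.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

// Minimal sketch of summarizing an event queue by event type with a
// sequential stream, as suggested in the review comment.
public class EventQueueSummary {
    enum EventType { NODE_UPDATE, APP_ATTEMPT_ADDED, CONTAINER_FINISHED }

    static class Event {
        private final EventType type;
        Event(EventType type) { this.type = type; }
        EventType getType() { return type; }
    }

    static Map<EventType, Long> countByType(BlockingQueue<Event> eventQueue) {
        // A sequential stream suffices: there is no I/O, only counting,
        // so parallelStream's split/merge overhead buys nothing here.
        return eventQueue.stream()
            .collect(Collectors.groupingBy(Event::getType,
                                           Collectors.counting()));
    }

    public static void main(String[] args) {
        BlockingQueue<Event> q = new LinkedBlockingQueue<>();
        q.add(new Event(EventType.NODE_UPDATE));
        q.add(new Event(EventType.NODE_UPDATE));
        q.add(new Event(EventType.CONTAINER_FINISHED));
        // Prints the per-type counts, e.g. NODE_UPDATE=2,
        // CONTAINER_FINISHED=1 (map iteration order is unspecified).
        System.out.println(countByType(q));
    }
}
```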
[jira] [Commented] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.
[ https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862793#comment-16862793 ] Abhishek Modi commented on YARN-9608: - Thanks [~tangzhankun] for going through patch: {quote} # If there's a long-running Spark shell application A of YARN cluster mode, only can the timeout cause the decommissioning node 1 (app A's container ran on it previously, but A's AM running on node 2) to shut down, right?{quote} Yes, in this case only timeout or application finish can cause the decommissioning to complete. This behavior would be similar to the behavior in case this node was put in decommissioning state when container for app A was running on the node. {quote} And if node 1 is shut down due to timeout, and when node 1 is re-registered in the future, will the node 1 still be considered belongs to running application A? {quote} No, if node was shut down when no container was running on the node it won't be considered belonging to app A. But in case, work preserving node manager was enabled and a container was recovered on that node for app A, it will be considered to be running app A. > DecommissioningNodesWatcher should get lists of running applications on node > from RMNode. > - > > Key: YARN-9608 > URL: https://issues.apache.org/jira/browse/YARN-9608 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9608.001.patch, YARN-9608.002.patch > > > At present, DecommissioningNodesWatcher tracks list of running applications > and triggers decommission of nodes when all the applications that ran on the > node completes. This Jira proposes to solve following problem: > # DecommissioningNodesWatcher skips tracking application containers on a > particular node before the node is in DECOMMISSIONING state. It only tracks > containers once the node is in DECOMMISSIONING state. 
This can lead to > shuffle data loss of apps whose containers ran on this node before it was > moved to decommissioning state. > # It is keeping track of running apps. We can leverage this directly from > RMNode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.
[ https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862789#comment-16862789 ] Zhankun Tang commented on YARN-9608: [~abmodi], Thanks. Just read through the whole patch. Two questions: 1. If there's a long-running Spark shell application A of YARN cluster mode, only can the timeout cause the decommissioning node 1 (app A's container ran on it previously, but A's AM running on node 2) to shut down, right? 2. And if node 1 is shut down due to timeout, and when node 1 is re-registered in the future, will the node 1 still be considered belongs to running application A? > DecommissioningNodesWatcher should get lists of running applications on node > from RMNode. > - > > Key: YARN-9608 > URL: https://issues.apache.org/jira/browse/YARN-9608 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9608.001.patch, YARN-9608.002.patch > > > At present, DecommissioningNodesWatcher tracks list of running applications > and triggers decommission of nodes when all the applications that ran on the > node completes. This Jira proposes to solve following problem: > # DecommissioningNodesWatcher skips tracking application containers on a > particular node before the node is in DECOMMISSIONING state. It only tracks > containers once the node is in DECOMMISSIONING state. This can lead to > shuffle data loss of apps whose containers ran on this node before it was > moved to decommissioning state. > # It is keeping track of running apps. We can leverage this directly from > RMNode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org