[jira] [Created] (MAPREDUCE-7473) Entity id/type not updated for HistoryEvent NORMALIZED_RESOURCE
Bilwa S T created MAPREDUCE-7473:

Summary: Entity id/type not updated for HistoryEvent NORMALIZED_RESOURCE
Key: MAPREDUCE-7473
URL: https://issues.apache.org/jira/browse/MAPREDUCE-7473
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Bilwa S T
Assignee: Bilwa S T

Getting below exception in MR AM logs:

2024-03-09 16:23:30,329 ERROR [Job ATS Event Dispatcher] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error putting entity null to TimelineServer
org.apache.hadoop.yarn.exceptions.YarnException: Incomplete entity without entity id/type
	at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.putEntities(TimelineWriter.java:88)
	at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:187)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForTimelineServer(JobHistoryEventHandler.java:1129)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:745)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1795)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1791)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:241)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:156)
	at java.base/java.lang.Thread.run(Thread.java:840)
2024-03-09 16:23:30,332 ERROR [Job ATS Event Dispatcher] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error putting entity null to TimelineServer
org.apache.hadoop.yarn.exceptions.YarnException: Incomplete entity without entity id/type
	at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.putEntities(TimelineWriter.java:88)
	at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:187)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForTimelineServer(JobHistoryEventHandler.java:1129)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:745)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1795)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1791)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:241)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:156)
	at java.base/java.lang.Thread.run(Thread.java:840)

--
This message was sent by Atlassian Jira (v8.20.10#820010)
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
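The exception shows the AM publishing a timeline entity whose id and type were never populated for the NORMALIZED_RESOURCE history event. A minimal, self-contained sketch of the rejected-entity check (hypothetical class names `TimelineEntityModel`/`TimelineWriterModel`, not Hadoop's actual `TimelineEntity`/`TimelineWriter` classes) illustrates why `putEntities` fails:

```java
import java.util.List;

// Simplified model of YARN's timeline-entity validation. Hadoop's real
// TimelineWriter.putEntities performs an equivalent completeness check and
// raises "Incomplete entity without entity id/type" for entities like the
// one the NORMALIZED_RESOURCE event produced in the log above.
class TimelineEntityModel {
    String entityId;   // must be set before publishing
    String entityType; // must be set before publishing
}

class TimelineWriterModel {
    static boolean isComplete(TimelineEntityModel e) {
        return e.entityId != null && e.entityType != null;
    }

    // Rejects any entity whose id or type was never filled in.
    static void putEntities(List<TimelineEntityModel> entities) {
        for (TimelineEntityModel e : entities) {
            if (!isComplete(e)) {
                throw new IllegalStateException(
                        "Incomplete entity without entity id/type");
            }
            // ...a real writer would serialize and POST the entity here...
        }
    }
}
```

Under this model, the fix amounts to populating both fields when the NORMALIZED_RESOURCE event is converted to an entity, before it ever reaches `putEntities`.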
[jira] [Updated] (MAPREDUCE-6784) JobImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bilwa S T updated MAPREDUCE-6784:
--
Status: Patch Available (was: Open)

> JobImpl state changes for containers reuse
> --
>
> Key: MAPREDUCE-6784
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6784
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: applicationmaster, mrv2
> Reporter: Devaraj Kavali
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-6784-v0.patch
>
> Add JobImpl state changes for supporting reusing of containers.
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408773#comment-17408773 ]

Bilwa S T commented on MAPREDUCE-7169:
--
Hi [~jeagles] Thanks for your review.
* Why should denying racks and hosts be enabled separately? Can you please elaborate? Currently we try to avoid launching on the same rack as the old attempt; if there are no containers on a different rack, then we try choosing a node other than the old attempt's node.
* I will update the patch by changing blacklist to denylist.

> Speculative attempts should not run on the same node
>
> Key: MAPREDUCE-7169
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7169
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: yarn
> Affects Versions: 2.7.2
> Reporter: Lee chen
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7169-001.patch, MAPREDUCE-7169-002.patch, MAPREDUCE-7169-003.patch, MAPREDUCE-7169.004.patch, MAPREDUCE-7169.005.patch, MAPREDUCE-7169.006.patch, MAPREDUCE-7169.007.patch, image-2018-12-03-09-54-07-859.png
>
> I found that in all versions of YARN, Speculative Execution may place the speculative task on the same node as the original task. What I have read says only that it will try to have one more task attempt; I haven't seen any place mentioning that the speculative attempt should not run on the same node. This is unreasonable: if the node has some problem that makes task execution very slow, then placing the speculative task on that same node cannot help the problematic task.
> In our cluster (version 2.7.2, 2700 nodes), this phenomenon appears almost every day.
> !image-2018-12-03-09-54-07-859.png!
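The placement order described in the comment (prefer a different rack; fall back to any host other than the original attempt's) can be sketched as a small self-contained selector. This is illustrative only, not the MAPREDUCE-7169 patch; the `Node` and `SpeculativePlacement` names are hypothetical:

```java
import java.util.List;
import java.util.Optional;

// Candidate node, identified by its host name and the rack it sits on.
record Node(String host, String rack) {}

class SpeculativePlacement {
    // Choose a node for the speculative attempt:
    // 1) prefer any node on a different rack than the original attempt;
    // 2) otherwise accept the same rack, but never the same host;
    // 3) if only the original host is available, place nothing.
    static Optional<Node> choose(Node original, List<Node> candidates) {
        for (Node n : candidates) {
            if (!n.rack().equals(original.rack())) {
                return Optional.of(n); // different rack wins outright
            }
        }
        for (Node n : candidates) {
            if (!n.host().equals(original.host())) {
                return Optional.of(n); // same rack, different host
            }
        }
        return Optional.empty();
    }
}
```

The two-pass structure is what makes the rack and host checks separable: dropping the first loop yields "deny host only", which may be why a reviewer would ask whether the two denials should be toggled independently.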
[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377054#comment-17377054 ]

Bilwa S T commented on MAPREDUCE-7353:
--
Thank you [~epayne]

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Fix For: 3.4.0, 2.10.2, 3.2.3, 3.3.2
> Attachments: MAPREDUCE-7353.001.patch, MAPREDUCE-7353.002.patch
>
> Job fails as tasks fail due to too many fetch failures:
> {code:java}
> Line 48048: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e03_1622107691213_1054_01_05 taskAttempt attempt_1622107691213_1054_m_00_0 | ContainerLauncherImpl.java:394
> Line 48053: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | KILLING attempt_1622107691213_1054_m_00_0 | ContainerLauncherImpl.java:209
> Line 58026: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | TaskAttempt killed because it ran on unusable node node-group-1ZYEq0002:26009. AttemptId:attempt_1622107691213_1054_m_00_0 | JobImpl.java:1401
> Line 58030: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 58035: 2021-06-02 16:26:34,034 | INFO | RMCommunicator Allocator | Killing taskAttempt:attempt_1622107691213_1054_m_00_0 because it is running on unusable node:node-group-1ZYEq0002:26009 | RMContainerAllocator.java:1066
> Line 58043: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 58054: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
> Line 58055: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_00_0: Container released on a *lost* node | TaskAttemptImpl.java:2649
> Line 58057: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 60317: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | Too many fetch-failures for output of task attempt: attempt_1622107691213_1054_m_00_0 ... raising fetch failure to map | JobImpl.java:2005
> Line 60319: 2021-06-02 16:26:57,057 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_TOO_MANY_FETCH_FAILURE | TaskAttemptImpl.java:1390
> Line 60320: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | attempt_1622107691213_1054_m_00_0 transitioned from state SUCCESS_CONTAINER_CLEANUP to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=node-group-1ZYEq0002:26009 | TaskAttemptImpl.java:1411
> Line 69487: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
> Line 69527: 2021-06-02 16:30:02,002 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_00_0: cleanup failed for container container_e03_1622107691213_1054_01_05 : java.net.ConnectException: Call From node-group-1ZYEq0001/192.168.0.66 to node-group-1ZYEq0002:26009 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> Line 69607: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
> Line 69609: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
> Line 73645: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | Fetcher 9 going to fetch from node-group-1ZYEq0002:26008 for: [attempt_1622107691213_1054_m_00_0] | Fetcher.java:318
> Line 73646: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | MapOutput URL for node-group-1ZYEq0002:26008 -> http://node-group-1ZYEq0002:26008/mapOutput?job=job_1622107691213_1054=4=attempt_1622107691213_1054_m_00_0 | Fetcher.java:686
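The "Too many fetch-failures ... raising fetch failure to map" line marks the point where a succeeded map attempt is retroactively failed. A hedged, self-contained sketch of that decision (threshold names and values are assumptions modelled on the configurable limits the AM uses, not the exact JobImpl code):

```java
// Illustrative model of the AM's fetch-failure decision: a completed map's
// output is declared lost once enough reducers have reported failures
// fetching it AND those failures are a large enough fraction of the
// reducers currently shuffling. Both thresholds are configurable in the
// real AM; the values below are assumed defaults for illustration.
class FetchFailureTracker {
    static final int MAX_FETCH_FAILURES_NOTIFICATIONS = 3;        // assumed
    static final double MAX_ALLOWED_FETCH_FAILURES_FRACTION = 0.5; // assumed

    static boolean tooManyFetchFailures(int fetchFailures, int shufflingReduces) {
        // With no reducers shuffling, treat the failure fraction as total.
        double failureFraction = shufflingReduces == 0
                ? 1.0
                : (double) fetchFailures / shufflingReduces;
        return fetchFailures >= MAX_FETCH_FAILURES_NOTIFICATIONS
                && failureFraction >= MAX_ALLOWED_FETCH_FAILURES_FRACTION;
    }
}
```

Under this model the attempt in the log transitions from SUCCESS_CONTAINER_CLEANUP to FAILED once the reducers stuck fetching from the stopped NM push the count over both limits, which is the re-run behavior the patch adjusts.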
[jira] [Commented] (MAPREDUCE-6786) TaskAttemptImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373949#comment-17373949 ]

Bilwa S T commented on MAPREDUCE-6786:
--
Hi [~devaraj] [~brahma] Can you please take a look at this whenever you get time? Thank you

> TaskAttemptImpl state changes for containers reuse
> --
>
> Key: MAPREDUCE-6786
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6786
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: applicationmaster, mrv2
> Reporter: Devaraj Kavali
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-6786-MR-6749.001.patch, MAPREDUCE-6786-MR-6749.002.patch, MAPREDUCE-6786-v0.patch, MAPREDUCE-6786.001.patch, MAPREDUCE-6786.002.patch
>
> Update TaskAttemptImpl to support the reuse of containers.
[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371932#comment-17371932 ]

Bilwa S T commented on MAPREDUCE-7353:
--
Hi [~epayne] can you please check updated patch? Thanks

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch, MAPREDUCE-7353.002.patch
[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368812#comment-17368812 ]

Bilwa S T commented on MAPREDUCE-7353:
--
Thanks [~epayne] for your review. I have added UT. Please take a look

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch, MAPREDUCE-7353.002.patch
[jira] [Updated] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bilwa S T updated MAPREDUCE-7353:
--
Attachment: MAPREDUCE-7353.002.patch

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch, MAPREDUCE-7353.002.patch
[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368087#comment-17368087 ]

Bilwa S T commented on MAPREDUCE-7353:
--
Hi [~epayne] Can you please take a look at this today if possible? Thanks

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch
[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364993#comment-17364993 ]

Bilwa S T commented on MAPREDUCE-7353:
--
Ok Thanks [~epayne]

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch
[jira] [Updated] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7353:
--
Status: Patch Available (was: Open)

> Mapreduce job fails when NM is stopped
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch
[jira] [Updated] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7353:
--
Attachment: MAPREDUCE-7353.001.patch

> Mapreduce job fails when NM is stopped
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch
[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364233#comment-17364233 ] Bilwa S T commented on MAPREDUCE-7353:
--
cc [~epayne] [~jbrennan]

> Mapreduce job fails when NM is stopped
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch
[jira] [Created] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
Bilwa S T created MAPREDUCE-7353:
Summary: Mapreduce job fails when NM is stopped
Key: MAPREDUCE-7353
URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Bilwa S T
Assignee: Bilwa S T

The job fails because a task fails with too many fetch failures:
{code:java}
Line 48048: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e03_1622107691213_1054_01_05 taskAttempt attempt_1622107691213_1054_m_00_0 | ContainerLauncherImpl.java:394
Line 48053: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | KILLING attempt_1622107691213_1054_m_00_0 | ContainerLauncherImpl.java:209
Line 58026: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | TaskAttempt killed because it ran on unusable node node-group-1ZYEq0002:26009. AttemptId:attempt_1622107691213_1054_m_00_0 | JobImpl.java:1401
Line 58030: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 58035: 2021-06-02 16:26:34,034 | INFO | RMCommunicator Allocator | Killing taskAttempt:attempt_1622107691213_1054_m_00_0 because it is running on unusable node:node-group-1ZYEq0002:26009 | RMContainerAllocator.java:1066
Line 58043: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 58054: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
Line 58055: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_00_0: Container released on a *lost* node | TaskAttemptImpl.java:2649
Line 58057: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 60317: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | Too many fetch-failures for output of task attempt: attempt_1622107691213_1054_m_00_0 ... raising fetch failure to map | JobImpl.java:2005
Line 60319: 2021-06-02 16:26:57,057 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_TOO_MANY_FETCH_FAILURE | TaskAttemptImpl.java:1390
Line 60320: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | attempt_1622107691213_1054_m_00_0 transitioned from state SUCCESS_CONTAINER_CLEANUP to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=node-group-1ZYEq0002:26009 | TaskAttemptImpl.java:1411
Line 69487: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
Line 69527: 2021-06-02 16:30:02,002 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_00_0: cleanup failed for container container_e03_1622107691213_1054_01_05 : java.net.ConnectException: Call From node-group-1ZYEq0001/192.168.0.66 to node-group-1ZYEq0002:26009 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Line 69607: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
Line 69609: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
Line 73645: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | Fetcher 9 going to fetch from node-group-1ZYEq0002:26008 for: [attempt_1622107691213_1054_m_00_0] | Fetcher.java:318
Line 73646: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | MapOutput URL for node-group-1ZYEq0002:26008 -> http://node-group-1ZYEq0002:26008/mapOutput?job=job_1622107691213_1054=4=attempt_1622107691213_1054_m_00_0 | Fetcher.java:686
Line 74093: 2021-06-02 16:26:56,056 | INFO | fetcher#9 | Reporting fetch failure for attempt_1622107691213_1054_m_00_0 to MRAppMaster. | ShuffleSchedulerImpl.java:349
{code}
As the logs show, the RM reported the node update to the AM at 16:26:34, but the resulting event was skipped because KILL events are ignored while the TaskAttemptImpl is in the SUCCESS_CONTAINER_CLEANUP state. The attempt then receives a TA_TOO_MANY_FETCH_FAILURE event, which causes the task, and hence the job, to fail.
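The race described above can be sketched as a toy state machine (illustrative Java only; the class, state, and event names below are simplified stand-ins, not Hadoop's real TaskAttemptImpl API). The idea is that a TA_KILL arriving during SUCCESS_CONTAINER_CLEANUP is remembered instead of silently dropped, so the attempt ends up KILLED (and eligible for rescheduling) rather than SUCCEEDED and later FAILED by fetch failures:

```java
// Toy model of the reported race. In the buggy behavior, TA_KILL is
// silently ignored during SUCCESS_CONTAINER_CLEANUP; this sketch instead
// records the pending kill and applies it once container cleanup finishes.
// All names are illustrative, not Hadoop's actual API.
public class AttemptStateSketch {
    public enum State { SUCCESS_CONTAINER_CLEANUP, SUCCEEDED, KILLED, FAILED }
    public enum Event { TA_KILL, TA_CONTAINER_CLEANED, TA_TOO_MANY_FETCH_FAILURE }

    private State state = State.SUCCESS_CONTAINER_CLEANUP;
    private boolean killPending = false; // remember a kill we cannot act on yet

    public State handle(Event e) {
        switch (state) {
            case SUCCESS_CONTAINER_CLEANUP:
                if (e == Event.TA_KILL) {
                    killPending = true;               // record instead of dropping
                } else if (e == Event.TA_CONTAINER_CLEANED) {
                    state = killPending ? State.KILLED : State.SUCCEEDED;
                }
                break;
            case SUCCEEDED:
                if (e == Event.TA_TOO_MANY_FETCH_FAILURE) {
                    state = State.FAILED;             // the failure seen in the logs
                }
                break;
            default:
                break; // KILLED and FAILED are terminal in this sketch
        }
        return state;
    }
}
```

With the kill remembered, TA_CONTAINER_CLEANED takes the attempt to KILLED, so the later fetch-failure event never flips it from SUCCEEDED to FAILED.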
[jira] [Commented] (MAPREDUCE-6786) TaskAttemptImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353932#comment-17353932 ] Bilwa S T commented on MAPREDUCE-6786:
--
Hi [~brahma] [~devaraj], I have rebased this branch. Could you please help review this patch? Thanks.

> TaskAttemptImpl state changes for containers reuse
> Key: MAPREDUCE-6786
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6786
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: applicationmaster, mrv2
> Reporter: Devaraj Kavali
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-6786-MR-6749.001.patch, MAPREDUCE-6786-MR-6749.002.patch, MAPREDUCE-6786-v0.patch, MAPREDUCE-6786.001.patch, MAPREDUCE-6786.002.patch
>
> Update TaskAttemptImpl to support the reuse of containers.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-6786) TaskAttemptImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6786:
--
Attachment: MAPREDUCE-6786-MR-6749.002.patch

> TaskAttemptImpl state changes for containers reuse
> Key: MAPREDUCE-6786
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6786
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: applicationmaster, mrv2
> Reporter: Devaraj Kavali
> Assignee: Bilwa S T
> Priority: Major
[jira] [Commented] (MAPREDUCE-7199) HsJobsBlock reuse JobACLsManager for checkAccess
[ https://issues.apache.org/jira/browse/MAPREDUCE-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313912#comment-17313912 ] Bilwa S T commented on MAPREDUCE-7199:
--
Hi [~brahmareddy], can we backport this to branch-3.3?

> HsJobsBlock reuse JobACLsManager for checkAccess
> Key: MAPREDUCE-7199
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7199
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Bibin Chundatt
> Assignee: Bilwa S T
> Priority: Minor
> Fix For: 3.4.0
> Attachments: MAPREDUCE-7199-001.patch, MAPREDUCE-7199.002.patch, MAPREDUCE-7199.003.patch
>
> Reuse JobACLsManager.checkAccess instead of the local duplicate:
> {code}
> private boolean checkAccess(String userName) {
>   if (!areAclsEnabled) {
>     return true;
>   }
>   // A user can see their own job.
>   if (ugi.getShortUserName().equals(userName)) {
>     return true;
>   }
>   // Admins can see all jobs.
>   if (adminAclList != null && adminAclList.isUserAllowed(ugi)) {
>     return true;
>   }
>   return false;
> }
> {code}
> {code}
> jobACLsManager.checkAccess(ugi, JobACL.VIEW_JOB, .., new AccessControlList())
> {code}
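The logic being de-duplicated above can be sketched with simplified stand-in types (not the actual Hadoop UserGroupInformation/AccessControlList classes): allow access when ACLs are off, when the caller owns the job, or when the caller is on the admin ACL.

```java
// Simplified stand-in for the view-access check that HsJobsBlock duplicated
// and that JobACLsManager.checkAccess already provides. Types are toy
// replacements: the caller and owner are short user names, the admin ACL
// is a plain set of user names.
import java.util.Set;

public class ViewAclSketch {
    public static boolean checkAccess(boolean aclsEnabled, String caller,
                                      String jobOwner, Set<String> adminAcl) {
        if (!aclsEnabled) {
            return true;                      // ACLs disabled: everyone may view
        }
        if (caller.equals(jobOwner)) {
            return true;                      // users may view their own jobs
        }
        return adminAcl != null && adminAcl.contains(caller); // admins see all
    }
}
```

Centralizing this in one place (as the JIRA does with JobACLsManager) keeps the owner/admin rules from drifting apart between the history-server UI and the rest of the ACL handling.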
[jira] [Commented] (MAPREDUCE-6826) Job fails with InvalidStateTransitonException: Invalid event: JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING
[ https://issues.apache.org/jira/browse/MAPREDUCE-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312094#comment-17312094 ] Bilwa S T commented on MAPREDUCE-6826:
--
[~brahmareddy] can you please backport this to branch-3.3? Thanks.

> Job fails with InvalidStateTransitonException: Invalid event: JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING
> Key: MAPREDUCE-6826
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6826
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Varun Saxena
> Assignee: Bilwa S T
> Priority: Major
> Fix For: 3.4.0
> Attachments: MAPREDUCE-6826-001.patch, MAPREDUCE-6826-002.patch, MAPREDUCE-6826-003.patch
>
> This happens when a container is preempted by the scheduler after the job has started committing, and the exception in turn leads to the application being marked FAILED in YARN.
> We can probably ignore the JOB_TASK_COMPLETED event while the JobImpl state is COMMITTING or SUCCEEDED, since the job is already finishing. There is also no point in attempting to schedule another task attempt if the job is already in the COMMITTING or SUCCEEDED state.
> {noformat}
> 2016-12-23 09:10:38,642 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1482404625971_23910_m_04 Task Transitioned from RUNNING to SUCCEEDED
> 2016-12-23 09:10:38,642 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 5
> 2016-12-23 09:10:38,643 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1482404625971_23910Job Transitioned from RUNNING to COMMITTING
> 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e55_1482404625971_23910_01_10 taskAttempt attempt_1482404625971_23910_m_04_1
> 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1482404625971_23910_m_04_1
> 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : linux-19:26009
> 2016-12-23 09:10:38,644 INFO [CommitterEvent Processor #4] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT
> 2016-12-23 09:10:38,724 INFO [IPC Server handler 0 on 27113] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : jvm_1482404625971_23910_m_60473139527690 asked for a task
> 2016-12-23 09:10:38,724 INFO [IPC Server handler 0 on 27113] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1482404625971_23910_m_60473139527690 is invalid and will be killed.
> 2016-12-23 09:10:38,797 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Calling handler for JobFinishedEvent
> 2016-12-23 09:10:38,797 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1482404625971_23910Job Transitioned from COMMITTING to SUCCEEDED
> 2016-12-23 09:10:38,798 INFO [Thread-93] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Job finished cleanly, recording last MRAppMaster retry
> 2016-12-23 09:10:38,798 INFO [Thread-93] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true
> 2016-12-23 09:10:38,798 INFO [Thread-93] org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: RMCommunicator notified that shouldUnregistered is: true
> 2016-12-23 09:10:38,799 INFO [Thread-93] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true
> 2016-12-23 09:10:38,799 INFO [Thread-93] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true
> 2016-12-23 09:10:38,799 INFO [Thread-93] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Calling stop for all the services
> 2016-12-23 09:10:38,800 INFO [Thread-93] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 1
> 2016-12-23 09:10:38,989 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:5 CompletedReds:0 ContAlloc:8 ContRel:0 HostLocal:0 RackLocal:0
> 2016-12-23 09:10:38,993 INFO [RMCommunicator Allocator] >
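The fix proposed in the description (tolerate a late JOB_TASK_COMPLETED while committing or finished) can be illustrated with a toy state machine; these names are simplified stand-ins, not the real JobImpl transition table:

```java
// Hypothetical sketch of the proposed fix: treat JOB_TASK_COMPLETED as an
// ignorable event while the job is COMMITTING or SUCCEEDED, so a late task
// completion (e.g. from a preempted container) no longer produces an
// invalid-transition failure. Simplified, not Hadoop's real state machine.
import java.util.EnumSet;

public class JobStateSketch {
    public enum JobState { RUNNING, COMMITTING, SUCCEEDED, ERROR }
    public enum JobEvent { JOB_TASK_COMPLETED, JOB_COMMIT_COMPLETED }

    private static final EnumSet<JobState> IGNORE_TASK_COMPLETED =
        EnumSet.of(JobState.COMMITTING, JobState.SUCCEEDED);

    private JobState state = JobState.COMMITTING;

    public JobState handle(JobEvent e) {
        if (e == JobEvent.JOB_TASK_COMPLETED && IGNORE_TASK_COMPLETED.contains(state)) {
            return state;                 // swallow the late event instead of failing
        }
        if (e == JobEvent.JOB_COMMIT_COMPLETED && state == JobState.COMMITTING) {
            state = JobState.SUCCEEDED;
            return state;
        }
        state = JobState.ERROR;           // previously: invalid-transition path
        return state;
    }
}
```

In the real JobImpl the equivalent change is registering the event as a no-op transition for those states rather than letting the dispatcher hit an undefined transition.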
[jira] [Commented] (MAPREDUCE-6809) Create ContainerRequestor interface and refactor RMContainerRequestor to use it
[ https://issues.apache.org/jira/browse/MAPREDUCE-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289220#comment-17289220 ] Bilwa S T commented on MAPREDUCE-6809:
--
Attached rebased patch.

> Create ContainerRequestor interface and refactor RMContainerRequestor to use it
> Key: MAPREDUCE-6809
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6809
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: applicationmaster, mrv2
> Reporter: Devaraj Kavali
> Assignee: Devaraj Kavali
> Priority: Major
> Fix For: MR-6749
> Attachments: MAPREDUCE-6809-MR-6749.001.patch, MAPREDUCE-6809-MR-6749.002.patch, MAPREDUCE-6809.001.patch
>
> As per the discussion in MAPREDUCE-6773, create a ContainerRequestor interface and refactor RMContainerRequestor to use this interface.
[jira] [Updated] (MAPREDUCE-6809) Create ContainerRequestor interface and refactor RMContainerRequestor to use it
[ https://issues.apache.org/jira/browse/MAPREDUCE-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6809:
--
Attachment: MAPREDUCE-6809.001.patch

> Create ContainerRequestor interface and refactor RMContainerRequestor to use it
> Key: MAPREDUCE-6809
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6809
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: applicationmaster, mrv2
> Reporter: Devaraj Kavali
> Assignee: Devaraj Kavali
> Priority: Major
[jira] [Commented] (MAPREDUCE-6772) Add MR Job Configurations for Containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289191#comment-17289191 ] Bilwa S T commented on MAPREDUCE-6772:
--
MAPREDUCE-6772.001: patch rebased against trunk.

> Add MR Job Configurations for Containers reuse
> Key: MAPREDUCE-6772
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6772
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: applicationmaster, mrv2
> Reporter: Devaraj Kavali
> Assignee: Devaraj Kavali
> Priority: Major
> Fix For: MR-6749
> Attachments: MAPREDUCE-6772-MR-6749.004.patch, MAPREDUCE-6772-v0.patch, MAPREDUCE-6772-v1.patch, MAPREDUCE-6772.001.patch, MR-6749-MAPREDUCE-6772.003.patch
>
> This task adds the configurations required for the MR AM container reuse feature.
[jira] [Updated] (MAPREDUCE-6772) Add MR Job Configurations for Containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6772: - Attachment: (was: LICENSE-binary) > Add MR Job Configurations for Containers reuse > -- > > Key: MAPREDUCE-6772 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6772 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Devaraj Kavali >Priority: Major > Fix For: MR-6749 > > Attachments: MAPREDUCE-6772-MR-6749.004.patch, > MAPREDUCE-6772-v0.patch, MAPREDUCE-6772-v1.patch, MAPREDUCE-6772.001.patch, > MR-6749-MAPREDUCE-6772.003.patch > > > This task adds configurations required for MR AM Container reuse feature.
[jira] [Updated] (MAPREDUCE-6772) Add MR Job Configurations for Containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6772: - Attachment: MAPREDUCE-6772.001.patch > Add MR Job Configurations for Containers reuse > -- > > Key: MAPREDUCE-6772 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6772 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Devaraj Kavali >Priority: Major > Fix For: MR-6749 > > Attachments: MAPREDUCE-6772-MR-6749.004.patch, > MAPREDUCE-6772-v0.patch, MAPREDUCE-6772-v1.patch, MAPREDUCE-6772.001.patch, > MR-6749-MAPREDUCE-6772.003.patch > > > This task adds configurations required for MR AM Container reuse feature.
[jira] [Updated] (MAPREDUCE-6772) Add MR Job Configurations for Containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6772: - Attachment: LICENSE-binary > Add MR Job Configurations for Containers reuse > -- > > Key: MAPREDUCE-6772 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6772 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Devaraj Kavali >Priority: Major > Fix For: MR-6749 > > Attachments: MAPREDUCE-6772-MR-6749.004.patch, > MAPREDUCE-6772-v0.patch, MAPREDUCE-6772-v1.patch, MAPREDUCE-6772.001.patch, > MR-6749-MAPREDUCE-6772.003.patch > > > This task adds configurations required for MR AM Container reuse feature.
[jira] [Commented] (MAPREDUCE-6749) MR AM should reuse containers for Map/Reduce Tasks
[ https://issues.apache.org/jira/browse/MAPREDUCE-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285715#comment-17285715 ] Bilwa S T commented on MAPREDUCE-6749: -- Attached test report for this feature. > MR AM should reuse containers for Map/Reduce Tasks > -- > > Key: MAPREDUCE-6749 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6749 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Devaraj Kavali >Priority: Major > Attachments: Container Reuse Performance Report.pdf, > MAPREDUCE-6749-Container Reuse-v0.pdf > > > In continuation of MAPREDUCE-3902, the MR AM should reuse containers > for Map/Reduce Tasks, similar to the JVM Reuse feature we had in MRv1.
[jira] [Updated] (MAPREDUCE-6749) MR AM should reuse containers for Map/Reduce Tasks
[ https://issues.apache.org/jira/browse/MAPREDUCE-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6749: - Attachment: Container Reuse Performance Report.pdf > MR AM should reuse containers for Map/Reduce Tasks > -- > > Key: MAPREDUCE-6749 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6749 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Devaraj Kavali >Priority: Major > Attachments: Container Reuse Performance Report.pdf, > MAPREDUCE-6749-Container Reuse-v0.pdf > > > In continuation of MAPREDUCE-3902, the MR AM should reuse containers > for Map/Reduce Tasks, similar to the JVM Reuse feature we had in MRv1.
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271291#comment-17271291 ] Bilwa S T commented on MAPREDUCE-7169: -- Hi [~epayne] Can you please help in reviewing this patch? > Speculative attempts should not run on the same node > > > Key: MAPREDUCE-7169 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7169 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: yarn >Affects Versions: 2.7.2 >Reporter: Lee chen >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-7169-001.patch, MAPREDUCE-7169-002.patch, > MAPREDUCE-7169-003.patch, MAPREDUCE-7169.004.patch, MAPREDUCE-7169.005.patch, > MAPREDUCE-7169.006.patch, MAPREDUCE-7169.007.patch, > image-2018-12-03-09-54-07-859.png > > > In all versions of YARN, speculative execution may place the speculative > task on the same node as the original task. What I have read says only that it > will try to have one more task attempt; I haven't seen any place mentioning it should not be > on the same node. This is unreasonable: if the node has problems that make task > execution very slow, placing the speculative task on the same > node cannot help the problematic task. > In our cluster (version 2.7.2, 2700 nodes), this phenomenon appears > almost every day. > !image-2018-12-03-09-54-07-859.png!
[jira] [Updated] (MAPREDUCE-7314) Job will hang if NM is restarted while its running
[ https://issues.apache.org/jira/browse/MAPREDUCE-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7314: - Attachment: MAPREDUCE-7314-MR-6749.001.patch > Job will hang if NM is restarted while its running > -- > > Key: MAPREDUCE-7314 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7314 > Project: Hadoop Map/Reduce > Issue Type: Sub-task >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-7314-MR-6749.001.patch > > > This is due to three different reasons: > # PRIORITY_FAST_FAIL_MAP priority containers should be considered for reuse. > # Whenever CONTAINER_REMOTE_CLEANUP is fired for a task attempt, it won't kill > the current attempt assigned to the container. That is because the task attempt > is not updated in the ContainerLauncherImpl#Container class. > # A container gets assigned to a task attempt even when the container has stopped > running, i.e. its container-completed event has been processed. This is because we add > the reuse container map to the allocated list, so makeRemoteRequest gets the same > container in the allocation response even though the RM has sent the same container in the > finished-containers list. To avoid this we need to make sure the allocated list > doesn't have any containers which are finished. > Test credits : [~Rajshree]
[jira] [Updated] (MAPREDUCE-7314) Job will hang if NM is restarted while its running
[ https://issues.apache.org/jira/browse/MAPREDUCE-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7314: - Status: Patch Available (was: Open) > Job will hang if NM is restarted while its running > -- > > Key: MAPREDUCE-7314 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7314 > Project: Hadoop Map/Reduce > Issue Type: Sub-task >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-7314-MR-6749.001.patch > > > This is due to three different reasons: > # PRIORITY_FAST_FAIL_MAP priority containers should be considered for reuse. > # Whenever CONTAINER_REMOTE_CLEANUP is fired for a task attempt, it won't kill > the current attempt assigned to the container. That is because the task attempt > is not updated in the ContainerLauncherImpl#Container class. > # A container gets assigned to a task attempt even when the container has stopped > running, i.e. its container-completed event has been processed. This is because we add > the reuse container map to the allocated list, so makeRemoteRequest gets the same > container in the allocation response even though the RM has sent the same container in the > finished-containers list. To avoid this we need to make sure the allocated list > doesn't have any containers which are finished. > Test credits : [~Rajshree]
[jira] [Commented] (MAPREDUCE-7314) Job will hang if NM is restarted while its running
[ https://issues.apache.org/jira/browse/MAPREDUCE-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268769#comment-17268769 ] Bilwa S T commented on MAPREDUCE-7314: -- Hi [~epayne] This is in branch MR-6749, when container reuse is enabled. > Job will hang if NM is restarted while its running > -- > > Key: MAPREDUCE-7314 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7314 > Project: Hadoop Map/Reduce > Issue Type: Sub-task >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > > This is due to three different reasons: > # PRIORITY_FAST_FAIL_MAP priority containers should be considered for reuse. > # Whenever CONTAINER_REMOTE_CLEANUP is fired for a task attempt, it won't kill > the current attempt assigned to the container. That is because the task attempt > is not updated in the ContainerLauncherImpl#Container class. > # A container gets assigned to a task attempt even when the container has stopped > running, i.e. its container-completed event has been processed. This is because we add > the reuse container map to the allocated list, so makeRemoteRequest gets the same > container in the allocation response even though the RM has sent the same container in the > finished-containers list. To avoid this we need to make sure the allocated list > doesn't have any containers which are finished. > Test credits : [~Rajshree]
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267823#comment-17267823 ] Bilwa S T commented on MAPREDUCE-7169: -- Hi [~ahussein] Sorry, I missed your comment. All the changes have been made. This is good to go. > Speculative attempts should not run on the same node > > > Key: MAPREDUCE-7169 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7169 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: yarn >Affects Versions: 2.7.2 >Reporter: Lee chen >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-7169-001.patch, MAPREDUCE-7169-002.patch, > MAPREDUCE-7169-003.patch, MAPREDUCE-7169.004.patch, MAPREDUCE-7169.005.patch, > MAPREDUCE-7169.006.patch, MAPREDUCE-7169.007.patch, > image-2018-12-03-09-54-07-859.png > > > In all versions of YARN, speculative execution may place the speculative > task on the same node as the original task. What I have read says only that it > will try to have one more task attempt; I haven't seen any place mentioning it should not be > on the same node. This is unreasonable: if the node has problems that make task > execution very slow, placing the speculative task on the same > node cannot help the problematic task. > In our cluster (version 2.7.2, 2700 nodes), this phenomenon appears > almost every day. > !image-2018-12-03-09-54-07-859.png!
[jira] [Updated] (MAPREDUCE-7314) Job will hang if NM is restarted while its running
[ https://issues.apache.org/jira/browse/MAPREDUCE-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7314: - Description: This is due to three different reasons: # PRIORITY_FAST_FAIL_MAP priority containers should be considered for reuse. # Whenever CONTAINER_REMOTE_CLEANUP is fired for a task attempt, it won't kill the current attempt assigned to the container. That is because the task attempt is not updated in the ContainerLauncherImpl#Container class. # A container gets assigned to a task attempt even when the container has stopped running, i.e. its container-completed event has been processed. This is because we add the reuse container map to the allocated list, so makeRemoteRequest gets the same container in the allocation response even though the RM has sent the same container in the finished-containers list. To avoid this we need to make sure the allocated list doesn't have any containers which are finished. Test credits : [~Rajshree] was: This is due to three different reasons: # PRIORITY_FAST_FAIL_MAP priority containers should be considered for reuse. # Whenever CONTAINER_REMOTE_CLEANUP is fired for a task attempt, it won't kill the current attempt assigned to the container. That is because the task attempt is not updated in the ContainerLauncherImpl#Container class. # A container gets assigned to a task attempt even when the container has stopped running, i.e. its container-completed event has been processed. This is because we add the reuse container map to the allocated list, so makeRemoteRequest gets the same container in the allocation response even though the RM has sent the same container in the finished-containers list. To avoid this we need to make sure the allocated list doesn't have any containers which are finished.
> Job will hang if NM is restarted while its running > -- > > Key: MAPREDUCE-7314 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7314 > Project: Hadoop Map/Reduce > Issue Type: Sub-task >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > > This is due to three different reasons: > # PRIORITY_FAST_FAIL_MAP priority containers should be considered for reuse. > # Whenever CONTAINER_REMOTE_CLEANUP is fired for a task attempt, it won't kill > the current attempt assigned to the container. That is because the task attempt > is not updated in the ContainerLauncherImpl#Container class. > # A container gets assigned to a task attempt even when the container has stopped > running, i.e. its container-completed event has been processed. This is because we add > the reuse container map to the allocated list, so makeRemoteRequest gets the same > container in the allocation response even though the RM has sent the same container in the > finished-containers list. To avoid this we need to make sure the allocated list > doesn't have any containers which are finished. > Test credits : [~Rajshree]
[jira] [Created] (MAPREDUCE-7314) Job will hang if NM is restarted while its running
Bilwa S T created MAPREDUCE-7314: Summary: Job will hang if NM is restarted while its running Key: MAPREDUCE-7314 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7314 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Bilwa S T Assignee: Bilwa S T This is due to three different reasons: # PRIORITY_FAST_FAIL_MAP priority containers should be considered for reuse. # Whenever CONTAINER_REMOTE_CLEANUP is fired for a task attempt, it won't kill the current attempt assigned to the container. That is because the task attempt is not updated in the ContainerLauncherImpl#Container class. # A container gets assigned to a task attempt even when the container has stopped running, i.e. its container-completed event has been processed. This is because we add the reuse container map to the allocated list, so makeRemoteRequest gets the same container in the allocation response even though the RM has sent the same container in the finished-containers list. To avoid this we need to make sure the allocated list doesn't have any containers which are finished.
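The third cause described above (finished containers remaining in the allocated list) can be sketched as follows. This is a minimal illustration with invented class and method names, not the actual MR AM code: before handing allocated containers to the assigner, any container the RM has also reported as finished is dropped, so a reused container that died during an NM restart is never assigned to a task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AllocatedListFilter {
    // Drop any allocated container that the RM has also reported as finished,
    // so a container that died (e.g. during an NM restart) is never assigned.
    public static List<String> filterFinished(List<String> allocated,
                                              Set<String> finished) {
        List<String> usable = new ArrayList<>();
        for (String id : allocated) {
            if (!finished.contains(id)) {
                usable.add(id); // still running, safe to assign to a task
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        List<String> allocated = Arrays.asList("c1", "c2", "c3");
        Set<String> finished = new HashSet<>(Collections.singleton("c2"));
        System.out.println(filterFinished(allocated, finished)); // [c1, c3]
    }
}
```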
[jira] [Commented] (MAPREDUCE-6773) Implement RM Container Reuse Requestor to handle the reuse containers for resource requests
[ https://issues.apache.org/jira/browse/MAPREDUCE-6773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246463#comment-17246463 ] Bilwa S T commented on MAPREDUCE-6773: -- Hi [~devaraj] I see that containers are reused only if the priority is PRIORITY_MAP or PRIORITY_REDUCE. Currently containers are not even cleaned up if the priority is not one of these, due to which the job hangs whenever containers are killed on NM restart [as tasks are not assigned to the container]. Is there any reason why containers with priority PRIORITY_FAST_FAIL_MAP are not being reused? {quote}RMContainerReuseRequestor ln no 133, we are checking for PRIORITY_MAP and PRIORITY_REDUCE what about other priorities ? do we need to send CONTAINER_REMOTE_CLEANUP, {quote} > Implement RM Container Reuse Requestor to handle the reuse containers for > resource requests > --- > > Key: MAPREDUCE-6773 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6773 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Affects Versions: MR-6749 >Reporter: Devaraj Kavali >Assignee: Devaraj Kavali >Priority: Major > Fix For: MR-6749 > > Attachments: MAPREDUCE-6773-MR-6749.003.patch, > MAPREDUCE-6773-MR-6749.004.patch, MAPREDUCE-6773-MR-6749.005.patch, > MAPREDUCE-6773-v0.patch, MAPREDUCE-6773-v1.patch, MAPREDUCE-6773-v2.patch > > > Add RM Container Reuse Requestor which handles the reuse containers against > the Job resource requests.
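The reuse-priority check being discussed can be sketched like this. The class, method, and numeric values below are illustrative only (the real constants live in the MR AM's RM container allocator and are not reproduced here); the point is that extending the check to PRIORITY_FAST_FAIL_MAP would make fast-fail map containers eligible for reuse or cleanup.

```java
public class ReusePrioritySketch {
    // Illustrative values; the actual constants are defined in the MR AM's
    // RM container allocator, not authoritative here.
    public static final int PRIORITY_FAST_FAIL_MAP = 5;
    public static final int PRIORITY_REDUCE = 10;
    public static final int PRIORITY_MAP = 20;

    public static boolean isReusable(int priority) {
        return priority == PRIORITY_MAP
            || priority == PRIORITY_REDUCE
            || priority == PRIORITY_FAST_FAIL_MAP; // proposed addition
    }

    public static void main(String[] args) {
        System.out.println(isReusable(PRIORITY_FAST_FAIL_MAP)); // true
    }
}
```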
[jira] [Updated] (MAPREDUCE-7308) Containers never get reused as containersToReuse map gets cleared on makeRemoteRequest
[ https://issues.apache.org/jira/browse/MAPREDUCE-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7308: - Attachment: MAPREDUCE-7308-MR-6749.001.patch > Containers never get reused as containersToReuse map gets cleared on > makeRemoteRequest > -- > > Key: MAPREDUCE-7308 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7308 > Project: Hadoop Map/Reduce > Issue Type: Sub-task >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-7308-MR-6749.001.patch > > > In RMContainerReuseRequestor, whenever containerAssigned is called it checks > if the allocated container can be reused. This always returns false, as the map is > getting cleared on makeRemoteRequest. I think the container can be removed from > the containersToReuse map once it's used.
[jira] [Updated] (MAPREDUCE-7308) Containers never get reused as containersToReuse map gets cleared on makeRemoteRequest
[ https://issues.apache.org/jira/browse/MAPREDUCE-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7308: - Status: Patch Available (was: Open) > Containers never get reused as containersToReuse map gets cleared on > makeRemoteRequest > -- > > Key: MAPREDUCE-7308 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7308 > Project: Hadoop Map/Reduce > Issue Type: Sub-task >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-7308-MR-6749.001.patch > > > In RMContainerReuseRequestor, whenever containerAssigned is called it checks > if the allocated container can be reused. This always returns false, as the map is > getting cleared on makeRemoteRequest. I think the container can be removed from > the containersToReuse map once it's used.
[jira] [Assigned] (MAPREDUCE-6784) JobImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned MAPREDUCE-6784: Assignee: Bilwa S T (was: Devaraj Kavali) > JobImpl state changes for containers reuse > -- > > Key: MAPREDUCE-6784 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6784 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6784-v0.patch > > > Add JobImpl state changes for supporting reusing of containers.
[jira] [Commented] (MAPREDUCE-6786) TaskAttemptImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237293#comment-17237293 ] Bilwa S T commented on MAPREDUCE-6786: -- Hi [~devaraj] [~brahmareddy] Can you please review this patch? > TaskAttemptImpl state changes for containers reuse > -- > > Key: MAPREDUCE-6786 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6786 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6786-MR-6749.001.patch, > MAPREDUCE-6786-v0.patch, MAPREDUCE-6786.001.patch, MAPREDUCE-6786.002.patch > > > Update TaskAttemptImpl to support the reuse of containers.
[jira] [Updated] (MAPREDUCE-7308) Containers never get reused as containersToReuse map gets cleared on makeRemoteRequest
[ https://issues.apache.org/jira/browse/MAPREDUCE-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7308: - Description: In RMContainerReuseRequestor, whenever containerAssigned is called it checks if the allocated container can be reused. This always returns false, as the map is getting cleared on makeRemoteRequest. I think the container can be removed from the containersToReuse map once it's used. (was: In RMContainerReuseRequestor, whenever containerAssigned is called it checks if the allocated container can be reused. This always returns false, as the map is getting cleared on makeRemoteRequest. I think there is no need to clear the map, as the container will be removed from the containersToReuse map once it's used.) > Containers never get reused as containersToReuse map gets cleared on > makeRemoteRequest > -- > > Key: MAPREDUCE-7308 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7308 > Project: Hadoop Map/Reduce > Issue Type: Sub-task >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > > In RMContainerReuseRequestor, whenever containerAssigned is called it checks > if the allocated container can be reused. This always returns false, as the map is > getting cleared on makeRemoteRequest. I think the container can be removed from > the containersToReuse map once it's used.
[jira] [Created] (MAPREDUCE-7308) Containers never get reused as containersToReuse map gets cleared on makeRemoteRequest
Bilwa S T created MAPREDUCE-7308: Summary: Containers never get reused as containersToReuse map gets cleared on makeRemoteRequest Key: MAPREDUCE-7308 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7308 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Bilwa S T Assignee: Bilwa S T In RMContainerReuseRequestor, whenever containerAssigned is called it checks if the allocated container can be reused. This always returns false, as the map is getting cleared on makeRemoteRequest. I think there is no need to clear the map, as the container will be removed from the containersToReuse map once it's used.
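The proposed fix, removing an entry only when the container is actually assigned instead of clearing the whole map on every makeRemoteRequest, can be modelled with a small sketch. The class and method names below are invented for illustration and are not the real RMContainerReuseRequestor API.

```java
import java.util.HashMap;
import java.util.Map;

public class ReuseMapSketch {
    private final Map<String, Integer> containersToReuse = new HashMap<>();

    // A container finishes a task and becomes eligible for reuse.
    public void markReusable(String containerId, int priority) {
        containersToReuse.put(containerId, priority);
    }

    // On assignment, consume only this entry; the rest of the map survives
    // across makeRemoteRequest calls instead of being wiped wholesale.
    public boolean assignFromReuse(String containerId) {
        return containersToReuse.remove(containerId) != null;
    }

    public int reusableCount() {
        return containersToReuse.size();
    }
}
```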
[jira] [Commented] (MAPREDUCE-7293) All pages in JHS should honor yarn.webapp.filter-entity-list-by-user
[ https://issues.apache.org/jira/browse/MAPREDUCE-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230764#comment-17230764 ] Bilwa S T commented on MAPREDUCE-7293: -- Resolving this issue, as the jobACL info was null because the client didn't enable ACLs. > All pages in JHS should honor yarn.webapp.filter-entity-list-by-user > > > Key: MAPREDUCE-7293 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7293 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: jobhistoryserver >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > > Currently only HsJobsBlock checks for access. A user who doesn't have > permission to access a job page is still able to do so, which is wrong. So we need to > have the below check in HsJobBlock, HsTasksBlock and HsTaskPage > {code:java} > if (isFilterAppListByUserEnabled && ugi != null && !aclsManager > .checkAccess(ugi, JobACL.VIEW_JOB, job.getUserName(), null)) { > > } > {code}
[jira] [Resolved] (MAPREDUCE-7293) All pages in JHS should honor yarn.webapp.filter-entity-list-by-user
[ https://issues.apache.org/jira/browse/MAPREDUCE-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T resolved MAPREDUCE-7293. -- Resolution: Not A Problem > All pages in JHS should honor yarn.webapp.filter-entity-list-by-user > > > Key: MAPREDUCE-7293 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7293 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: jobhistoryserver >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > > Currently only HsJobsBlock checks for access. A user who doesn't have > permission to access a job page is still able to do so, which is wrong. So we need to > have the below check in HsJobBlock, HsTasksBlock and HsTaskPage > {code:java} > if (isFilterAppListByUserEnabled && ugi != null && !aclsManager > .checkAccess(ugi, JobACL.VIEW_JOB, job.getUserName(), null)) { > > } > {code}
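The access check quoted in the issue can be reduced to a small predicate. This sketch only mirrors the shape of the boolean condition; in the real code the inputs come from the configuration flag, a UserGroupInformation object, and aclsManager.checkAccess with JobACL.VIEW_JOB, none of which are reproduced here.

```java
public class JhsAccessSketch {
    // Deny only when user-filtering is enabled, a caller identity is present,
    // and the VIEW_JOB ACL check fails: the same shape as the quoted check.
    public static boolean shouldDeny(boolean filterEnabled,
                                     boolean hasCallerUgi,
                                     boolean aclAllowsView) {
        return filterEnabled && hasCallerUgi && !aclAllowsView;
    }

    public static void main(String[] args) {
        // Filtering on, caller known, ACL denies: the page must be blocked.
        System.out.println(shouldDeny(true, true, false)); // true
    }
}
```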
[jira] [Updated] (MAPREDUCE-6781) YarnChild should wait for another task when reuse is enabled
[ https://issues.apache.org/jira/browse/MAPREDUCE-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6781: - Attachment: MAPREDUCE-6781-MR-6749.001.patch > YarnChild should wait for another task when reuse is enabled > > > Key: MAPREDUCE-6781 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6781 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6781-MR-6749.001.patch, MAPREDUCE-6781-v0.patch > >
[jira] [Updated] (MAPREDUCE-6781) YarnChild should wait for another task when reuse is enabled
[ https://issues.apache.org/jira/browse/MAPREDUCE-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6781: - Status: Patch Available (was: Open) > YarnChild should wait for another task when reuse is enabled > > > Key: MAPREDUCE-6781 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6781 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6781-MR-6749.001.patch, MAPREDUCE-6781-v0.patch > >
[jira] [Assigned] (MAPREDUCE-6781) YarnChild should wait for another task when reuse is enabled
[ https://issues.apache.org/jira/browse/MAPREDUCE-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned MAPREDUCE-6781: Assignee: Bilwa S T (was: Devaraj Kavali) > YarnChild should wait for another task when reuse is enabled > > > Key: MAPREDUCE-6781 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6781 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6781-v0.patch > >
[jira] [Updated] (MAPREDUCE-7297) Exception thrown in log when trying to download conf from JHS
[ https://issues.apache.org/jira/browse/MAPREDUCE-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7297: - Status: Patch Available (was: Open) > Exception thrown in log when trying to download conf from JHS > - > > Key: MAPREDUCE-7297 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7297 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: MAPREDUCE-7297.001.patch > > > Below exception is thrown in JHS logs > {code:java} > 2020-09-23 21:53:07,437 | ERROR | qtp1635772897-51 | error handling URI: > /jobhistory/downloadconf/job_1600668504751_0001 | Dispatcher.java:175 > java.lang.IllegalStateException: No view rendered for 200 > at > org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:171) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:287) > at > com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:277) > at > com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:182) > at > com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119) > at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133) > at 
com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130) > at > com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203) > at > com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) > at > org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) > at > com.huawei.hadoop.adapter.sso.LogoutFilter.doFilter(LogoutFilter.java:62) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) > at > com.huawei.hadoop.adapter.sso.RefererCheckFilter.doFilter(RefererCheckFilter.java:76) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) > at > org.jasig.cas.client.util.HttpServletRequestWrapperFilter.doFilter(HttpServletRequestWrapperFilter.java:70) > at > com.huawei.hadoop.adapter.sso.HttpServletRequestWrapperFilterWrapper.doFilter(HttpServletRequestWrapperFilterWrapper.java:75) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) > at > org.jasig.cas.client.validation.AbstractTicketValidationFilter.doFilter(AbstractTicketValidationFilter.java:238) > at > com.huawei.hadoop.adapter.sso.Cas20ProxyReceivingTicketValidationFilterWrapper.doFilter(Cas20ProxyReceivingTicketValidationFilterWrapper.java:40) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
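The "No view rendered for 200" error in the trace above is raised by Hadoop's webapp dispatcher when a controller method completes with HTTP 200 but never marks a view as rendered, which is what happens when the downloadconf handler writes the configuration to the response directly. The sketch below models only that mechanism; `MiniDispatcher`, `Controller`, and `downloadConf` are illustrative stand-ins, not the real Hadoop webapp API.

```java
// Minimal model of the failure mode: a 200 response with no rendered view
// makes the dispatcher throw, even though the handler already wrote output.
final class MiniDispatcher {
    static final class Controller {
        boolean rendered = false; // what render()/renderText() would set
        int status = 200;

        // A download endpoint that streams bytes directly must still mark
        // the response as rendered, or the dispatcher complains afterwards.
        void downloadConf(StringBuilder out, boolean markRendered) {
            out.append("<configuration>...</configuration>");
            if (markRendered) {
                rendered = true;
            }
        }
    }

    static String service(Controller c, StringBuilder out, boolean markRendered) {
        c.downloadConf(out, markRendered);
        if (c.status == 200 && !c.rendered) {
            throw new IllegalStateException("No view rendered for 200");
        }
        return out.toString();
    }
}
```

In this model the fix direction is simply to mark the response rendered after streaming, which matches the symptom being a log-only error while the download itself succeeds.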
[jira] [Updated] (MAPREDUCE-7297) Exception thrown in log when trying to download conf from JHS
[ https://issues.apache.org/jira/browse/MAPREDUCE-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7297: - Attachment: MAPREDUCE-7297.001.patch > Exception thrown in log when trying to download conf from JHS > - > > Key: MAPREDUCE-7297 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7297 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: MAPREDUCE-7297.001.patch > > > Below exception is thrown in JHS logs > {code:java} > 2020-09-23 21:53:07,437 | ERROR | qtp1635772897-51 | error handling URI: > /jobhistory/downloadconf/job_1600668504751_0001 | Dispatcher.java:175 > java.lang.IllegalStateException: No view rendered for 200 > at > org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:171) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:287) > at > com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:277) > at > com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:182) > at > com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119) > at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133) > at 
com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130) > at > com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203) > at > com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) > at > org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) > at > com.huawei.hadoop.adapter.sso.LogoutFilter.doFilter(LogoutFilter.java:62) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) > at > com.huawei.hadoop.adapter.sso.RefererCheckFilter.doFilter(RefererCheckFilter.java:76) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) > at > org.jasig.cas.client.util.HttpServletRequestWrapperFilter.doFilter(HttpServletRequestWrapperFilter.java:70) > at > com.huawei.hadoop.adapter.sso.HttpServletRequestWrapperFilterWrapper.doFilter(HttpServletRequestWrapperFilterWrapper.java:75) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) > at > org.jasig.cas.client.validation.AbstractTicketValidationFilter.doFilter(AbstractTicketValidationFilter.java:238) > at > com.huawei.hadoop.adapter.sso.Cas20ProxyReceivingTicketValidationFilterWrapper.doFilter(Cas20ProxyReceivingTicketValidationFilterWrapper.java:40) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-6786) TaskAttemptImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6786: - Attachment: MAPREDUCE-6786-MR-6749.001.patch > TaskAttemptImpl state changes for containers reuse > -- > > Key: MAPREDUCE-6786 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6786 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6786-MR-6749.001.patch, > MAPREDUCE-6786-v0.patch, MAPREDUCE-6786.001.patch, MAPREDUCE-6786.002.patch > > > Update TaskAttemptImpl to support the reuse of containers.
[jira] [Updated] (MAPREDUCE-6786) TaskAttemptImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6786: - Attachment: MAPREDUCE-6786.002.patch > TaskAttemptImpl state changes for containers reuse > -- > > Key: MAPREDUCE-6786 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6786 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6786-v0.patch, MAPREDUCE-6786.001.patch, > MAPREDUCE-6786.002.patch > > > Update TaskAttemptImpl to support the reuse of containers.
[jira] [Updated] (MAPREDUCE-6786) TaskAttemptImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6786: - Status: Patch Available (was: Open) > TaskAttemptImpl state changes for containers reuse > -- > > Key: MAPREDUCE-6786 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6786 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6786-v0.patch, MAPREDUCE-6786.001.patch > > > Update TaskAttemptImpl to support the reuse of containers.
[jira] [Updated] (MAPREDUCE-6786) TaskAttemptImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6786: - Attachment: MAPREDUCE-6786.001.patch > TaskAttemptImpl state changes for containers reuse > -- > > Key: MAPREDUCE-6786 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6786 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6786-v0.patch, MAPREDUCE-6786.001.patch > > > Update TaskAttemptImpl to support the reuse of containers.
[jira] [Commented] (MAPREDUCE-6786) TaskAttemptImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202846#comment-17202846 ] Bilwa S T commented on MAPREDUCE-6786: -- Hi [~devaraj] [~Naganarasimha], I would like to work on this, so I am assigning it to myself. > TaskAttemptImpl state changes for containers reuse > -- > > Key: MAPREDUCE-6786 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6786 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6786-v0.patch > > > Update TaskAttemptImpl to support the reuse of containers.
[jira] [Assigned] (MAPREDUCE-6786) TaskAttemptImpl state changes for containers reuse
[ https://issues.apache.org/jira/browse/MAPREDUCE-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned MAPREDUCE-6786: Assignee: Bilwa S T (was: Naganarasimha G R) > TaskAttemptImpl state changes for containers reuse > -- > > Key: MAPREDUCE-6786 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6786 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster, mrv2 >Reporter: Devaraj Kavali >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6786-v0.patch > > > Update TaskAttemptImpl to support the reuse of containers.
[jira] [Created] (MAPREDUCE-7297) Exception thrown in log when trying to download conf from JHS
Bilwa S T created MAPREDUCE-7297: Summary: Exception thrown in log when trying to download conf from JHS Key: MAPREDUCE-7297 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7297 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 3.1.1 Reporter: Bilwa S T Assignee: Bilwa S T Below exception is thrown in JHS logs {code:java} 2020-09-23 21:53:07,437 | ERROR | qtp1635772897-51 | error handling URI: /jobhistory/downloadconf/job_1600668504751_0001 | Dispatcher.java:175 java.lang.IllegalStateException: No view rendered for 200 at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:171) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:287) at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:277) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:182) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119) at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133) at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130) at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) at 
org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) at com.huawei.hadoop.adapter.sso.LogoutFilter.doFilter(LogoutFilter.java:62) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) at com.huawei.hadoop.adapter.sso.RefererCheckFilter.doFilter(RefererCheckFilter.java:76) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) at org.jasig.cas.client.util.HttpServletRequestWrapperFilter.doFilter(HttpServletRequestWrapperFilter.java:70) at com.huawei.hadoop.adapter.sso.HttpServletRequestWrapperFilterWrapper.doFilter(HttpServletRequestWrapperFilterWrapper.java:75) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591) at org.jasig.cas.client.validation.AbstractTicketValidationFilter.doFilter(AbstractTicketValidationFilter.java:238) at com.huawei.hadoop.adapter.sso.Cas20ProxyReceivingTicketValidationFilterWrapper.doFilter(Cas20ProxyReceivingTicketValidationFilterWrapper.java:40) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Commented] (MAPREDUCE-7293) All pages in JHS should honor yarn.webapp.filter-entity-list-by-user
[ https://issues.apache.org/jira/browse/MAPREDUCE-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182270#comment-17182270 ] Bilwa S T commented on MAPREDUCE-7293: -- Hi [~sunilg] I analysed this again and found that checkAccess is already performed for all these pages in AppController#checkAccess; it is called whenever getJob, getTask, or getTasks is invoked. This works fine while the AM is running, but for the JHS it currently does not, because the jobACL in the code below comes back null. The same problem exists for REST API calls, since checkAccess is invoked from the REST path too. I think solving the jobACL issue would solve this problem, so there is no need to add the check to all the other pages again. {code:java} @Override public boolean checkAccess(UserGroupInformation callerUGI, JobACL jobOperation) { Map<JobACL, AccessControlList> jobACLs = jobInfo.getJobACLs(); AccessControlList jobACL = jobACLs.get(jobOperation); if (jobACL == null) { return true; } return aclsMgr.checkAccess(callerUGI, jobOperation, jobInfo.getUsername(), jobACL); }{code} Thanks for taking a look at this. > All pages in JHS should honor yarn.webapp.filter-entity-list-by-user > > > Key: MAPREDUCE-7293 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7293 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: jobhistoryserver >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > > Currently only HsJobsBlock checks for the access. If user who doesn't have > permission to access job page is able to do it which is wrong. So we need to > have below check in HsJobBlock,HsTasksBlock and HsTaskPage > {code:java} > if (isFilterAppListByUserEnabled && ugi != null && !aclsManager > .checkAccess(ugi, JobACL.VIEW_JOB, job.getUserName(), null)) { > > } > {code}
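The comment above hinges on the null-ACL branch of checkAccess: when the job's ACL map has no entry for the operation, the method returns true, so every caller passes. The self-contained sketch below models that behavior with stand-in types (not the real Hadoop UserGroupInformation/AccessControlList classes) and includes one possible stricter fallback; the fallback is an assumption for illustration, not the fix adopted in the issue.

```java
import java.util.Map;

// Model of the quoted checkAccess logic. JobAcl and the String-based ACL
// are simplified stand-ins for Hadoop's JobACL / AccessControlList.
final class AclCheckSketch {
    enum JobAcl { VIEW_JOB }

    // Current behavior: a missing ACL entry means the check passes
    // for every caller, which is why JHS filtering appears disabled
    // when jobACLs comes back empty.
    static boolean checkAccess(Map<JobAcl, String> jobAcls, JobAcl op,
                               String caller, String jobOwner) {
        String acl = jobAcls.get(op);
        if (acl == null) {
            return true;
        }
        return caller.equals(jobOwner) || acl.contains(caller);
    }

    // Hypothetical stricter fallback (an assumption, not the actual patch):
    // with no ACL recorded, only the job owner is allowed through.
    static boolean checkAccessStrict(Map<JobAcl, String> jobAcls, JobAcl op,
                                     String caller, String jobOwner) {
        String acl = jobAcls.get(op);
        if (acl == null) {
            return caller.equals(jobOwner);
        }
        return caller.equals(jobOwner) || acl.contains(caller);
    }
}
```

The contrast between the two methods shows why fixing the source of the null jobACL (rather than adding page-level checks) closes the gap everywhere at once.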
[jira] [Commented] (MAPREDUCE-7293) All pages in JHS should honor yarn.webapp.filter-entity-list-by-user
[ https://issues.apache.org/jira/browse/MAPREDUCE-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178760#comment-17178760 ] Bilwa S T commented on MAPREDUCE-7293: -- Hi [~sunilg] In MAPREDUCE-7097 I see that checkAccess used to be called from HsJobBlock and was later moved to HsJobsBlock, but HsJobBlock can still be reached by other users, so I think the check should be added in all of these places. Any suggestions? > All pages in JHS should honor yarn.webapp.filter-entity-list-by-user > > > Key: MAPREDUCE-7293 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7293 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > > Currently only HsJobsBlock checks for the access. If user who doesn't have > permission to access job page is able to do it which is wrong. So we need to > have below check in HsJobBlock,HsTasksBlock and HsTaskPage > {code:java} > if (isFilterAppListByUserEnabled && ugi != null && !aclsManager > .checkAccess(ugi, JobACL.VIEW_JOB, job.getUserName(), null)) { > > } > {code}
[jira] [Created] (MAPREDUCE-7293) All pages in JHS should honor yarn.webapp.filter-entity-list-by-user
Bilwa S T created MAPREDUCE-7293: Summary: All pages in JHS should honor yarn.webapp.filter-entity-list-by-user Key: MAPREDUCE-7293 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7293 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Bilwa S T Assignee: Bilwa S T Currently only HsJobsBlock checks for access: a user who does not have permission to view a job page can still open it, which is wrong. So we need to add the check below to HsJobBlock, HsTasksBlock, and HsTaskPage {code:java} if (isFilterAppListByUserEnabled && ugi != null && !aclsManager .checkAccess(ugi, JobACL.VIEW_JOB, job.getUserName(), null)) { } {code}
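The guard proposed in the description above is left with an empty body. The sketch below fills it in with one plausible outcome (rendering a denial message instead of the page); the denial text and the minimal AclsManager interface are assumptions for illustration, not the actual HsJobBlock code.

```java
// Self-contained model of the proposed filter-entity-list-by-user guard
// for a JHS page block. Real code would operate on the HtmlBlock API.
final class FilterByUserSketch {
    interface AclsManager {
        boolean checkAccess(String ugi, String op, String jobOwner);
    }

    static String renderJobPage(boolean filterEnabled, String ugi, String jobOwner,
                                AclsManager acls) {
        // Mirrors the check from the issue description: filtering enabled,
        // caller known, and VIEW_JOB denied -> short-circuit the page render.
        if (filterEnabled && ugi != null
                && !acls.checkAccess(ugi, "VIEW_JOB", jobOwner)) {
            return "Access denied: user " + ugi + " cannot view this job";
        }
        return "job page of " + jobOwner;
    }
}
```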
[jira] [Assigned] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS
[ https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned MAPREDUCE-6726: Assignee: Bilwa S T (was: Srikanth Sampath) > YARN Registry based AM discovery with retry and in-flight task persistent via > JHS > - > > Key: MAPREDUCE-6726 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster >Reporter: Junping Du >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6726-MAPREDUCE-6608.001.patch, > MAPREDUCE-6726-MAPREDUCE-6608.001.patch, > MAPREDUCE-6726-MAPREDUCE-6608.002.patch, > MAPREDUCE-6726-MAPREDUCE-6608.003.patch, WorkPreservingMRAppMaster.pdf > > > Several tasks will be achieved in this JIRA based on the demo patch in > MAPREDUCE-6608: > 1. AM discovery base on YARN register service. Could be replaced by YARN-4758 > later due to scale up issue. > 2. Retry logic for TaskUmbilicalProtocol RPC connection > 3. In-flight task recover after AM restart via JHS > 4. Configuration to control the behavior compatible with previous when not > enable this feature (by default). > All security related issues and other concerns discussed in MAPREDUCE-6608 > will be addressed in follow up JIRAs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Commented] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS
[ https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170590#comment-17170590 ] Bilwa S T commented on MAPREDUCE-6726: -- Hi [~srikanth.sampath] [~dmmkr] and I would like to take this work forward, if that is okay with you. > YARN Registry based AM discovery with retry and in-flight task persistent via > JHS > - > > Key: MAPREDUCE-6726 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: applicationmaster >Reporter: Junping Du >Assignee: Srikanth Sampath >Priority: Major > Attachments: MAPREDUCE-6726-MAPREDUCE-6608.001.patch, > MAPREDUCE-6726-MAPREDUCE-6608.001.patch, > MAPREDUCE-6726-MAPREDUCE-6608.002.patch, > MAPREDUCE-6726-MAPREDUCE-6608.003.patch, WorkPreservingMRAppMaster.pdf > > > Several tasks will be achieved in this JIRA based on the demo patch in > MAPREDUCE-6608: > 1. AM discovery base on YARN register service. Could be replaced by YARN-4758 > later due to scale up issue. > 2. Retry logic for TaskUmbilicalProtocol RPC connection > 3. In-flight task recover after AM restart via JHS > 4. Configuration to control the behavior compatible with previous when not > enable this feature (by default). > All security related issues and other concerns discussed in MAPREDUCE-6608 > will be addressed in follow up JIRAs.
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169240#comment-17169240 ] Bilwa S T commented on MAPREDUCE-7169: -- Hi [~ahussein] I have addressed all the review comments and fixed the checkstyle issues that could be fixed. > Speculative attempts should not run on the same node > > > Key: MAPREDUCE-7169 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7169 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: yarn >Affects Versions: 2.7.2 >Reporter: Lee chen >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-7169-001.patch, MAPREDUCE-7169-002.patch, > MAPREDUCE-7169-003.patch, MAPREDUCE-7169.004.patch, MAPREDUCE-7169.005.patch, > MAPREDUCE-7169.006.patch, MAPREDUCE-7169.007.patch, > image-2018-12-03-09-54-07-859.png > > > I found in all versions of yarn, Speculative Execution may set the > speculative task to the node of original task.What i have read is only it > will try to have one more task attempt. haven't seen any place mentioning not > on same node.It is unreasonable.If the node have some problems lead to tasks > execution will be very slow. and then placement the speculative task to same > node cannot help the problematic task. > In our cluster (version 2.7.2,2700 nodes),this phenomenon appear > almost everyday. > !image-2018-12-03-09-54-07-859.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7169: - Attachment: MAPREDUCE-7169.007.patch > Speculative attempts should not run on the same node > > > Key: MAPREDUCE-7169 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7169 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: yarn >Affects Versions: 2.7.2 >Reporter: Lee chen >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-7169-001.patch, MAPREDUCE-7169-002.patch, > MAPREDUCE-7169-003.patch, MAPREDUCE-7169.004.patch, MAPREDUCE-7169.005.patch, > MAPREDUCE-7169.006.patch, MAPREDUCE-7169.007.patch, > image-2018-12-03-09-54-07-859.png > > > I found in all versions of yarn, Speculative Execution may set the > speculative task to the node of original task.What i have read is only it > will try to have one more task attempt. haven't seen any place mentioning not > on same node.It is unreasonable.If the node have some problems lead to tasks > execution will be very slow. and then placement the speculative task to same > node cannot help the problematic task. > In our cluster (version 2.7.2,2700 nodes),this phenomenon appear > almost everyday. > !image-2018-12-03-09-54-07-859.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7169: - Attachment: MAPREDUCE-7169.006.patch > Speculative attempts should not run on the same node > > > Key: MAPREDUCE-7169 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7169 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: yarn >Affects Versions: 2.7.2 >Reporter: Lee chen >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-7169-001.patch, MAPREDUCE-7169-002.patch, > MAPREDUCE-7169-003.patch, MAPREDUCE-7169.004.patch, MAPREDUCE-7169.005.patch, > MAPREDUCE-7169.006.patch, image-2018-12-03-09-54-07-859.png > > > I found in all versions of yarn, Speculative Execution may set the > speculative task to the node of original task.What i have read is only it > will try to have one more task attempt. haven't seen any place mentioning not > on same node.It is unreasonable.If the node have some problems lead to tasks > execution will be very slow. and then placement the speculative task to same > node cannot help the problematic task. > In our cluster (version 2.7.2,2700 nodes),this phenomenon appear > almost everyday. > !image-2018-12-03-09-54-07-859.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
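The idea behind MAPREDUCE-7169, as described in the issue quoted above, is that a backup (speculative) attempt placed on the same node as the slow original attempt cannot help. A minimal sketch of that placement rule follows; the method and types are simplified stand-ins for the AM's container-allocation path, not the code in the attached patches.

```java
import java.util.List;
import java.util.Set;

// Sketch: when launching a speculative attempt, prefer any candidate node
// that is not already hosting a running attempt of the same task.
final class SpeculationPlacementSketch {
    static String pickNodeForSpeculativeAttempt(List<String> candidateNodes,
                                                Set<String> nodesOfRunningAttempts) {
        for (String node : candidateNodes) {
            if (!nodesOfRunningAttempts.contains(node)) {
                return node; // first node free of this task's attempts
            }
        }
        // Fall back to any node rather than starving the speculative attempt.
        return candidateNodes.isEmpty() ? null : candidateNodes.get(0);
    }
}
```

The fallback branch reflects a design choice worth noting: excluding the original node should be a preference, not a hard constraint, or speculation would deadlock on small clusters.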
[jira] [Assigned] (MAPREDUCE-6976) mapred job -set-priority claims to set priority higher than yarn.cluster.max-application-priority
[ https://issues.apache.org/jira/browse/MAPREDUCE-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned MAPREDUCE-6976: Assignee: (was: Bilwa S T) > mapred job -set-priority claims to set priority higher than > yarn.cluster.max-application-priority > - > > Key: MAPREDUCE-6976 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6976 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.9.0, 2.8.1, 3.1.0 >Reporter: Eric Payne >Priority: Minor > > With {{yarn.cluster.max-application-priority}} set to 20 and > {{job_1507226760578_0002}} running at priority 0, run the following command: > {noformat} > $ mapred job -set-priority job_1507226760578_0002 21 > Changed job priority. > {noformat} > The above commands sets {{job_1507226760578_0002}} to priority 20. If > {{job_1507226760578_0002}} is already at 20, the command does nothing. > Compare this behavior to running the {{yarn application -updatePriority}} > command: > {code} > $ yarn application -updatePriority 21 -appId application_1507226760578_0002 > Updating priority of an aplication application_1507226760578_0002 > Updated priority of an application application_1507226760578_0002 to cluster > max priority OR keeping old priority as application is in final states > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Assigned] (MAPREDUCE-6976) mapred job -set-priority claims to set priority higher than yarn.cluster.max-application-priority
[ https://issues.apache.org/jira/browse/MAPREDUCE-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned MAPREDUCE-6976: Assignee: Bilwa S T > mapred job -set-priority claims to set priority higher than > yarn.cluster.max-application-priority > - > > Key: MAPREDUCE-6976 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6976 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.9.0, 2.8.1, 3.1.0 >Reporter: Eric Payne >Assignee: Bilwa S T >Priority: Minor > > With {{yarn.cluster.max-application-priority}} set to 20 and > {{job_1507226760578_0002}} running at priority 0, run the following command: > {noformat} > $ mapred job -set-priority job_1507226760578_0002 21 > Changed job priority. > {noformat} > The above commands sets {{job_1507226760578_0002}} to priority 20. If > {{job_1507226760578_0002}} is already at 20, the command does nothing. > Compare this behavior to running the {{yarn application -updatePriority}} > command: > {code} > $ yarn application -updatePriority 21 -appId application_1507226760578_0002 > Updating priority of an aplication application_1507226760578_0002 > Updated priority of an application application_1507226760578_0002 to cluster > max priority OR keeping old priority as application is in final states > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Commented] (MAPREDUCE-6826) Job fails with InvalidStateTransitonException: Invalid event: JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING
[ https://issues.apache.org/jira/browse/MAPREDUCE-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106907#comment-17106907 ] Bilwa S T commented on MAPREDUCE-6826: -- cc [~inigoiri] [~brahmareddy] > Job fails with InvalidStateTransitonException: Invalid event: > JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING > > > Key: MAPREDUCE-6826 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6826 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Varun Saxena >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-6826-001.patch, MAPREDUCE-6826-002.patch, > MAPREDUCE-6826-003.patch > > > This happens if a container is preempted by scheduler after job starts > committing. > And this exception in turn leads to application being marked as FAILED in > YARN. > I think we can probably ignore JOB_TASK_COMPLETED event while JobImpl state > is COMMITTING or SUCCEEDED as job is in the process of finishing. > Also is there any point in attempting to scheduler another task attempt if > job is already in COMMITTING or SUCCEEDED state. 
> {noformat} > 2016-12-23 09:10:38,642 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: > task_1482404625971_23910_m_04 Task Transitioned from RUNNING to SUCCEEDED > 2016-12-23 09:10:38,642 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 5 > 2016-12-23 09:10:38,643 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1482404625971_23910Job Transitioned from RUNNING to COMMITTING > 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing > the event EventType: CONTAINER_REMOTE_CLEANUP for container > container_e55_1482404625971_23910_01_10 taskAttempt > attempt_1482404625971_23910_m_04_1 > 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING > attempt_1482404625971_23910_m_04_1 > 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: > Opening proxy : linux-19:26009 > 2016-12-23 09:10:38,644 INFO [CommitterEvent Processor #4] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2016-12-23 09:10:38,724 INFO [IPC Server handler 0 on 27113] > org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : > jvm_1482404625971_23910_m_60473139527690 asked for a task > 2016-12-23 09:10:38,724 INFO [IPC Server handler 0 on 27113] > org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: > jvm_1482404625971_23910_m_60473139527690 is invalid and will be killed. 
> 2016-12-23 09:10:38,797 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Calling handler for > JobFinishedEvent > 2016-12-23 09:10:38,797 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1482404625971_23910Job Transitioned from COMMITTING to SUCCEEDED > 2016-12-23 09:10:38,798 INFO [Thread-93] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Job finished cleanly, > recording last MRAppMaster retry > 2016-12-23 09:10:38,798 INFO [Thread-93] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > 2016-12-23 09:10:38,798 INFO [Thread-93] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: RMCommunicator notified > that shouldUnregistered is: true > 2016-12-23 09:10:38,799 INFO [Thread-93] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: > true > 2016-12-23 09:10:38,799 INFO [Thread-93] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: > JobHistoryEventHandler notified that forceJobCompletion is true > 2016-12-23 09:10:38,799 INFO [Thread-93] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Calling stop for all the > services > 2016-12-23 09:10:38,800 INFO [Thread-93] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping > JobHistoryEventHandler. Size of the outstanding queue size is 1 > 2016-12-23 09:10:38,989 INFO [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before > Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 > AssignedReds:0 CompletedMaps:5 CompletedReds:0 ContAlloc:8 ContRel:0 > HostLocal:0 RackLocal:0 > 2016-12-23 09:10:38,993 INFO [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received > completed container
[jira] [Updated] (MAPREDUCE-6826) Job fails with InvalidStateTransitonException: Invalid event: JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING
[ https://issues.apache.org/jira/browse/MAPREDUCE-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-6826: - Attachment: MAPREDUCE-6826-003.patch
[jira] [Resolved] (MAPREDUCE-7116) UT failure in TestMRTimelineEventHandling#testMRNewTimelineServiceEventHandling -Hadoop-3.1
[ https://issues.apache.org/jira/browse/MAPREDUCE-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T resolved MAPREDUCE-7116. -- Resolution: Invalid > UT failure in > TestMRTimelineEventHandling#testMRNewTimelineServiceEventHandling -Hadoop-3.1 > --- > > Key: MAPREDUCE-7116 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7116 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: K G Bakthavachalam >Assignee: Bilwa S T >Priority: Major > > unit test failure in > TestMRTimelineEventHandling#testMRNewTimelineServiceEventHandling due to > timeline server issue..due to Java.io.IOException : Job din't finish in > 30 seconds at > org.apache.hadoop.mapred.UtilsForTests.runJobSucceed(UtilsForTests.java:659) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103381#comment-17103381 ] Bilwa S T edited comment on MAPREDUCE-7169 at 5/9/20, 5:21 PM: --- Hi [~ahussein] What we are trying to achieve here is speculative attempt shouldn't be launched on faulty node. So even if task gets killed there is no point launching it on that node as it will slow.This is expected behaviour {quote} * Assuming that a new speculative attempt is created. Following the implementation, the new attempt X will have blacklisted nodes and skipped racks relevant to the original taskAttempt Y * Assuming taskAttempt Y is killed before attempt X gets assigned. * The RMContainerAllocator would still assign a host to attemptX based on the dated blacklists. Is this the expected behavior? or it is supposed to clear attemptX' blacklisted nodes?{quote} Yes i think these two cases should be handled {quote} * Should that object be synchronized? I believe there are more than one thread reading/writing to that object. Perhaps changing {{taskAttemptToEventMapping}} to {{concurrentHashMap}} would be sufficient. What do you think? {quote}* In {{taskAttemptToEventMapping}}, the data is only removed when the taskAttempt is assigned. If taskAttempt is killed before being assigned, {{taskAttemptToEventMapping}} would still have the taskAttempt. {quote}{quote} Will update this {quote} * Racks are going to be black listed too. Not just nodes. I believe that the javadoc and description in default.xml should emphasize that enabling the flag also avoids the local rack unless no other rack is available for scheduling.{quote} Actually when task attempt is killed by default Avataar is VIRGIN. this is defect which needs to be addressed. 
If speculative task attempt is killed it is launched as normal task attempt {quote} * why do we need {{mapTaskAttemptToAvataar}} when each taskAttempt has a field called {{avataar}} ?{quote} How do you get taskattempt details in RMContainerAllocator?? {quote} - That's a design issue. One would expect that RequestEvent's lifetime should not survive {{handle()}} call. Therefore, the metadata should be consumed by the handlers. In the patch, {{ContainerRequestEvent.blacklistedNodes}} could be a field in taskAttempt. Then you won't need {{TaskAttemptBlacklistManager}} class.{quote} Thanks was (Author: bilwast): Hi [~ahussein] What we are trying to achieve here is speculative attempt shouldn't be launched on faulty node. So even if task gets killed there is no point launching it on that node as it will slow.This is expected behaviour {quote} * Assuming that a new speculative attempt is created. Following the implementation, the new attempt X will have blacklisted nodes and skipped racks relevant to the original taskAttempt Y * Assuming taskAttempt Y is killed before attempt X gets assigned. * The RMContainerAllocator would still assign a host to attemptX based on the dated blacklists. Is this the expected behavior? or it is supposed to clear attemptX' blacklisted nodes?{quote} Yes i think these two cases should be handled {quote} * Should that object be synchronized? I believe there are more than one thread reading/writing to that object. Perhaps changing {{taskAttemptToEventMapping}} to {{concurrentHashMap}} would be sufficient. What do you think? {quote}* In {{taskAttemptToEventMapping}}, the data is only removed when the taskAttempt is assigned. If taskAttempt is killed before being assigned, {{taskAttemptToEventMapping}} would still have the taskAttempt. {quote}{quote} Will update this {quote} * Racks are going to be black listed too. Not just nodes. 
I believe that the javadoc and description in default.xml should emphasize that enabling the flag also avoids the local rack unless no other rack is available for scheduling.{quote} Actually when task attempt is killed by default Avataar is VIRGIN. this is defect which needs to be addressed. If speculative task attempt is killed it is launched as normal task attempt {quote} * why do we need {{mapTaskAttemptToAvataar}} when each taskAttempt has a field called {{avataar}} ?{quote} How do you get taskattempt details in RMContainerAllocator?? {quote} - That's a design issue. One would expect that RequestEvent's lifetime should not survive {{handle()}} call. Therefore, the metadata should be consumed by the handlers. In the patch, {{ContainerRequestEvent.blacklistedNodes}} could be a field in taskAttempt. Then you won't need {{TaskAttemptBlacklistManager}} class.{quote} > Speculative attempts should not run on the same node > > > Key: MAPREDUCE-7169 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7169 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components:
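The review above suggests replacing the plain map behind {{taskAttemptToEventMapping}} with a {{ConcurrentHashMap}}, since more than one thread reads and writes it, and notes that entries must also be removed when an attempt is killed before assignment, not only when it is assigned. A minimal sketch of that idea follows; the class and method names here are illustrative only, not the actual MapReduce AM types or the patch itself:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of a thread-safe attempt -> blacklisted-hosts mapping.
// ConcurrentHashMap.computeIfAbsent gives an atomic get-or-create, so two
// threads registering hosts for the same attempt cannot lose an update.
class AttemptBlacklist {
  private final Map<String, List<String>> attemptToHosts = new ConcurrentHashMap<>();

  // Record a host that the (speculative) task attempt should avoid.
  void addBlacklistedHost(String attemptId, String host) {
    attemptToHosts
        .computeIfAbsent(attemptId, k -> new CopyOnWriteArrayList<>())
        .add(host);
  }

  List<String> getBlacklistedHosts(String attemptId) {
    return attemptToHosts.getOrDefault(attemptId, List.of());
  }

  // Called when the attempt is assigned OR killed, so killed-before-assignment
  // attempts do not leak entries (the cleanup gap raised in the review).
  void clear(String attemptId) {
    attemptToHosts.remove(attemptId);
  }
}
```

With this shape, the kill path and the assignment path both call {{clear}}, addressing the lifetime concern without extra synchronization.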
[jira] [Comment Edited] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095502#comment-17095502 ] Bilwa S T edited comment on MAPREDUCE-7169 at 4/29/20, 2:40 PM: Hi [~ahussein] {code:java} The same node will be picked if there are no other available nodes. In MAPREDUCE-7169.005.patch , what is the expected behavior if the resources available to run the taskAttempt are only available on the same node? I do not see this case in the unit test.{code} Container is not assigned until resources are available on other node. Task will wait until it gets container on other node. As per use case we do not want container to be launched on same node as node might have a problem was (Author: bilwast): Hi [~ahussein] {code:java} The same node will be picked if there are no other available nodes. In MAPREDUCE-7169.005.patch , what is the expected behavior if the resources available to run the taskAttempt are only available on the same node? I do not see this case in the unit test.{code} Container is not assigned until resources are available on other node. > Speculative attempts should not run on the same node > > > Key: MAPREDUCE-7169 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7169 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: yarn >Affects Versions: 2.7.2 >Reporter: Lee chen >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-7169-001.patch, MAPREDUCE-7169-002.patch, > MAPREDUCE-7169-003.patch, MAPREDUCE-7169.004.patch, MAPREDUCE-7169.005.patch, > image-2018-12-03-09-54-07-859.png > > > I found in all versions of yarn, Speculative Execution may set the > speculative task to the node of original task.What i have read is only it > will try to have one more task attempt. haven't seen any place mentioning not > on same node.It is unreasonable.If the node have some problems lead to tasks > execution will be very slow. 
and then placement the speculative task to same > node cannot help the problematic task. > In our cluster (version 2.7.2,2700 nodes),this phenomenon appear > almost everyday. > !image-2018-12-03-09-54-07-859.png!
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094600#comment-17094600 ] Bilwa S T commented on MAPREDUCE-7169: -- Hi [~jeagles] can you please review when you have free time? > Speculative attempts should not run on the same node > > > Key: MAPREDUCE-7169 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7169 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: yarn >Affects Versions: 2.7.2 >Reporter: Lee chen >Assignee: Bilwa S T >Priority: Major > Attachments: MAPREDUCE-7169-001.patch, MAPREDUCE-7169-002.patch, > MAPREDUCE-7169-003.patch, MAPREDUCE-7169.004.patch, MAPREDUCE-7169.005.patch, > image-2018-12-03-09-54-07-859.png > > > I found in all versions of yarn, Speculative Execution may set the > speculative task to the node of original task.What i have read is only it > will try to have one more task attempt. haven't seen any place mentioning not > on same node.It is unreasonable.If the node have some problems lead to tasks > execution will be very slow. and then placement the speculative task to same > node cannot help the problematic task. > In our cluster (version 2.7.2,2700 nodes),this phenomenon appear > almost everyday. > !image-2018-12-03-09-54-07-859.png!
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084024#comment-17084024 ] Bilwa S T commented on MAPREDUCE-7169: -- Hi [~jeagles] Could you please help to review?
[jira] [Commented] (MAPREDUCE-7199) HsJobsBlock reuse JobACLsManager for checkAccess
[ https://issues.apache.org/jira/browse/MAPREDUCE-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082254#comment-17082254 ] Bilwa S T commented on MAPREDUCE-7199: -- Hi [~surendrasingh] i have attached a new patch. As part of MAPREDUCE-7097 addendum it was added that job owner and MR admin should be able to view job if filter entity list by user is enabled. As JobACLManager takes care of verifying adminAcl there is no need to pass AdminACL Please review latest patch > HsJobsBlock reuse JobACLsManager for checkAccess > > > Key: MAPREDUCE-7199 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-7199 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Bilwa S T >Priority: Minor > Attachments: MAPREDUCE-7199-001.patch, MAPREDUCE-7199.002.patch, > MAPREDUCE-7199.003.patch > > > Reuse JobAclManager.checkAccess > {code} > private boolean checkAccess(String userName) { > if(!areAclsEnabled) { > return true; > } > // User could see its own job. > if (ugi.getShortUserName().equals(userName)) { > return true; > } > // Admin could also see all jobs > if (adminAclList != null && adminAclList.isUserAllowed(ugi)) { > return true; > } > return false; > } > {code} > {code} > jobACLsManager > .checkAccess(ugi, JobACL.VIEW_JOB, .. > new AccessControlList())) > {code}
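The access check quoted in the issue reduces to three conditions: ACLs disabled, requester is the job owner, or requester is an admin. A standalone, hypothetical rendering of that logic follows (the class name and the plain {{Set<String>}} for admins are simplifications for illustration; the real code goes through {{JobACLsManager}} and {{AccessControlList}}, which is why the comment notes the admin ACL need not be passed separately):

```java
import java.util.Set;

// Hypothetical standalone version of the checkAccess logic quoted above.
class ViewAccess {
  static boolean checkAccess(boolean aclsEnabled, String user,
                             String jobOwner, Set<String> admins) {
    if (!aclsEnabled) {
      return true; // ACLs off: no filtering at all
    }
    if (user.equals(jobOwner)) {
      return true; // a user can always see their own job
    }
    // admins can see all jobs
    return admins != null && admins.contains(user);
  }
}
```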
[jira] [Updated] (MAPREDUCE-7199) HsJobsBlock reuse JobACLsManager for checkAccess
[ https://issues.apache.org/jira/browse/MAPREDUCE-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7199: - Attachment: MAPREDUCE-7199.003.patch
[jira] [Updated] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7169: - Attachment: MAPREDUCE-7169.005.patch
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081325#comment-17081325 ] Bilwa S T commented on MAPREDUCE-7169: -- cc [~ahussein] [~jeagles] [~jiwq]
[jira] [Updated] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7169: - Attachment: MAPREDUCE-7169.004.patch
[jira] [Comment Edited] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081201#comment-17081201 ] Bilwa S T edited comment on MAPREDUCE-7169 at 4/11/20, 7:32 AM:
Hi [~ahussein], sorry for the late reply.
{quote}I see that you add the node hosting the original task to the blacklist of the speculative task. Wouldn't it be easier just to change the order of the dataLocalHosts so that the node will be picked last in the loop? In that case, the speculative task will run on the same node only if all other nodes cannot be assigned to the speculative task.{quote}
With a change like that, there is still a chance the task attempt gets launched on the same node where the original attempt ran, which would not solve the problem. If we blacklist the node instead, the attempt is launched on another node in the next container-assignment iteration.
{quote}My intuition is that changing the policy to pick the node for the speculative task will inherently change the efficiency of the speculation. For example, picking a different node may increase the startup time of the speculative task. This implies a change of the speculation efficiency compared to the legacy behavior. Thus, I suggest giving the user the option to enable/disable the new policy, in case they prefer to evaluate the new behavior and revert to the legacy one if necessary.{quote}
I agree with this; I will change the code accordingly. Currently working on it.
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081201#comment-17081201 ] Bilwa S T commented on MAPREDUCE-7169:
Hi [~ahussein], sorry for the late reply.
{quote}I see that you add the node hosting the original task to the blacklist of the speculative task. Wouldn't it be easier just to change the order of the dataLocalHosts so that the node will be picked last in the loop? In that case, the speculative task will run on the same node only if all other nodes cannot be assigned to the speculative task.{quote}
With a change like that, there is still a chance the task attempt gets launched on the same node where the original attempt ran, which would not solve the problem. If we blacklist the node instead, the attempt is launched on another node in the next container-assignment iteration.
[jira] [Commented] (MAPREDUCE-7199) HsJobsBlock reuse JobACLsManager for checkAccess
[ https://issues.apache.org/jira/browse/MAPREDUCE-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080617#comment-17080617 ] Bilwa S T commented on MAPREDUCE-7199:
Thanks [~surendrasingh] for the review comments. Instead of creating a JobACLsManager object and calling JobACLsManager#checkAccess, we can directly call
{code:java}
job.checkAccess(ugi, JobACL.VIEW_JOB)
{code}
which in turn calls JobACLsManager#checkAccess. I have uploaded a patch; please review.

> HsJobsBlock reuse JobACLsManager for checkAccess
>
> Key: MAPREDUCE-7199
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7199
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Bibin Chundatt
> Assignee: Bilwa S T
> Priority: Minor
> Attachments: MAPREDUCE-7199-001.patch, MAPREDUCE-7199.002.patch
>
> Reuse JobACLsManager.checkAccess
> {code}
> private boolean checkAccess(String userName) {
>   if (!areAclsEnabled) {
>     return true;
>   }
>   // A user can see their own job.
>   if (ugi.getShortUserName().equals(userName)) {
>     return true;
>   }
>   // Admins can see all jobs.
>   if (adminAclList != null && adminAclList.isUserAllowed(ugi)) {
>     return true;
>   }
>   return false;
> }
> {code}
> {code}
> jobACLsManager.checkAccess(ugi, JobACL.VIEW_JOB, .., new AccessControlList())
> {code}
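The access-control logic quoted in the issue description can be sketched as a small standalone class. This is a minimal, hypothetical model of the three-step check (ACLs disabled, job owner, admin list); the class and field names here are illustrative, not Hadoop's actual JobACLsManager API:

```java
import java.util.Set;

// Hypothetical sketch of the checkAccess logic from the issue description.
// A Set<String> stands in for Hadoop's AccessControlList of admins.
public class SimpleJobAcls {
    private final boolean aclsEnabled;
    private final Set<String> adminUsers;

    public SimpleJobAcls(boolean aclsEnabled, Set<String> adminUsers) {
        this.aclsEnabled = aclsEnabled;
        this.adminUsers = adminUsers;
    }

    /** Mirrors the quoted check: disabled ACLs, owner, then admin. */
    public boolean checkAccess(String callerUser, String jobOwner) {
        if (!aclsEnabled) {
            return true;                 // ACLs disabled: everyone may view
        }
        if (callerUser.equals(jobOwner)) {
            return true;                 // a user can always see their own job
        }
        // Admins can see all jobs
        return adminUsers != null && adminUsers.contains(callerUser);
    }

    public static void main(String[] args) {
        SimpleJobAcls acls = new SimpleJobAcls(true, Set.of("admin"));
        System.out.println(acls.checkAccess("alice", "alice")); // owner: true
        System.out.println(acls.checkAccess("bob", "alice"));   // stranger: false
        System.out.println(acls.checkAccess("admin", "alice")); // admin: true
    }
}
```

The point of the patch is that HsJobsBlock need not duplicate this logic: delegating to the job's own checkAccess keeps a single authoritative implementation.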
[jira] [Updated] (MAPREDUCE-7199) HsJobsBlock reuse JobACLsManager for checkAccess
[ https://issues.apache.org/jira/browse/MAPREDUCE-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7199: - Attachment: MAPREDUCE-7199.002.patch
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001832#comment-17001832 ] Bilwa S T commented on MAPREDUCE-7169: -- Hi [~jeagles], can you please review the patch?
[jira] [Updated] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7169: - Attachment: MAPREDUCE-7169-003.patch
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983178#comment-16983178 ] Bilwa S T commented on MAPREDUCE-7169: -- Hi [~jeagles], I will update it by EOD.
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975784#comment-16975784 ] Bilwa S T commented on MAPREDUCE-7169: -- [~jiwq] Sorry, I missed your comment on the patch. I have uploaded the latest patch; please check.
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975783#comment-16975783 ] Bilwa S T commented on MAPREDUCE-7169: -- [~ahussein] I have updated the patch.
[jira] [Updated] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7169: - Attachment: MAPREDUCE-7169-002.patch
[jira] [Commented] (MAPREDUCE-7199) HsJobsBlock reuse JobACLsManager for checkAccess
[ https://issues.apache.org/jira/browse/MAPREDUCE-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826671#comment-16826671 ] Bilwa S T commented on MAPREDUCE-7199: -- Hi [~bibinchundatt], I have attached a patch. Please review.
[jira] [Updated] (MAPREDUCE-7199) HsJobsBlock reuse JobACLsManager for checkAccess
[ https://issues.apache.org/jira/browse/MAPREDUCE-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7199: - Attachment: MAPREDUCE-7199-001.patch
[jira] [Updated] (MAPREDUCE-7199) HsJobsBlock reuse JobACLsManager for checkAccess
[ https://issues.apache.org/jira/browse/MAPREDUCE-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7199: - Status: Patch Available (was: Open)
[jira] [Assigned] (MAPREDUCE-7199) HsJobsBlock reuse JobACLsManager for checkAccess
[ https://issues.apache.org/jira/browse/MAPREDUCE-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned MAPREDUCE-7199: Assignee: Bilwa S T
[jira] [Assigned] (MAPREDUCE-7116) UT failure in TestMRTimelineEventHandling#testMRNewTimelineServiceEventHandling -Hadoop-3.1
[ https://issues.apache.org/jira/browse/MAPREDUCE-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned MAPREDUCE-7116: Assignee: Bilwa S T

> UT failure in TestMRTimelineEventHandling#testMRNewTimelineServiceEventHandling - Hadoop-3.1
>
> Key: MAPREDUCE-7116
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7116
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 3.1.0
> Reporter: K G Bakthavachalam
> Assignee: Bilwa S T
> Priority: Major
>
> Unit test failure in TestMRTimelineEventHandling#testMRNewTimelineServiceEventHandling caused by a timeline server issue: java.io.IOException: Job didn't finish in 30 seconds at org.apache.hadoop.mapred.UtilsForTests.runJobSucceed(UtilsForTests.java:659)
[jira] [Updated] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7169: - Attachment: MAPREDUCE-7169-001.patch
[jira] [Commented] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798074#comment-16798074 ] Bilwa S T commented on MAPREDUCE-7169: -- cc [~bibinchundatt] [~jlowe]
[jira] [Updated] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated MAPREDUCE-7169: - Status: Patch Available (was: Open)
[jira] [Assigned] (MAPREDUCE-7195) Mapreduce task timeout to zero could cause too many status update
[ https://issues.apache.org/jira/browse/MAPREDUCE-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned MAPREDUCE-7195: Assignee: Bilwa S T

> Mapreduce task timeout to zero could cause too many status update
>
> Key: MAPREDUCE-7195
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7195
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bibin A Chundatt
> Assignee: Bilwa S T
> Priority: Major
> Attachments: screenshot-1.png
>
> Setting mapreduce.task.timeout=0 (which is used to disable the timeout feature) can cause too many status updates, because the default progress-report interval is derived from the timeout:
> {code}
> public static long getTaskProgressReportInterval(final Configuration conf) {
>   long taskHeartbeatTimeOut = conf.getLong(
>       MRJobConfig.TASK_TIMEOUT, MRJobConfig.DEFAULT_TASK_TIMEOUT_MILLIS);
>   return conf.getLong(MRJobConfig.TASK_PROGRESS_REPORT_INTERVAL,
>       (long) (TASK_REPORT_INTERVAL_TO_TIMEOUT_RATIO * taskHeartbeatTimeOut));
> }
> {code}
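The failure mode can be reproduced outside Hadoop with a hypothetical standalone model of the computation quoted above. The class, constants, and the guarded variant below are illustrative assumptions (the ratio and default are stand-ins, and the guard is one possible fix, not the actual Hadoop patch): when the timeout is 0, the derived default interval is also 0, so the task reports status in a tight loop.

```java
// Hypothetical model of getTaskProgressReportInterval and a guarded variant.
public class ProgressInterval {
    static final double RATIO = 0.01;               // report at 1% of the timeout (assumed)
    static final long DEFAULT_TIMEOUT_MS = 600_000; // stand-in for the default task timeout
    static final long FALLBACK_INTERVAL_MS = 3_000; // assumed floor when timeout is disabled

    /** Current behavior: the interval is derived directly from the timeout. */
    static long intervalUnguarded(long taskTimeoutMs) {
        return (long) (RATIO * taskTimeoutMs);      // collapses to 0 when timeout == 0
    }

    /** Guarded variant: fall back to the default timeout when the feature is disabled. */
    static long intervalGuarded(long taskTimeoutMs) {
        long t = taskTimeoutMs > 0 ? taskTimeoutMs : DEFAULT_TIMEOUT_MS;
        return Math.max(FALLBACK_INTERVAL_MS, (long) (RATIO * t));
    }

    public static void main(String[] args) {
        // timeout disabled (0): unguarded interval is 0 -> status update storm
        System.out.println(intervalUnguarded(0)); // 0
        System.out.println(intervalGuarded(0));   // 6000
    }
}
```

The design question the issue raises is exactly this: a sentinel value (0 = disabled) flows into an arithmetic derivation that was written assuming a positive timeout.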
[jira] [Comment Edited] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16794711#comment-16794711 ] Bilwa S T edited comment on MAPREDUCE-7169 at 3/18/19 4:57 AM:
Hi [~uranus], this is the same approach [~bibinchundatt] suggested: skip the nodes where a previous attempt was launched.
{quote}
* In TaskImpl#addAndScheduleAttempt, whenever the Avataar is SPECULATIVE, record the previous attempts' nodes and racks.
* In TaskAttemptImpl#RequestContainerTransition, when the Avataar is SPECULATIVE, skip the previous attempts' nodes and racks. We also need to keep a record of the blacklisted nodes.
* During mapper allocation in RMContainerAllocator#assignMapsWithLocality, there are three types of resource requests:
** Hosts - the data hosts for the task attempt are already updated.
** Racks - this is also handled in TaskAttemptImpl.
** Any - here we need the blacklisted-node record we updated, so we can check whether a node is blacklisted for the task. If it is, we skip the assignment, and allocation is retried with another container.
{quote}
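The "Any" case in the proposal above can be sketched as a small standalone model: record the host of the original attempt per task, then skip blacklisted hosts when assigning a container. Class and method names here are hypothetical, not the actual RMContainerAllocator API:

```java
import java.util.*;

// Hypothetical sketch of per-task host blacklisting for speculative attempts.
public class SpeculativePlacement {
    // task id -> hosts that already ran an attempt of that task
    private final Map<String, Set<String>> blacklist = new HashMap<>();

    /** Record the original attempt's host when a speculative attempt is scheduled. */
    public void blacklistHost(String taskId, String host) {
        blacklist.computeIfAbsent(taskId, k -> new HashSet<>()).add(host);
    }

    /** Return the first offered host not blacklisted for this task, or null to retry later. */
    public String assign(String taskId, List<String> offeredHosts) {
        Set<String> banned = blacklist.getOrDefault(taskId, Set.of());
        for (String host : offeredHosts) {
            if (!banned.contains(host)) {
                return host;
            }
        }
        return null; // every offered host is blacklisted; wait for the next allocation round
    }

    public static void main(String[] args) {
        SpeculativePlacement p = new SpeculativePlacement();
        p.blacklistHost("task_1", "nodeA"); // original attempt ran on nodeA
        System.out.println(p.assign("task_1", List.of("nodeA", "nodeB"))); // nodeB
        System.out.println(p.assign("task_1", List.of("nodeA")));          // null
    }
}
```

Returning null rather than falling back to the banned host is what distinguishes this from the reordering idea discussed earlier in the thread: the speculative attempt simply waits for a container on a different node.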
[jira] [Comment Edited] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16794711#comment-16794711 ] Bilwa S T edited comment on MAPREDUCE-7169 at 3/18/19 4:53 AM: --- Hi [~uranus] Same as what [~bibinchundatt] had suggested. We can skip nodes where previous attempt was launched. {quote} * In TaskImpl#addAndScheduleAttempt whenever Avataar is SPECULATIVE update previous attempts nodes and racks * In TaskAttemptImpl#RequestContainerTransition when Avataar is SPECULATIVE , we can skip previous attempts nodes and racks . Also we need to keep record of blacklisted nodes * During allocation for mapper RMContainerAllocator#assignMapsWithLocality ,we have three types of resource requests 1. Hosts - already we have updated datahosts for the task attempt 2. Racks - this is also taken care in taskattemptimpl 3. Any - In this case we need blacklisted node record which we had upadated so that we can check if node is blacklisted for the task. If it is blacklisted then we skip allocating and it would retry for other container {quote} was (Author: bilwast): Hi [~uranus] Same as what [~bibinchundatt] had suggested. We can skip nodes where previous attempt was launched. 1. In TaskImpl#addAndScheduleAttempt whenever Avataar is SPECULATIVE update previous attempts nodes and racks 2. In TaskAttemptImpl#RequestContainerTransition when Avataar is SPECULATIVE we can skip previous attempts nodes and racks . Also we need to keep record of blacklisted nodes 3. During allocation for mappers ie in RMContainerAllocator#assignMapsWithLocality we have three types of resource requests 1. Hosts - already we have updated datahosts for the task attempt 2. Racks - this is also taken care in taskattemptimpl 3. Any - In this case we need blacklisted node record which we had upadated so that we can check if node is blacklisted for the task. 
If it is blacklisted, we skip allocating and the allocator retries with another container.
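The steps in the comment above can be sketched as two small helpers: one that drops previously used hosts/racks when building a speculative attempt's resource request, and one that consults the per-task blacklist during the ANY-locality allocation. This is a hypothetical, standalone illustration, not the actual MAPREDUCE-7169 patch; the class and method names (`SpeculativeRequestFilter`, `filterOut`, `shouldSkipAllocation`) are invented for the sketch and do not exist in Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical sketch of the proposed logic: a SPECULATIVE attempt should
 * not request the hosts/racks its previous attempts ran on, and during
 * ANY-locality allocation a node blacklisted for the task is skipped so
 * the allocator retries with another container.
 */
public class SpeculativeRequestFilter {

    /** Returns the requested locations minus those used by previous attempts. */
    public static List<String> filterOut(List<String> requested,
                                         Set<String> previousAttemptLocations) {
        List<String> filtered = new ArrayList<>();
        for (String location : requested) {
            if (!previousAttemptLocations.contains(location)) {
                filtered.add(location);
            }
        }
        return filtered;
    }

    /** For the ANY request: skip this node if it is blacklisted for the task. */
    public static boolean shouldSkipAllocation(String candidateNode,
                                               Set<String> blacklistedForTask) {
        return blacklistedForTask.contains(candidateNode);
    }

    public static void main(String[] args) {
        // Previous attempt of the task ran on node1; node1 is also blacklisted.
        Set<String> previous = new HashSet<>(Arrays.asList("node1"));

        // Host list for the speculative attempt's resource request.
        List<String> hosts = Arrays.asList("node1", "node2", "node3");
        System.out.println(filterOut(hosts, previous));              // prints [node2, node3]

        // ANY-locality check during RMContainerAllocator-style assignment.
        System.out.println(shouldSkipAllocation("node1", previous)); // prints true
        System.out.println(shouldSkipAllocation("node2", previous)); // prints false
    }
}
```

The same `filterOut` helper would apply to both the host list and the rack list, matching steps 1 and 2 of the comment, while `shouldSkipAllocation` corresponds to the ANY case in step 3.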
[jira] [Comment Edited] (MAPREDUCE-7169) Speculative attempts should not run on the same node
[ https://issues.apache.org/jira/browse/MAPREDUCE-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16794711#comment-16794711 ] Bilwa S T edited comment on MAPREDUCE-7169 at 3/18/19 4:56 AM: --- Hi [~uranus] Same as what [~bibinchundatt] had suggested: we can skip the nodes where previous attempts were launched. {quote} * In TaskImpl#addAndScheduleAttempt, whenever the Avataar is SPECULATIVE, record the previous attempts' nodes and racks. * In TaskAttemptImpl#RequestContainerTransition, when the Avataar is SPECULATIVE, we can skip the previous attempts' nodes and racks. We also need to keep a record of blacklisted nodes. * During allocation for mappers in RMContainerAllocator#assignMapsWithLocality, we have three types of resource requests: 1. Hosts - we have already updated the data hosts for the task attempt. 2. Racks - this is also taken care of in TaskAttemptImpl. 3. Any - in this case we need the blacklisted-node record we updated, so that we can check whether the node is blacklisted for the task. If it is blacklisted, we skip allocating and the allocator retries with another container. {quote} was (Author: bilwast): Hi [~uranus] Same as what [~bibinchundatt] had suggested: we can skip the nodes where previous attempts were launched. {quote} # In TaskImpl#addAndScheduleAttempt, whenever the Avataar is SPECULATIVE, record the previous attempts' nodes and racks. # In TaskAttemptImpl#RequestContainerTransition, when the Avataar is SPECULATIVE, we can skip the previous attempts' nodes and racks. We also need to keep a record of blacklisted nodes. # During allocation for mappers in RMContainerAllocator#assignMapsWithLocality, we have three types of resource requests: 1. Hosts - we have already updated the data hosts for the task attempt. 2. Racks - this is also taken care of in TaskAttemptImpl. 3. Any - in this case we need the blacklisted-node record we updated, so that we can check whether the node is blacklisted for the task.
If it is blacklisted, we skip allocating and the allocator retries with another container. {quote}