[jira] [Commented] (MAPREDUCE-6850) Shuffle Handler keep-alive connections are closed from the server side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886703#comment-15886703 ] Rajesh Balamohan commented on MAPREDUCE-6850: - Thanks for sharing the latest patch [~jeagles]. .4 patch lgtm. > Shuffle Handler keep-alive connections are closed from the server side > -- > > Key: MAPREDUCE-6850 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6850 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: MAPREDUCE-6850.1.patch, MAPREDUCE-6850.2.patch, > MAPREDUCE-6850.3.patch, MAPREDUCE-6850.4.patch, With_Issue.png, > With_Patch.png, With_Patch_withData.png > > > When performance testing tez shuffle handler (TEZ-3334), it was noticed the > keep-alive connections are closed from the server-side. The client silently > recovers and logs the connection as keep-alive, despite reestablishing a > connection. This jira aims to remove the close from the server side, fixing > the bug preventing keep-alive connections. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (MAPREDUCE-6850) Shuffle Handler keep-alive connections are closed from the server side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884985#comment-15884985 ]

Rajesh Balamohan edited comment on MAPREDUCE-6850 at 2/27/17 2:15 AM:
----------------------------------------------------------------------

I tested the patch on a small multi-node cluster. Attaching the tcpdump screenshots for reference. The patch works fine with keep-alive enabled and connections are reused: multiple mapOutputs are retrieved over the same connection. Attachment "With_Patch.png" shows the TCP stream, where multiple mapOutputs are fetched over the same connection.

One very minor comment on the patch: the {{timer}} variable in {{HttpPipelineFactory}} may not be needed.

In MAPREDUCE-5787, the keep-alive parameter check was present up to https://issues.apache.org/jira/secure/attachment/12634984/MAPREDUCE-5787-2.4.0-v3.patch as follows.

{noformat}
if (!keepAlive && !keepAliveParam) {
  lastMap.addListener(ChannelFutureListener.CLOSE);
}
{noformat}

However, during refactoring it got dropped in subsequent patches in the same JIRA, which caused this problem. Even with that check, the server would have relied on the client to close the connection, i.e. it was the responsibility of the client (the JDK's internal HTTP client) to terminate the connection after the keep-alive timeout. The patch proposed in this JIRA addresses that scenario as well: the server automatically closes the connection once the idle time exceeds the server-side timeout threshold.


was (Author: rajesh.balamohan):
I tested the patch on a small multi-node cluster. Attaching the tcpdump screenshots for reference. The patch works fine with keep-alive enabled and connections are reused: multiple mapOutputs are retrieved over the same connection. Attachment "With_Patch.png" shows the TCP stream, where multiple mapOutputs are fetched over the same connection.

One very minor comment on the patch: the {{timer}} variable in {{HttpPipelineFactory}} may not be needed.

In MAPREDUCE-5787, the keep-alive parameter check was present up to https://issues.apache.org/jira/secure/attachment/12634984/MAPREDUCE-5787-2.4.0-v3.patch as follows.

{noformat}
if (!keepAlive && !keepAliveParam) {
  lastMap.addListener(ChannelFutureListener.CLOSE);
}
{noformat}

However, during refactoring it got dropped in subsequent patches, which caused this problem. Even with that check, the server would have relied on the client to close the connection, i.e. it was the responsibility of the client (the JDK's internal HTTP client) to terminate the connection after the keep-alive timeout. The patch proposed in this JIRA addresses that scenario as well: the server automatically closes the connection once the idle time exceeds the server-side timeout threshold.

> Shuffle Handler keep-alive connections are closed from the server side
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6850
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6850
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>        Attachments: MAPREDUCE-6850.1.patch, MAPREDUCE-6850.2.patch, MAPREDUCE-6850.3.patch, With_Issue.png, With_Patch.png, With_Patch_withData.png
>
> When performance testing tez shuffle handler (TEZ-3334), it was noticed the keep-alive connections are closed from the server-side. The client silently recovers and logs the connection as keep-alive, despite reestablishing a connection. This jira aims to remove the close from the server side, fixing the bug preventing keep-alive connections.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
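For readers following the fix: the comment above describes the server closing a keep-alive connection only once it has been idle past a configured timeout. Below is a minimal sketch of that general pattern with Netty 3.x (the framework the ShuffleHandler is built on). It is an illustration only, not the MAPREDUCE-6850 patch; the class name and pipeline layout are assumptions.

{code:java}
import org.jboss.netty.channel.ChannelHandlerContext;
import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.handler.timeout.IdleState;
import org.jboss.netty.handler.timeout.IdleStateAwareChannelHandler;
import org.jboss.netty.handler.timeout.IdleStateEvent;
import org.jboss.netty.handler.timeout.IdleStateHandler;
import org.jboss.netty.util.HashedWheelTimer;
import org.jboss.netty.util.Timer;

public class IdleCloseExample {
  /** Build a pipeline that closes a connection only after it has been idle. */
  public static ChannelPipeline pipeline(int keepAliveTimeoutSecs) {
    Timer timer = new HashedWheelTimer();
    ChannelPipeline p = Channels.pipeline();
    // Fires an IdleStateEvent when no read or write happens for the timeout.
    p.addLast("idle", new IdleStateHandler(timer, 0, 0, keepAliveTimeoutSecs));
    p.addLast("idleClose", new IdleStateAwareChannelHandler() {
      @Override
      public void channelIdle(ChannelHandlerContext ctx, IdleStateEvent e) {
        if (e.getState() == IdleState.ALL_IDLE) {
          // Close only once the client has gone idle past the threshold,
          // instead of closing right after the last mapOutput is sent.
          ctx.getChannel().close();
        }
      }
    });
    return p;
  }
}
{code}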
[jira] [Updated] (MAPREDUCE-6850) Shuffle Handler keep-alive connections are closed from the server side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated MAPREDUCE-6850:
----------------------------------------
    Attachment: With_Patch_withData.png
                With_Patch.png
                With_Issue.png

I tested the patch on a small multi-node cluster. Attaching the tcpdump screenshots for reference. The patch works fine with keep-alive enabled and connections are reused: multiple mapOutputs are retrieved over the same connection. Attachment "With_Patch.png" shows the TCP stream, where multiple mapOutputs are fetched over the same connection.

One very minor comment on the patch: the {{timer}} variable in {{HttpPipelineFactory}} may not be needed.

In MAPREDUCE-5787, the keep-alive parameter check was present up to https://issues.apache.org/jira/secure/attachment/12634984/MAPREDUCE-5787-2.4.0-v3.patch as follows.

{noformat}
if (!keepAlive && !keepAliveParam) {
  lastMap.addListener(ChannelFutureListener.CLOSE);
}
{noformat}

However, during refactoring it got dropped in subsequent patches, which caused this problem. Even with that check, the server would have relied on the client to close the connection, i.e. it was the responsibility of the client (the JDK's internal HTTP client) to terminate the connection after the keep-alive timeout. The patch proposed in this JIRA addresses that scenario as well: the server automatically closes the connection once the idle time exceeds the server-side timeout threshold.

> Shuffle Handler keep-alive connections are closed from the server side
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6850
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6850
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>        Attachments: MAPREDUCE-6850.1.patch, MAPREDUCE-6850.2.patch, MAPREDUCE-6850.3.patch, With_Issue.png, With_Patch.png, With_Patch_withData.png
>
> When performance testing tez shuffle handler (TEZ-3334), it was noticed the keep-alive connections are closed from the server-side. The client silently recovers and logs the connection as keep-alive, despite reestablishing a connection. This jira aims to remove the close from the server side, fixing the bug preventing keep-alive connections.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Commented] (MAPREDUCE-6850) Shuffle Handler keep-alive connections are closed from the server side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884074#comment-15884074 ]

Rajesh Balamohan commented on MAPREDUCE-6850:
---------------------------------------------

The patch looks good to me. I need more time to verify it on a cluster; I hit a DFSOutputStream timeout exception on the cluster I was using (which is not related to this JIRA).

> Shuffle Handler keep-alive connections are closed from the server side
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6850
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6850
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>        Attachments: MAPREDUCE-6850.1.patch, MAPREDUCE-6850.2.patch, MAPREDUCE-6850.3.patch
>
> When performance testing tez shuffle handler (TEZ-3334), it was noticed the keep-alive connections are closed from the server-side. The client silently recovers and logs the connection as keep-alive, despite reestablishing a connection. This jira aims to remove the close from the server side, fixing the bug preventing keep-alive connections.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Commented] (MAPREDUCE-6850) Shuffle Handler keep-alive connections are closed from the server side
[ https://issues.apache.org/jira/browse/MAPREDUCE-6850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882248#comment-15882248 ] Rajesh Balamohan commented on MAPREDUCE-6850: - Thanks for the patch [~jeagles]. I am getting a small cluster today/tomorrow. I will check the patch and will update. > Shuffle Handler keep-alive connections are closed from the server side > -- > > Key: MAPREDUCE-6850 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6850 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: MAPREDUCE-6850.1.patch, MAPREDUCE-6850.2.patch, > MAPREDUCE-6850.3.patch > > > When performance testing tez shuffle handler (TEZ-3334), it was noticed the > keep-alive connections are closed from the server-side. The client silently > recovers and logs the connection as keep-alive, despite reestablishing a > connection. This jira aims to remove the close from the server side, fixing > the bug preventing keep-alive connections. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Open (was: Patch Available) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.0 Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Priority: Critical Labels: ShuffleKeepalive Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0-v4.patch, MAPREDUCE-5787-2.4.0-v5-v6-diff.patch, MAPREDUCE-5787-2.4.0-v5.patch, MAPREDUCE-5787-2.4.0-v6.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated MAPREDUCE-5787:
----------------------------------------
    Attachment: MAPREDUCE-5787-2.4.0-v7.patch

Addressed Vinod's concern about increased memory usage due to caching mapOutputFileName and IndexRecord. The cache size can be configured via mapreduce.shuffle.mapoutput-info.meta.cache.size (default value is 1000). The string and LocalDirAllocator computations will be carried out twice if the number of mapIds goes past this limit.

Modify ShuffleHandler to support Keep-Alive
-------------------------------------------

Key: MAPREDUCE-5787
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: nodemanager
Affects Versions: 2.4.0
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Priority: Critical
Labels: ShuffleKeepalive
Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0-v4.patch, MAPREDUCE-5787-2.4.0-v5-v6-diff.patch, MAPREDUCE-5787-2.4.0-v5.patch, MAPREDUCE-5787-2.4.0-v6.patch, MAPREDUCE-5787-2.4.0-v7.patch, MAPREDUCE-5787-2.4.0.patch

--
This message was sent by Atlassian JIRA (v6.2#6252)
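As an aside, the size-bounded metadata cache described in the comment above can be pictured with a small sketch. This is hypothetical illustration code, not the ShuffleHandler implementation; the class name and the LRU eviction policy are assumptions.

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

/** Size-bounded LRU cache for per-mapId output metadata (illustrative only). */
class MapOutputInfoCache<K, V> extends LinkedHashMap<K, V> {
  // e.g. the value of mapreduce.shuffle.mapoutput-info.meta.cache.size (default 1000)
  private final int maxEntries;

  MapOutputInfoCache(int maxEntries) {
    super(16, 0.75f, true);        // access-order: least recently used entry is evicted first
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Beyond this limit, the handler would recompute the map output path / IndexRecord
    // instead of holding every entry in memory.
    return size() > maxEntries;
  }
}
{code}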
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Patch Available (was: Open) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.0 Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Priority: Critical Labels: ShuffleKeepalive Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0-v4.patch, MAPREDUCE-5787-2.4.0-v5-v6-diff.patch, MAPREDUCE-5787-2.4.0-v5.patch, MAPREDUCE-5787-2.4.0-v6.patch, MAPREDUCE-5787-2.4.0-v7.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Patch Available (was: Open) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.0 Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Priority: Critical Labels: ShuffleKeepalive Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0-v4.patch, MAPREDUCE-5787-2.4.0-v5.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated MAPREDUCE-5787:
----------------------------------------
    Attachment: MAPREDUCE-5787-2.4.0-v5.patch

Incorporated review comments from Vinod:

- "Can we also change the MapReduce fetcher to use keep-alive depending on whether it is enabled or not?" - HttpURLConnection will automatically use a persistent connection when the Keep-Alive and Content-Length headers are properly set, so there is no need to change the fetcher code.
- Suggestion for configuration renames - Fixed.
- Add both to mapred-default.xml - Fixed.
- Log the keepAlive parameter along with other things like jobId, mapId, etc. - Fixed.
- populateHeaders: we are already parsing jobId, ApplicationId, etc. as part of sendMapOutput; we should avoid doing the string parsing multiple times. "Is setting CONTENT_LENGTH important? Even so, for doing it, we are reading the index-record two times." - Yes, Content-Length is very much needed for keep-alive. Fixed the multiple-parsing issue.
- Instead of re-defining new constants like CONNECTION_HEADER in ShuffleHandler, can you use the standard Java constants (HttpHeaders)? - Fixed.
- Finally, can you reuse code between the two tests? - Fixed.

Modify ShuffleHandler to support Keep-Alive
-------------------------------------------

Key: MAPREDUCE-5787
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: nodemanager
Affects Versions: 2.4.0
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Priority: Critical
Labels: ShuffleKeepalive
Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0-v4.patch, MAPREDUCE-5787-2.4.0-v5.patch, MAPREDUCE-5787-2.4.0.patch

--
This message was sent by Atlassian JIRA (v6.2#6252)
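To illustrate the Content-Length point above: the fetcher's HttpURLConnection can only keep the socket alive if the response tells it where the body ends and that the connection should stay open. The sketch below shows the kind of headers involved, using the Netty 3.x HTTP classes the ShuffleHandler is built on; it is illustrative only, with assumed names, and not the patch code.

{code:java}
import org.jboss.netty.handler.codec.http.DefaultHttpResponse;
import org.jboss.netty.handler.codec.http.HttpHeaders;
import org.jboss.netty.handler.codec.http.HttpResponse;
import org.jboss.netty.handler.codec.http.HttpResponseStatus;
import org.jboss.netty.handler.codec.http.HttpVersion;

public class KeepAliveHeaders {
  /** Build an HTTP/1.1 response whose headers allow the client to reuse the connection. */
  static HttpResponse withKeepAlive(long contentLength, int timeoutSecs) {
    HttpResponse response = new DefaultHttpResponse(HttpVersion.HTTP_1_1, HttpResponseStatus.OK);
    // Content-Length lets the client know where the response ends, so the socket can be reused.
    response.setHeader(HttpHeaders.Names.CONTENT_LENGTH, String.valueOf(contentLength));
    response.setHeader(HttpHeaders.Names.CONNECTION, HttpHeaders.Values.KEEP_ALIVE);
    // Advertise how long the server will keep the idle connection open.
    response.setHeader("Keep-Alive", "timeout=" + timeoutSecs);
    return response;
  }
}
{code}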
[jira] [Commented] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13939955#comment-13939955 ]

Rajesh Balamohan commented on MAPREDUCE-5787:
---------------------------------------------

> Is setting CONTENT_LENGTH important? Even so, for doing it, we are reading the index-record two times - once here and once while sending the output. This will have a performance impact.

Yes, CONTENT_LENGTH is needed on the client side for keep-alive. A placeholder is needed to avoid computing the index record twice. I will refactor and post the patch ASAP.

Modify ShuffleHandler to support Keep-Alive
-------------------------------------------

Key: MAPREDUCE-5787
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: nodemanager
Affects Versions: 2.4.0
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Priority: Critical
Labels: ShuffleKeepalive
Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0-v4.patch, MAPREDUCE-5787-2.4.0.patch

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Open (was: Patch Available) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Patch Available (was: Open) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Attachment: (was: BUG-14568-v3-branch-2.4.0.patch) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Attachment: MAPREDUCE-5787-2.4.0-v3.patch Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Attachment: BUG-14568-v3-branch-2.4.0.patch Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Open (was: Patch Available) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Attachment: (was: MAPREDUCE-5787-2.4.0-v3.patch) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Patch Available (was: Open) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Attachment: MAPREDUCE-5787-2.4.0-v3.patch Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13937421#comment-13937421 ] Rajesh Balamohan commented on MAPREDUCE-5787: - Review Request: https://reviews.apache.org/r/19264/diff/1/?file=521462#file521462line585 Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Open (was: Patch Available) Will incorporate the review comments from Gopal (https://reviews.apache.org/r/19264/diff/1/?file=521462#file521462line585) and upload the patch Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Attachment: MAPREDUCE-5787-2.4.0-v3.patch Incorporated review comments from Gopal (https://reviews.apache.org/r/19264/diff/1/?file=521462#file521462line585) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Patch Available (was: Open) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Attachment: (was: MAPREDUCE-5787-2.4.0-v3.patch) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.0 Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Priority: Critical Labels: ShuffleKeepalive Fix For: 2.4.0 Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Attachment: MAPREDUCE-5787-2.4.0-v4.patch Renaming the patch as v4. Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.0 Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Priority: Critical Labels: ShuffleKeepalive Fix For: 2.4.0 Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0-v4.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Open (was: Patch Available) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.0 Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Priority: Critical Labels: ShuffleKeepalive Fix For: 2.4.0 Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0-v4.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Patch Available (was: Open) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.0 Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Priority: Critical Labels: ShuffleKeepalive Fix For: 2.4.0 Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0-v3.patch, MAPREDUCE-5787-2.4.0-v4.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated MAPREDUCE-5787:
----------------------------------------
    Attachment: MAPREDUCE-5787-2.4.0-v2.patch

- Keep-alive is disabled by default.
- Keep-alive can be enabled by setting mapreduce.shuffle.enable.keep.alive.
- The timeout can be adjusted using mapreduce.shuffle.enable.keep.alive.timeout.
- There is an add-on facility wherein keep-alive can be enabled via the request URL by adding a keepAlive=true parameter. This allows frameworks like Tez to benefit from keep-alive connections without affecting any MR jobs (for which keep-alive connections are disabled by default).

Modify ShuffleHandler to support Keep-Alive
-------------------------------------------

Key: MAPREDUCE-5787
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0.patch

--
This message was sent by Atlassian JIRA (v6.2#6252)
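For illustration, the two properties listed above could be read through Hadoop's standard Configuration API roughly as follows. The constant names and the default timeout value used here are assumptions for the sketch, not the actual ShuffleHandler code.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class KeepAliveConfig {
  // Property names as described in the comment above.
  static final String KEEP_ALIVE_ENABLED = "mapreduce.shuffle.enable.keep.alive";
  static final String KEEP_ALIVE_TIMEOUT = "mapreduce.shuffle.enable.keep.alive.timeout";

  final boolean keepAliveEnabled;
  final int keepAliveTimeoutSecs;

  KeepAliveConfig(Configuration conf) {
    keepAliveEnabled = conf.getBoolean(KEEP_ALIVE_ENABLED, false); // disabled by default
    keepAliveTimeoutSecs = conf.getInt(KEEP_ALIVE_TIMEOUT, 5);     // default here is an assumption
  }
}
{code}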
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Status: Patch Available (was: Open) Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0-v2.patch, MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
[ https://issues.apache.org/jira/browse/MAPREDUCE-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-5787: Attachment: MAPREDUCE-5787-2.4.0.patch Modify ShuffleHandler to support Keep-Alive --- Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: MAPREDUCE-5787-2.4.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5788) Modify Fetcher to pull data using persistent connection
[ https://issues.apache.org/jira/browse/MAPREDUCE-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13931659#comment-13931659 ]

Rajesh Balamohan commented on MAPREDUCE-5788:
---------------------------------------------

The existing HttpURLConnection is capable of handling persistent connections as long as the Content-Length header is specified. It also honors the Keep-Alive: timeout header. ShuffleHandler will send the Content-Length and Keep-Alive: timeout headers if mapreduce.shuffle.enable.keep.alive is set to true. No changes are needed on the Fetcher side (need to mark this as won't-fix).

Modify Fetcher to pull data using persistent connection
--------------------------------------------------------

Key: MAPREDUCE-5788
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5788
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan

--
This message was sent by Atlassian JIRA (v6.2#6252)
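A small standalone example of the behaviour described above: the JDK's HttpURLConnection reuses the underlying socket across requests to the same host on its own, provided the response carries Content-Length (or is otherwise self-delimiting) and the stream is fully drained. This is not Hadoop Fetcher code; the URL, port, and query parameters below are placeholders.

{code:java}
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class KeepAliveClient {
  public static void main(String[] args) throws Exception {
    // Two hypothetical requests to the same server; the JDK pools the connection between them.
    for (String path : new String[] {"/mapOutput?map=1", "/mapOutput?map=2"}) {
      HttpURLConnection conn =
          (HttpURLConnection) new URL("http://localhost:13562" + path).openConnection();
      try (InputStream in = conn.getInputStream()) {
        while (in.read() != -1) {
          // Drain the body fully so the connection can be returned to the internal pool.
        }
      }
      System.out.println("Keep-Alive header: " + conn.getHeaderField("Keep-Alive"));
      // Intentionally not calling conn.disconnect(): that can close the socket and defeat reuse.
    }
  }
}
{code}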
[jira] [Commented] (MAPREDUCE-5786) Support Keep-Alive connections in ShuffleHandler
[ https://issues.apache.org/jira/browse/MAPREDUCE-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13926423#comment-13926423 ]

Rajesh Balamohan commented on MAPREDUCE-5786:
---------------------------------------------

Thanks for the comments, Jason. We need mapreduce.shuffle.enable.keep.alive to enable keep-alive in the ShuffleHandler and mapreduce.shuffle.enable.keep.alive.timeout to determine the timeout value for the persistent connection. For example, a Keep-Alive: timeout=60 header specifies that the connection will be kept alive for 60 seconds, after which the connection will be closed. This will allow us to tune the persistent-connection duration on large clusters with different job patterns.

Support Keep-Alive connections in ShuffleHandler
------------------------------------------------

Key: MAPREDUCE-5786
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5786
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Labels: shuffle

Currently ShuffleHandler supports fetching map-outputs in batches from same host. But there are scenarios wherein, fetchers pull data aggressively (i.e start pulling the data as when they are available). In this case, the number of mapIds that are pulled from same host remains at 1. This causes lots of connections to be established. Number of connections can be reduced a lot if ShuffleHandler supports Keep-Alive.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAPREDUCE-5786) Support Keep-Alive connections in ShuffleHandler
Rajesh Balamohan created MAPREDUCE-5786: --- Summary: Support Keep-Alive connections in ShuffleHandler Key: MAPREDUCE-5786 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5786 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.4.0 Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Currently ShuffleHandler supports fetching map-outputs in batches from same host. But there are scenarios wherein, fetchers pull data aggressively (i.e start pulling the data as when they are available). In this case, the number of mapIds that are pulled from same host remains at 1. This causes lots of connections to be established. Number of connections can be reduced a lot if ShuffleHandler supports Keep-Alive. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAPREDUCE-5787) Modify ShuffleHandler to support Keep-Alive
Rajesh Balamohan created MAPREDUCE-5787: --- Summary: Modify ShuffleHandler to support Keep-Alive Key: MAPREDUCE-5787 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5787 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAPREDUCE-5788) Modify Fetcher to pull data using persistent connection
Rajesh Balamohan created MAPREDUCE-5788: --- Summary: Modify Fetcher to pull data using persistent connection Key: MAPREDUCE-5788 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5788 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5611) CombineFileInputFormat only requests a single location per split when more could be optimal
[ https://issues.apache.org/jira/browse/MAPREDUCE-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848164#comment-13848164 ]

Rajesh Balamohan commented on MAPREDUCE-5611:
---------------------------------------------

Agreed; ideally we should compute and use the *intersection* of the nodes in the split information. We will modify the patch to accommodate this and post the details.

CombineFileInputFormat only requests a single location per split when more could be optimal
--------------------------------------------------------------------------------------------

Key: MAPREDUCE-5611
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5611
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Chandra Prakash Bhagtani
Assignee: Chandra Prakash Bhagtani
Attachments: CombineFileInputFormat-trunk.patch

I have come across an issue with CombineFileInputFormat. Actually I ran a hive query on approx 1.2 GB data with CombineHiveInputFormat which internally uses CombineFileInputFormat. My cluster size is 9 datanodes and max.split.size is 256 MB.

When I ran this query with replication factor 9, hive consistently creates all 6 rack-local tasks and with replication factor 3 it creates 5 rack-local and 1 data-local tasks. When replication factor is 9 (equal to cluster size), all the tasks should be data-local as each datanode contains all the replicas of the input data, but that is not happening, i.e. all the tasks are rack-local.

When I dug into the CombineFileInputFormat.java code in the getMoreSplits method, I found the issue with the following snippet (especially in case of higher replication factor):

{code:title=CombineFileInputFormat.java|borderStyle=solid}
for (Iterator<Map.Entry<String, List<OneBlockInfo>>> iter = nodeToBlocks.entrySet().iterator(); iter.hasNext();) {
  Map.Entry<String, List<OneBlockInfo>> one = iter.next();
  nodes.add(one.getKey());
  List<OneBlockInfo> blocksInNode = one.getValue();
  // for each block, copy it into validBlocks. Delete it from
  // blockToNodes so that the same block does not appear in
  // two different splits.
  for (OneBlockInfo oneblock : blocksInNode) {
    if (blockToNodes.containsKey(oneblock)) {
      validBlocks.add(oneblock);
      blockToNodes.remove(oneblock);
      curSplitSize += oneblock.length;
      // if the accumulated split size exceeds the maximum, then
      // create this split.
      if (maxSize != 0 && curSplitSize >= maxSize) {
        // create an input split and add it to the splits array
        addCreatedSplit(splits, nodes, validBlocks);
        curSplitSize = 0;
        validBlocks.clear();
      }
    }
  }
{code}

The first node in the map nodeToBlocks has all the replicas of the input file, so the above code creates 6 splits, all with only one location. Now if the JT doesn't schedule these tasks on that node, all the tasks will be rack-local, even though all the other datanodes have all the other replicas.

--
This message was sent by Atlassian JIRA (v6.1.4#6159)
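A sketch of the "intersection of the nodes" idea mentioned in the comment above (illustrative only, not the actual patch): a combined split would advertise only the hosts that hold every block assigned to it, rather than just the single node being iterated.

{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitLocations {
  /** Each inner list is the set of hosts holding one block of the combined split. */
  static List<String> commonHosts(List<List<String>> hostsPerBlock) {
    if (hostsPerBlock.isEmpty()) {
      return new ArrayList<String>();
    }
    Set<String> common = new HashSet<String>(hostsPerBlock.get(0));
    for (List<String> hosts : hostsPerBlock) {
      common.retainAll(hosts);   // keep only hosts common to every block (set intersection)
    }
    return new ArrayList<String>(common);
  }
}
{code}

With replication factor 9 on a 9-node cluster, the intersection would contain all nodes, so every combined split could be scheduled data-locally.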
[jira] [Commented] (MAPREDUCE-5611) CombineFileInputFormat creates more rack-local tasks due to less split location info.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13835199#comment-13835199 ]

Rajesh Balamohan commented on MAPREDUCE-5611:
---------------------------------------------

Thanks Chandra. This is a good perf patch. Here are the data-locality numbers, which can be useful to analyze the perf improvement.

Without patch (Job Counters):
  Launched map tasks    0  0  335
  Data-local map tasks  0  0  179
  Rack-local map tasks  0  0  81

With patch (Job Counters):
  Launched map tasks    0  0  335
  Data-local map tasks  0  0  279
  Rack-local map tasks  0  0  47

Data locality improves a lot with this patch in Hive queries.

CombineFileInputFormat creates more rack-local tasks due to less split location info.
--------------------------------------------------------------------------------------

Key: MAPREDUCE-5611
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5611
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: trunk
Reporter: Chandra Prakash Bhagtani
Assignee: Chandra Prakash Bhagtani
Fix For: trunk
Attachments: CombineFileInputFormat-trunk.patch

I have come across an issue with CombineFileInputFormat. Actually I ran a hive query on approx 1.2 GB data with CombineHiveInputFormat which internally uses CombineFileInputFormat. My cluster size is 9 datanodes and max.split.size is 256 MB.

When I ran this query with replication factor 9, hive consistently creates all 6 rack-local tasks and with replication factor 3 it creates 5 rack-local and 1 data-local tasks. When replication factor is 9 (equal to cluster size), all the tasks should be data-local as each datanode contains all the replicas of the input data, but that is not happening, i.e. all the tasks are rack-local.

When I dug into the CombineFileInputFormat.java code in the getMoreSplits method, I found the issue with the following snippet (especially in case of higher replication factor):

{code:title=CombineFileInputFormat.java|borderStyle=solid}
for (Iterator<Map.Entry<String, List<OneBlockInfo>>> iter = nodeToBlocks.entrySet().iterator(); iter.hasNext();) {
  Map.Entry<String, List<OneBlockInfo>> one = iter.next();
  nodes.add(one.getKey());
  List<OneBlockInfo> blocksInNode = one.getValue();
  // for each block, copy it into validBlocks. Delete it from
  // blockToNodes so that the same block does not appear in
  // two different splits.
  for (OneBlockInfo oneblock : blocksInNode) {
    if (blockToNodes.containsKey(oneblock)) {
      validBlocks.add(oneblock);
      blockToNodes.remove(oneblock);
      curSplitSize += oneblock.length;
      // if the accumulated split size exceeds the maximum, then
      // create this split.
      if (maxSize != 0 && curSplitSize >= maxSize) {
        // create an input split and add it to the splits array
        addCreatedSplit(splits, nodes, validBlocks);
        curSplitSize = 0;
        validBlocks.clear();
      }
    }
  }
{code}

The first node in the map nodeToBlocks has all the replicas of the input file, so the above code creates 6 splits, all with only one location. Now if the JT doesn't schedule these tasks on that node, all the tasks will be rack-local, even though all the other datanodes have all the other replicas.

--
This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAPREDUCE-5611) CombineFileInputFormat creates more rack-local tasks due to less split location info.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13835200#comment-13835200 ]

Rajesh Balamohan commented on MAPREDUCE-5611:
---------------------------------------------

Just wanted to add the response times as well:
  Without patch: 289 seconds
  With patch: 219 seconds
This testing was carried out with Hive 0.10.

CombineFileInputFormat creates more rack-local tasks due to less split location info.
--------------------------------------------------------------------------------------

Key: MAPREDUCE-5611
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5611
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: trunk
Reporter: Chandra Prakash Bhagtani
Assignee: Chandra Prakash Bhagtani
Fix For: trunk
Attachments: CombineFileInputFormat-trunk.patch

I have come across an issue with CombineFileInputFormat. Actually I ran a hive query on approx 1.2 GB data with CombineHiveInputFormat which internally uses CombineFileInputFormat. My cluster size is 9 datanodes and max.split.size is 256 MB.

When I ran this query with replication factor 9, hive consistently creates all 6 rack-local tasks and with replication factor 3 it creates 5 rack-local and 1 data-local tasks. When replication factor is 9 (equal to cluster size), all the tasks should be data-local as each datanode contains all the replicas of the input data, but that is not happening, i.e. all the tasks are rack-local.

When I dug into the CombineFileInputFormat.java code in the getMoreSplits method, I found the issue with the following snippet (especially in case of higher replication factor):

{code:title=CombineFileInputFormat.java|borderStyle=solid}
for (Iterator<Map.Entry<String, List<OneBlockInfo>>> iter = nodeToBlocks.entrySet().iterator(); iter.hasNext();) {
  Map.Entry<String, List<OneBlockInfo>> one = iter.next();
  nodes.add(one.getKey());
  List<OneBlockInfo> blocksInNode = one.getValue();
  // for each block, copy it into validBlocks. Delete it from
  // blockToNodes so that the same block does not appear in
  // two different splits.
  for (OneBlockInfo oneblock : blocksInNode) {
    if (blockToNodes.containsKey(oneblock)) {
      validBlocks.add(oneblock);
      blockToNodes.remove(oneblock);
      curSplitSize += oneblock.length;
      // if the accumulated split size exceeds the maximum, then
      // create this split.
      if (maxSize != 0 && curSplitSize >= maxSize) {
        // create an input split and add it to the splits array
        addCreatedSplit(splits, nodes, validBlocks);
        curSplitSize = 0;
        validBlocks.clear();
      }
    }
  }
{code}

The first node in the map nodeToBlocks has all the replicas of the input file, so the above code creates 6 splits, all with only one location. Now if the JT doesn't schedule these tasks on that node, all the tasks will be rack-local, even though all the other datanodes have all the other replicas.

--
This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAPREDUCE-2450) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036047#comment-13036047 ] Rajesh Balamohan commented on MAPREDUCE-2450: - Todd Lipcon added a comment - 19/May/11 06:41 Would this also be fixed by HADOOP-6762? Hi Todd, Hadoop-6762 could be fixing this issue as well. However, I haven't tested with Hadoop-6762. The patch proposed in https://issues.apache.org/jira/secure/attachment/12477611/mapreduce-2450.patch is well tested in large scale cluster repeatedly. Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout -- Key: MAPREDUCE-2450 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2450 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.23.0 Reporter: Matei Zaharia Fix For: 0.23.0 Attachments: HADOOP-5380.Y.20.branch.patch, HADOOP-5380.patch, HADOOP_5380-Y.0.20.20x.patch, mapreduce-2450.patch I'm seeing some map tasks in my jobs take 1 minute to commit after they finish the map computation. On the map side, the output looks like this: code 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 800 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 300 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer = 239075328/298844160 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer = 786432/983040 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200903022127_0001_m_003163_0 is done. 
And is in the process of commiting 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed on local exception: java.nio.channels.ClosedChannelException at org.apache.hadoop.ipc.Client.wrapException(Client.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:733) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source) at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525) at java.lang.Thread.run(Thread.java:619) Caused by: java.nio.channels.ClosedChannelException at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167) at java.nio.channels.SelectableChannel.register(SelectableChannel.java:254) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200903022127_0001_m_003163_0' done. /code In the TaskTracker log, it looks like this: code 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call ping(attempt_200903022127_0001_m_003163_0) from 127.0.0.1:56884: output error 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 50311 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195) at org.apache.hadoop.ipc.Server.access$1900(Server.java:77) at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613) at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981) /code Note that the task actually seemed to commit - it
[jira] [Commented] (MAPREDUCE-2450) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13026246#comment-13026246 ] Rajesh Balamohan commented on MAPREDUCE-2450: - Ran large sort job and GridMix-V3 with 1200 jobs to verify this patch. Large sort-job/gridmix-v3 often simulated the problem reported in the bug and with the patch sort job/gridmix-v3 executed fine without timeout issues in tasklogs. This patch doesn't call for any additional testcases. Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout -- Key: MAPREDUCE-2450 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2450 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.23.0 Reporter: Matei Zaharia Fix For: 0.23.0 Attachments: HADOOP-5380.Y.20.branch.patch, HADOOP-5380.patch, HADOOP_5380-Y.0.20.20x.patch, mapreduce-2450.patch I'm seeing some map tasks in my jobs take 1 minute to commit after they finish the map computation. On the map side, the output looks like this: code 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 800 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 300 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer = 239075328/298844160 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer = 786432/983040 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200903022127_0001_m_003163_0 is done. 
And is in the process of commiting 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed on local exception: java.nio.channels.ClosedChannelException at org.apache.hadoop.ipc.Client.wrapException(Client.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:733) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source) at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525) at java.lang.Thread.run(Thread.java:619) Caused by: java.nio.channels.ClosedChannelException at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167) at java.nio.channels.SelectableChannel.register(SelectableChannel.java:254) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200903022127_0001_m_003163_0' done. /code In the TaskTracker log, it looks like this: code 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call ping(attempt_200903022127_0001_m_003163_0) from 127.0.0.1:56884: output error 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 50311 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195) at org.apache.hadoop.ipc.Server.access$1900(Server.java:77) at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613) at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981) /code Note that the task actually seemed to commit - it didn't get speculatively executed or anything. However, the job
[jira] [Updated] (MAPREDUCE-1461) Feature to instruct rumen-folder utility to skip jobs worth of specific duration
[ https://issues.apache.org/jira/browse/MAPREDUCE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-1461: Status: Open (was: Patch Available) Feature to instruct rumen-folder utility to skip jobs worth of specific duration Key: MAPREDUCE-1461 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1461 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Reporter: Rajesh Balamohan Attachments: MR-1461-trunk.patch, mapreduce-1461--2010-02-05.patch, mapreduce-1461--2010-03-04.patch JSON outputs of rumen on production logs can be huge in the order of multiple GB. Rumen's folder utility helps in getting a smaller snapshot of this JSON data. It would be helpful to have an option in rumen-folder, wherein user can specify a duration from which rumen-folder should start processing data. Related JIRA link: https://issues.apache.org/jira/browse/MAPREDUCE-1295 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-1461) Feature to instruct rumen-folder utility to skip jobs worth of specific duration
[ https://issues.apache.org/jira/browse/MAPREDUCE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated MAPREDUCE-1461:
----------------------------------------
    Attachment: mr-1461-trunk-with-testcases.patch

Attaching the patch with a negative test case as well.

Feature to instruct rumen-folder utility to skip jobs worth of specific duration
----------------------------------------------------------------------------------

Key: MAPREDUCE-1461
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1461
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: tools/rumen
Affects Versions: 0.23.0
Reporter: Rajesh Balamohan
Attachments: MR-1461-trunk.patch, mapreduce-1461--2010-02-05.patch, mapreduce-1461--2010-03-04.patch, mr-1461-trunk-with-testcases.patch

JSON outputs of rumen on production logs can be huge in the order of multiple GB. Rumen's folder utility helps in getting a smaller snapshot of this JSON data. It would be helpful to have an option in rumen-folder, wherein user can specify a duration from which rumen-folder should start processing data. Related JIRA link: https://issues.apache.org/jira/browse/MAPREDUCE-1295

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-1461) Feature to instruct rumen-folder utility to skip jobs worth of specific duration
[ https://issues.apache.org/jira/browse/MAPREDUCE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-1461: Affects Version/s: 0.23.0 Status: Patch Available (was: Open) Feature to instruct rumen-folder utility to skip jobs worth of specific duration Key: MAPREDUCE-1461 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1461 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Affects Versions: 0.23.0 Reporter: Rajesh Balamohan Attachments: MR-1461-trunk.patch, mapreduce-1461--2010-02-05.patch, mapreduce-1461--2010-03-04.patch, mr-1461-trunk-with-testcases.patch JSON outputs of rumen on production logs can be huge in the order of multiple GB. Rumen's folder utility helps in getting a smaller snapshot of this JSON data. It would be helpful to have an option in rumen-folder, wherein user can specify a duration from which rumen-folder should start processing data. Related JIRA link: https://issues.apache.org/jira/browse/MAPREDUCE-1295 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2450) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2450: Status: Open (was: Patch Available) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout -- Key: MAPREDUCE-2450 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2450 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.23.0 Reporter: Matei Zaharia Fix For: 0.23.0 Attachments: HADOOP-5380.Y.20.branch.patch, HADOOP-5380.patch, HADOOP_5380-Y.0.20.20x.patch I'm seeing some map tasks in my jobs take 1 minute to commit after they finish the map computation. On the map side, the output looks like this: code 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 800 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 300 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer = 239075328/298844160 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer = 786432/983040 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200903022127_0001_m_003163_0 is done. And is in the process of commiting 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed on local exception: java.nio.channels.ClosedChannelException at org.apache.hadoop.ipc.Client.wrapException(Client.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:733) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source) at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525) at java.lang.Thread.run(Thread.java:619) Caused by: java.nio.channels.ClosedChannelException at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167) at java.nio.channels.SelectableChannel.register(SelectableChannel.java:254) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200903022127_0001_m_003163_0' done. 
/code In the TaskTracker log, it looks like this: code 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call ping(attempt_200903022127_0001_m_003163_0) from 127.0.0.1:56884: output error 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 50311 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195) at org.apache.hadoop.ipc.Server.access$1900(Server.java:77) at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613) at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981) /code Note that the task actually seemed to commit - it didn't get speculatively executed or anything. However, the job wasn't able to continue until this one task was done. Both parties seem to think the channel was closed. How does the channel get closed externally? If closing it from outside is unavoidable, maybe the right thing to do is to set a much lower timeout, because 1 minute delay can be pretty significant for a small job.
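The last sentence of the description suggests failing fast rather than waiting out the full RPC timeout. The sketch below only illustrates that suggestion for a reporter-style ping loop; TaskUmbilical, its ping() semantics, and the intervals are hypothetical stand-ins, not Hadoop's actual umbilical protocol or the fix eventually applied.
{code}
// Sketch of the mitigation suggested above: a hypothetical TaskUmbilical
// stand-in, where ping() returns false once the TaskTracker no longer tracks
// the attempt. Not the real Hadoop interface.
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

public class PingLoopSketch {

  interface TaskUmbilical {
    boolean ping(String taskAttemptId) throws IOException;
  }

  /**
   * Reporter-style loop. Instead of letting a ClosedChannelException cost a
   * full 60s call timeout, a closed channel triggers a quick retry so a
   * finished task is not held up for a minute.
   */
  static void reportUntilDone(TaskUmbilical umbilical, String attemptId)
      throws InterruptedException {
    final long normalIntervalMs = 3_000;   // regular ping cadence
    final long fastRetryMs = 200;          // quick retry after a dropped channel
    boolean taskKnown = true;
    while (taskKnown) {
      try {
        taskKnown = umbilical.ping(attemptId);
        Thread.sleep(normalIntervalMs);
      } catch (ClosedChannelException e) {
        // Channel was closed underneath us: retry promptly rather than
        // waiting out a long timeout on the next call.
        Thread.sleep(fastRetryMs);
      } catch (IOException e) {
        Thread.sleep(normalIntervalMs);
      }
    }
  }
}
{code}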
[jira] [Updated] (MAPREDUCE-2450) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2450: Attachment: mapreduce-2450.patch resubmitting for hudson build. Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout -- Key: MAPREDUCE-2450 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2450 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.23.0 Reporter: Matei Zaharia Fix For: 0.23.0 Attachments: HADOOP-5380.Y.20.branch.patch, HADOOP-5380.patch, HADOOP_5380-Y.0.20.20x.patch, mapreduce-2450.patch I'm seeing some map tasks in my jobs take 1 minute to commit after they finish the map computation. On the map side, the output looks like this: code 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 800 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 300 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer = 239075328/298844160 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer = 786432/983040 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200903022127_0001_m_003163_0 is done. And is in the process of commiting 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed on local exception: java.nio.channels.ClosedChannelException at org.apache.hadoop.ipc.Client.wrapException(Client.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:733) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source) at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525) at java.lang.Thread.run(Thread.java:619) Caused by: java.nio.channels.ClosedChannelException at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167) at java.nio.channels.SelectableChannel.register(SelectableChannel.java:254) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200903022127_0001_m_003163_0' done. 
/code In the TaskTracker log, it looks like this: code 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call ping(attempt_200903022127_0001_m_003163_0) from 127.0.0.1:56884: output error 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 50311 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195) at org.apache.hadoop.ipc.Server.access$1900(Server.java:77) at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613) at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981) /code Note that the task actually seemed to commit - it didn't get speculatively executed or anything. However, the job wasn't able to continue until this one task was done. Both parties seem to think the channel was closed. How does the channel get closed externally? If closing it from outside is unavoidable, maybe the right thing to do is to set a much lower timeout, because 1 minute
[jira] [Updated] (MAPREDUCE-2450) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2450: Status: Patch Available (was: Open) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout -- Key: MAPREDUCE-2450 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2450 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.23.0 Reporter: Matei Zaharia Fix For: 0.23.0 Attachments: HADOOP-5380.Y.20.branch.patch, HADOOP-5380.patch, HADOOP_5380-Y.0.20.20x.patch, mapreduce-2450.patch I'm seeing some map tasks in my jobs take 1 minute to commit after they finish the map computation. On the map side, the output looks like this: code 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 800 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 300 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer = 239075328/298844160 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer = 786432/983040 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200903022127_0001_m_003163_0 is done. And is in the process of commiting 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed on local exception: java.nio.channels.ClosedChannelException at org.apache.hadoop.ipc.Client.wrapException(Client.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:733) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source) at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525) at java.lang.Thread.run(Thread.java:619) Caused by: java.nio.channels.ClosedChannelException at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167) at java.nio.channels.SelectableChannel.register(SelectableChannel.java:254) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200903022127_0001_m_003163_0' done. 
/code In the TaskTracker log, it looks like this: code 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call ping(attempt_200903022127_0001_m_003163_0) from 127.0.0.1:56884: output error 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 50311 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195) at org.apache.hadoop.ipc.Server.access$1900(Server.java:77) at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613) at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981) /code Note that the task actually seemed to commit - it didn't get speculatively executed or anything. However, the job wasn't able to continue until this one task was done. Both parties seem to think the channel was closed. How does the channel get closed externally? If closing it from outside is unavoidable, maybe the right thing to do is to set a much lower timeout, because 1 minute delay can be pretty
[jira] [Moved] (MAPREDUCE-2450) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan moved HADOOP-5380 to MAPREDUCE-2450: - Fix Version/s: (was: 0.23.0) 0.23.0 Affects Version/s: (was: 0.23.0) 0.23.0 Key: MAPREDUCE-2450 (was: HADOOP-5380) Project: Hadoop Map/Reduce (was: Hadoop Common) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout -- Key: MAPREDUCE-2450 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2450 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.23.0 Reporter: Matei Zaharia Fix For: 0.23.0 Attachments: HADOOP-5380.Y.20.branch.patch, HADOOP-5380.patch, HADOOP_5380-Y.0.20.20x.patch I'm seeing some map tasks in my jobs take 1 minute to commit after they finish the map computation. On the map side, the output looks like this: code 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 800 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 300 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer = 239075328/298844160 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer = 786432/983040 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200903022127_0001_m_003163_0 is done. And is in the process of commiting 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed on local exception: java.nio.channels.ClosedChannelException at org.apache.hadoop.ipc.Client.wrapException(Client.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:733) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source) at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525) at java.lang.Thread.run(Thread.java:619) Caused by: java.nio.channels.ClosedChannelException at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167) at java.nio.channels.SelectableChannel.register(SelectableChannel.java:254) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200903022127_0001_m_003163_0' done. 
/code In the TaskTracker log, it looks like this: code 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call ping(attempt_200903022127_0001_m_003163_0) from 127.0.0.1:56884: output error 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 50311 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195) at org.apache.hadoop.ipc.Server.access$1900(Server.java:77) at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613) at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981) /code Note that the task actually seemed to commit - it didn't get speculatively executed or anything. However, the job wasn't able to continue until this one task was done. Both parties seem to
[jira] [Updated] (MAPREDUCE-2450) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2450: Status: Open (was: Patch Available) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout -- Key: MAPREDUCE-2450 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2450 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.23.0 Reporter: Matei Zaharia Fix For: 0.23.0 Attachments: HADOOP-5380.Y.20.branch.patch, HADOOP-5380.patch, HADOOP_5380-Y.0.20.20x.patch I'm seeing some map tasks in my jobs take 1 minute to commit after they finish the map computation. On the map side, the output looks like this: code 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 800 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 300 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer = 239075328/298844160 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer = 786432/983040 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200903022127_0001_m_003163_0 is done. And is in the process of commiting 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed on local exception: java.nio.channels.ClosedChannelException at org.apache.hadoop.ipc.Client.wrapException(Client.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:733) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source) at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525) at java.lang.Thread.run(Thread.java:619) Caused by: java.nio.channels.ClosedChannelException at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167) at java.nio.channels.SelectableChannel.register(SelectableChannel.java:254) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200903022127_0001_m_003163_0' done. 
/code In the TaskTracker log, it looks like this: code 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call ping(attempt_200903022127_0001_m_003163_0) from 127.0.0.1:56884: output error 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 50311 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195) at org.apache.hadoop.ipc.Server.access$1900(Server.java:77) at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613) at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981) /code Note that the task actually seemed to commit - it didn't get speculatively executed or anything. However, the job wasn't able to continue until this one task was done. Both parties seem to think the channel was closed. How does the channel get closed externally? If closing it from outside is unavoidable, maybe the right thing to do is to set a much lower timeout, because 1 minute delay can be pretty significant for a small job.
[jira] [Updated] (MAPREDUCE-2450) Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2450: Status: Patch Available (was: Open) Moved the JIRA from Hadoop-Common to Hadoop-MapReduce. Resubmitting for hudson build Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s timeout -- Key: MAPREDUCE-2450 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2450 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.23.0 Reporter: Matei Zaharia Fix For: 0.23.0 Attachments: HADOOP-5380.Y.20.branch.patch, HADOOP-5380.patch, HADOOP_5380-Y.0.20.20x.patch I'm seeing some map tasks in my jobs take 1 minute to commit after they finish the map computation. On the map side, the output looks like this: code 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 800 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 300 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer = 239075328/298844160 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer = 786432/983040 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200903022127_0001_m_003163_0 is done. And is in the process of commiting 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed on local exception: java.nio.channels.ClosedChannelException at org.apache.hadoop.ipc.Client.wrapException(Client.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:733) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source) at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525) at java.lang.Thread.run(Thread.java:619) Caused by: java.nio.channels.ClosedChannelException at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167) at java.nio.channels.SelectableChannel.register(SelectableChannel.java:254) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200903022127_0001_m_003163_0' done. 
/code In the TaskTracker log, it looks like this: code 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call ping(attempt_200903022127_0001_m_003163_0) from 127.0.0.1:56884: output error 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 50311 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195) at org.apache.hadoop.ipc.Server.access$1900(Server.java:77) at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613) at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981) /code Note that the task actually seemed to commit - it didn't get speculatively executed or anything. However, the job wasn't able to continue until this one task was done. Both parties seem to think the channel was closed. How does the channel get closed externally? If closing it from outside is unavoidable, maybe the right thing to do is to set a
[jira] [Updated] (MAPREDUCE-1461) Feature to instruct rumen-folder utility to skip jobs worth of specific duration
[ https://issues.apache.org/jira/browse/MAPREDUCE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-1461: Status: Patch Available (was: Open) Feature to instruct rumen-folder utility to skip jobs worth of specific duration Key: MAPREDUCE-1461 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1461 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Reporter: Rajesh Balamohan Attachments: MR-1461-trunk.patch, mapreduce-1461--2010-02-05.patch, mapreduce-1461--2010-03-04.patch JSON outputs of rumen on production logs can be huge in the order of multiple GB. Rumen's folder utility helps in getting a smaller snapshot of this JSON data. It would be helpful to have an option in rumen-folder, wherein user can specify a duration from which rumen-folder should start processing data. Related JIRA link: https://issues.apache.org/jira/browse/MAPREDUCE-1295 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2153) Bring in more job configuration properties in to the trace file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2153: Assignee: Rajesh Balamohan Status: Open (was: Patch Available) Bring in more job configuration properties in to the trace file --- Key: MAPREDUCE-2153 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2153 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Affects Versions: 0.23.0 Reporter: Ravi Gummadi Assignee: Rajesh Balamohan Attachments: MR-2153-patch.txt, MapReduce-2153-trunk.patch, MapReduce-2153-trunk.patch, mr-2153-test-patch-results.txt To emulate distributed cache usage in gridmix jobs, there are 9 configuration properties needed to be available in trace file: (1) mapreduce.job.cache.files (2) mapreduce.job.cache.files.visibilities (3) mapreduce.job.cache.files.filesizes (4) mapreduce.job.cache.files.timestamps (5) mapreduce.job.cache.archives (6) mapreduce.job.cache.archives.visibilities (7) mapreduce.job.cache.archives.filesizes (8) mapreduce.job.cache.archives.timestamps (9) mapreduce.job.cache.symlink.create To emulate data compression in gridmix jobs, trace file should contain the following configuration properties: (1) mapreduce.map.output.compress (2) mapreduce.map.output.compress.codec (3) mapreduce.output.fileoutputformat.compress (4) mapreduce.output.fileoutputformat.compress.codec (5) mapreduce.output.fileoutputformat.compress.type Ideally, gridmix should set many job specific configuration properties like io.sort.mb, io.sort.factor, etc when running simulated jobs to get the same effect of original/real job in terms of spilled records, number of merges, etc. TraceBuilder should bring in all these properties into the generated trace file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
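To make the requirement concrete, here is a rough sketch of pulling the keys listed above out of a job's Configuration into a jobProperties map that a trace writer could serialize. It assumes Hadoop's Configuration class is on the classpath and is not the actual TraceBuilder change; elsewhere in this thread the attached patch is described as saving all job properties under the jobProperties tag, not just this whitelist.
{code}
// Sketch only: copies the distributed-cache and compression keys listed in
// the description from a job Configuration into a jobProperties map.
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;

public class JobPropertiesExtractor {

  private static final List<String> TRACE_KEYS = Arrays.asList(
      "mapreduce.job.cache.files",
      "mapreduce.job.cache.files.visibilities",
      "mapreduce.job.cache.files.filesizes",
      "mapreduce.job.cache.files.timestamps",
      "mapreduce.job.cache.archives",
      "mapreduce.job.cache.archives.visibilities",
      "mapreduce.job.cache.archives.filesizes",
      "mapreduce.job.cache.archives.timestamps",
      "mapreduce.job.cache.symlink.create",
      "mapreduce.map.output.compress",
      "mapreduce.map.output.compress.codec",
      "mapreduce.output.fileoutputformat.compress",
      "mapreduce.output.fileoutputformat.compress.codec",
      "mapreduce.output.fileoutputformat.compress.type");

  /** Collect the keys of interest that are actually set on this job. */
  static Map<String, String> extractJobProperties(Configuration jobConf) {
    Map<String, String> jobProperties = new TreeMap<>();
    for (String key : TRACE_KEYS) {
      String value = jobConf.get(key);
      if (value != null) {
        jobProperties.put(key, value);
      }
    }
    return jobProperties;
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration(false);   // no default resources
    conf.set("mapreduce.map.output.compress", "true");
    System.out.println(extractJobProperties(conf));  // {mapreduce.map.output.compress=true}
  }
}
{code}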
[jira] [Updated] (MAPREDUCE-2153) Bring in more job configuration properties in to the trace file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2153: Status: Patch Available (was: Open) Bring in more job configuration properties in to the trace file --- Key: MAPREDUCE-2153 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2153 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Affects Versions: 0.23.0 Reporter: Ravi Gummadi Assignee: Rajesh Balamohan Attachments: MR-2153-patch.txt, MapReduce-2153-trunk.patch, MapReduce-2153-trunk.patch, mr-2153-test-patch-results.txt To emulate distributed cache usage in gridmix jobs, there are 9 configuration properties needed to be available in trace file: (1) mapreduce.job.cache.files (2) mapreduce.job.cache.files.visibilities (3) mapreduce.job.cache.files.filesizes (4) mapreduce.job.cache.files.timestamps (5) mapreduce.job.cache.archives (6) mapreduce.job.cache.archives.visibilities (7) mapreduce.job.cache.archives.filesizes (8) mapreduce.job.cache.archives.timestamps (9) mapreduce.job.cache.symlink.create To emulate data compression in gridmix jobs, trace file should contain the following configuration properties: (1) mapreduce.map.output.compress (2) mapreduce.map.output.compress.codec (3) mapreduce.output.fileoutputformat.compress (4) mapreduce.output.fileoutputformat.compress.codec (5) mapreduce.output.fileoutputformat.compress.type Ideally, gridmix should set many job specific configuration properties like io.sort.mb, io.sort.factor, etc when running simulated jobs to get the same effect of original/real job in terms of spilled records, number of merges, etc. TraceBuilder should bring in all these properties into the generated trace file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2153) Bring in more job configuration properties in to the trace file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2153: Attachment: MR-2153-patch.txt Fixed the javac warnings in earlier patch Bring in more job configuration properties in to the trace file --- Key: MAPREDUCE-2153 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2153 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Affects Versions: 0.23.0 Reporter: Ravi Gummadi Assignee: Rajesh Balamohan Attachments: MR-2153-patch.txt, MapReduce-2153-trunk.patch, MapReduce-2153-trunk.patch, mr-2153-test-patch-results.txt To emulate distributed cache usage in gridmix jobs, there are 9 configuration properties needed to be available in trace file: (1) mapreduce.job.cache.files (2) mapreduce.job.cache.files.visibilities (3) mapreduce.job.cache.files.filesizes (4) mapreduce.job.cache.files.timestamps (5) mapreduce.job.cache.archives (6) mapreduce.job.cache.archives.visibilities (7) mapreduce.job.cache.archives.filesizes (8) mapreduce.job.cache.archives.timestamps (9) mapreduce.job.cache.symlink.create To emulate data compression in gridmix jobs, trace file should contain the following configuration properties: (1) mapreduce.map.output.compress (2) mapreduce.map.output.compress.codec (3) mapreduce.output.fileoutputformat.compress (4) mapreduce.output.fileoutputformat.compress.codec (5) mapreduce.output.fileoutputformat.compress.type Ideally, gridmix should set many job specific configuration properties like io.sort.mb, io.sort.factor, etc when running simulated jobs to get the same effect of original/real job in terms of spilled records, number of merges, etc. TraceBuilder should bring in all these properties into the generated trace file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2153) Bring in more job configuration properties in to the trace file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2153: Affects Version/s: 0.23.0 Status: Patch Available (was: Open) Bring in more job configuration properties in to the trace file --- Key: MAPREDUCE-2153 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2153 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Affects Versions: 0.23.0 Reporter: Ravi Gummadi Attachments: MapReduce-2153-trunk.patch, mr-2153-test-patch-results.txt To emulate distributed cache usage in gridmix jobs, there are 9 configuration properties needed to be available in trace file: (1) mapreduce.job.cache.files (2) mapreduce.job.cache.files.visibilities (3) mapreduce.job.cache.files.filesizes (4) mapreduce.job.cache.files.timestamps (5) mapreduce.job.cache.archives (6) mapreduce.job.cache.archives.visibilities (7) mapreduce.job.cache.archives.filesizes (8) mapreduce.job.cache.archives.timestamps (9) mapreduce.job.cache.symlink.create To emulate data compression in gridmix jobs, trace file should contain the following configuration properties: (1) mapreduce.map.output.compress (2) mapreduce.map.output.compress.codec (3) mapreduce.output.fileoutputformat.compress (4) mapreduce.output.fileoutputformat.compress.codec (5) mapreduce.output.fileoutputformat.compress.type Ideally, gridmix should set many job specific configuration properties like io.sort.mb, io.sort.factor, etc when running simulated jobs to get the same effect of original/real job in terms of spilled records, number of merges, etc. TraceBuilder should bring in all these properties into the generated trace file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2153) Bring in more job configuration properties in to the trace file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2153: Attachment: MapReduce-2153-trunk.patch Uploading the same patch for running via Hudson Bring in more job configuration properties in to the trace file --- Key: MAPREDUCE-2153 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2153 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Affects Versions: 0.23.0 Reporter: Ravi Gummadi Attachments: MapReduce-2153-trunk.patch, MapReduce-2153-trunk.patch, mr-2153-test-patch-results.txt To emulate distributed cache usage in gridmix jobs, there are 9 configuration properties needed to be available in trace file: (1) mapreduce.job.cache.files (2) mapreduce.job.cache.files.visibilities (3) mapreduce.job.cache.files.filesizes (4) mapreduce.job.cache.files.timestamps (5) mapreduce.job.cache.archives (6) mapreduce.job.cache.archives.visibilities (7) mapreduce.job.cache.archives.filesizes (8) mapreduce.job.cache.archives.timestamps (9) mapreduce.job.cache.symlink.create To emulate data compression in gridmix jobs, trace file should contain the following configuration properties: (1) mapreduce.map.output.compress (2) mapreduce.map.output.compress.codec (3) mapreduce.output.fileoutputformat.compress (4) mapreduce.output.fileoutputformat.compress.codec (5) mapreduce.output.fileoutputformat.compress.type Ideally, gridmix should set many job specific configuration properties like io.sort.mb, io.sort.factor, etc when running simulated jobs to get the same effect of original/real job in terms of spilled records, number of merges, etc. TraceBuilder should bring in all these properties into the generated trace file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2153) Bring in more job configuration properties in to the trace file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2153: Status: Patch Available (was: Open) Bring in more job configuration properties in to the trace file --- Key: MAPREDUCE-2153 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2153 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Affects Versions: 0.23.0 Reporter: Ravi Gummadi Attachments: MapReduce-2153-trunk.patch, MapReduce-2153-trunk.patch, mr-2153-test-patch-results.txt To emulate distributed cache usage in gridmix jobs, there are 9 configuration properties needed to be available in trace file: (1) mapreduce.job.cache.files (2) mapreduce.job.cache.files.visibilities (3) mapreduce.job.cache.files.filesizes (4) mapreduce.job.cache.files.timestamps (5) mapreduce.job.cache.archives (6) mapreduce.job.cache.archives.visibilities (7) mapreduce.job.cache.archives.filesizes (8) mapreduce.job.cache.archives.timestamps (9) mapreduce.job.cache.symlink.create To emulate data compression in gridmix jobs, trace file should contain the following configuration properties: (1) mapreduce.map.output.compress (2) mapreduce.map.output.compress.codec (3) mapreduce.output.fileoutputformat.compress (4) mapreduce.output.fileoutputformat.compress.codec (5) mapreduce.output.fileoutputformat.compress.type Ideally, gridmix should set many job specific configuration properties like io.sort.mb, io.sort.factor, etc when running simulated jobs to get the same effect of original/real job in terms of spilled records, number of merges, etc. TraceBuilder should bring in all these properties into the generated trace file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-1461) Feature to instruct rumen-folder utility to skip jobs worth of specific duration
[ https://issues.apache.org/jira/browse/MAPREDUCE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-1461: Attachment: MR-1461-trunk.patch Regenerated the patch for latest apache trunk codebase Feature to instruct rumen-folder utility to skip jobs worth of specific duration Key: MAPREDUCE-1461 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1461 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Reporter: Rajesh Balamohan Attachments: MR-1461-trunk.patch, mapreduce-1461--2010-02-05.patch, mapreduce-1461--2010-03-04.patch JSON outputs of rumen on production logs can be huge in the order of multiple GB. Rumen's folder utility helps in getting a smaller snapshot of this JSON data. It would be helpful to have an option in rumen-folder, wherein user can specify a duration from which rumen-folder should start processing data. Related JIRA link: https://issues.apache.org/jira/browse/MAPREDUCE-1295 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2153) Bring in more job configuration properties in to the trace file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2153: Attachment: mr-2153-test-patch-results.txt ant test-patch results findbugs are not related to this patch. Bring in more job configuration properties in to the trace file --- Key: MAPREDUCE-2153 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2153 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Reporter: Ravi Gummadi Attachments: MapReduce-2153-trunk.patch, mr-2153-test-patch-results.txt To emulate distributed cache usage in gridmix jobs, there are 9 configuration properties needed to be available in trace file: (1) mapreduce.job.cache.files (2) mapreduce.job.cache.files.visibilities (3) mapreduce.job.cache.files.filesizes (4) mapreduce.job.cache.files.timestamps (5) mapreduce.job.cache.archives (6) mapreduce.job.cache.archives.visibilities (7) mapreduce.job.cache.archives.filesizes (8) mapreduce.job.cache.archives.timestamps (9) mapreduce.job.cache.symlink.create To emulate data compression in gridmix jobs, trace file should contain the following configuration properties: (1) mapreduce.map.output.compress (2) mapreduce.map.output.compress.codec (3) mapreduce.output.fileoutputformat.compress (4) mapreduce.output.fileoutputformat.compress.codec (5) mapreduce.output.fileoutputformat.compress.type Ideally, gridmix should set many job specific configuration properties like io.sort.mb, io.sort.factor, etc when running simulated jobs to get the same effect of original/real job in terms of spilled records, number of merges, etc. TraceBuilder should bring in all these properties into the generated trace file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2417) In Gridmix, in RoundRobinUserResolver mode, the testing/proxy users are not associated with unique users in a trace
[ https://issues.apache.org/jira/browse/MAPREDUCE-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2417: Attachment: MR-2417-trunk.patch In Gridmix, in RoundRobinUserResolver mode, the testing/proxy users are not associated with unique users in a trace --- Key: MAPREDUCE-2417 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2417 Project: Hadoop Map/Reduce Issue Type: Bug Components: contrib/gridmix Reporter: Ravi Gummadi Assignee: Ravi Gummadi Attachments: MR-2417-trunk.patch As per the Gridmix documentation, the testing users should be associated with unique users in the trace. However, Gridmix currently impersonates users on a per-job basis, irrespective of the original user. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
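A small sketch of the behavior the report asks for: pin each distinct trace user to one testing/proxy user, handing proxy users out round-robin, instead of resolving per job. The class and method names below are illustrative assumptions, not Gridmix's actual UserResolver interface or the attached patch.
{code}
// Sketch of the intended behavior only. Jobs from the same original user are
// always impersonated by the same proxy user.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RoundRobinUserMapping {

  private final List<String> proxyUsers;
  private final Map<String, String> traceToProxy = new HashMap<>();
  private int next = 0;

  RoundRobinUserMapping(List<String> proxyUsers) {
    if (proxyUsers.isEmpty()) {
      throw new IllegalArgumentException("need at least one proxy user");
    }
    this.proxyUsers = proxyUsers;
  }

  /** Resolve by original trace user, not per job. */
  synchronized String resolve(String traceUser) {
    String proxy = traceToProxy.get(traceUser);
    if (proxy == null) {
      proxy = proxyUsers.get(next % proxyUsers.size());
      next++;
      traceToProxy.put(traceUser, proxy);
    }
    return proxy;
  }

  public static void main(String[] args) {
    RoundRobinUserMapping m =
        new RoundRobinUserMapping(Arrays.asList("proxy0", "proxy1"));
    System.out.println(m.resolve("alice"));  // proxy0
    System.out.println(m.resolve("bob"));    // proxy1
    System.out.println(m.resolve("alice"));  // proxy0 again: keyed by user, not by job
  }
}
{code}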
[jira] [Updated] (MAPREDUCE-2153) Bring in more job configuration properties in to the trace file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-2153: Attachment: MapReduce-2153-trunk.patch Attaching the patch for apache trunk. This patch ensures that all job properties are saved in the json file under jobProperties tag. Bring in more job configuration properties in to the trace file --- Key: MAPREDUCE-2153 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2153 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tools/rumen Reporter: Ravi Gummadi Attachments: MapReduce-2153-trunk.patch To emulate distributed cache usage in gridmix jobs, there are 9 configuration properties needed to be available in trace file: (1) mapreduce.job.cache.files (2) mapreduce.job.cache.files.visibilities (3) mapreduce.job.cache.files.filesizes (4) mapreduce.job.cache.files.timestamps (5) mapreduce.job.cache.archives (6) mapreduce.job.cache.archives.visibilities (7) mapreduce.job.cache.archives.filesizes (8) mapreduce.job.cache.archives.timestamps (9) mapreduce.job.cache.symlink.create To emulate data compression in gridmix jobs, trace file should contain the following configuration properties: (1) mapreduce.map.output.compress (2) mapreduce.map.output.compress.codec (3) mapreduce.output.fileoutputformat.compress (4) mapreduce.output.fileoutputformat.compress.codec (5) mapreduce.output.fileoutputformat.compress.type Ideally, gridmix should set many job specific configuration properties like io.sort.mb, io.sort.factor, etc when running simulated jobs to get the same effect of original/real job in terms of spilled records, number of merges, etc. TraceBuilder should bring in all these properties into the generated trace file. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (MAPREDUCE-1904) Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator
[ https://issues.apache.org/jira/browse/MAPREDUCE-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908604#action_12908604 ] Rajesh Balamohan commented on MAPREDUCE-1904: - Thanks for the review comments Arun. 1. For #1, I would post the profiler output of which methods are expensive in getLocalPathToRead(). 2. For #2, the code path for LocalDirAllocator.confChanged() need not be called in this context of TaskTracker. Reason: In this context, TaskTracker is trying to check for any config changes related to mapred.local.dir using LocalDirAllocator. Once its read, this parameter does not change over TaskTracker's lifetime. Hence, it is not mandatory to do this check for every invocation. Corner case: When tasktracker goes down and new configs are reloaded, the LRUCache would also be repopulated. Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator --- Key: MAPREDUCE-1904 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1904 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tasktracker Affects Versions: 0.20.1 Reporter: Rajesh Balamohan Attachments: MAPREDUCE-1904-RC10.patch, MAPREDUCE-1904-trunk.patch, profiler output after applying the patch.jpg, TaskTracker- yourkit profiler output .jpg, Thread profiler output showing contention.jpg While profiling tasktracker with Sort benchmark, it was observed that threads block on LocalDirAllocator.getLocalPathToRead() in order to get the index file and temporary map output file. As LocalDirAllocator is tied up with ServetContext, only one instance would be available per tasktracker httpserver. Given the jobid mapid, LocalDirAllocator retrieves index file path and temporary map output file path. getLocalPathToRead() is internally synchronized. Introducing a LRUCache for this lookup reduces the contention heavily (LRUCache with key =jobid +mapid and value=PATH to the file). Size of the LRUCache can be varied based on the environment and I observed a throughput improvement in the order of 4-7% with the introduction of LRUCache. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
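For readers skimming the thread, the caching idea in the description boils down to a small bounded LRU map in front of the synchronized LocalDirAllocator lookup. A minimal sketch follows; the key format, cache size, and class name are illustrative assumptions rather than the code in the attached patches.
{code}
// Minimal sketch of the caching idea: a bounded LRU map keyed by
// jobId + mapId whose value is the already-resolved local path. Key format
// and sizing are illustrative, not the patch's actual code.
import java.util.LinkedHashMap;
import java.util.Map;

public class PathLruCache {

  private final Map<String, String> cache;

  PathLruCache(final int maxEntries) {
    // accessOrder=true turns LinkedHashMap into an LRU structure.
    this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > maxEntries;
      }
    };
  }

  synchronized String get(String jobId, String mapId) {
    return cache.get(jobId + "_" + mapId);
  }

  synchronized void put(String jobId, String mapId, String localPath) {
    cache.put(jobId + "_" + mapId, localPath);
  }
}
{code}
As the description reads, the MapOutputServlet would consult such a cache first and fall back to LocalDirAllocator.getLocalPathToRead() only on a miss, populating the cache with the resolved path; the cache's own lock is far cheaper to hold than the allocator's path resolution, which is where the reported drop in contention would come from.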
[jira] Updated: (MAPREDUCE-1904) Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator
[ https://issues.apache.org/jira/browse/MAPREDUCE-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-1904: Attachment: MAPREDUCE-1904-trunk.patch Attaching the patch for trunk version. Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator --- Key: MAPREDUCE-1904 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1904 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tasktracker Affects Versions: 0.20.1 Reporter: Rajesh Balamohan Attachments: MAPREDUCE-1904-RC10.patch, MAPREDUCE-1904-trunk.patch, profiler output after applying the patch.jpg, TaskTracker- yourkit profiler output .jpg, Thread profiler output showing contention.jpg While profiling tasktracker with Sort benchmark, it was observed that threads block on LocalDirAllocator.getLocalPathToRead() in order to get the index file and temporary map output file. As LocalDirAllocator is tied up with ServetContext, only one instance would be available per tasktracker httpserver. Given the jobid mapid, LocalDirAllocator retrieves index file path and temporary map output file path. getLocalPathToRead() is internally synchronized. Introducing a LRUCache for this lookup reduces the contention heavily (LRUCache with key =jobid +mapid and value=PATH to the file). Size of the LRUCache can be varied based on the environment and I observed a throughput improvement in the order of 4-7% with the introduction of LRUCache. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-1904) Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator
[ https://issues.apache.org/jira/browse/MAPREDUCE-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-1904: Attachment: TaskTracker- yourkit profiler output .jpg LocalDirAllocator.AllocatorPerContext is heavily contended. Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator --- Key: MAPREDUCE-1904 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1904 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tasktracker Affects Versions: 0.20.1 Reporter: Rajesh Balamohan Attachments: MAPREDUCE-1904-RC10.patch, TaskTracker- yourkit profiler output .jpg While profiling tasktracker with Sort benchmark, it was observed that threads block on LocalDirAllocator.getLocalPathToRead() in order to get the index file and temporary map output file. As LocalDirAllocator is tied up with ServetContext, only one instance would be available per tasktracker httpserver. Given the jobid mapid, LocalDirAllocator retrieves index file path and temporary map output file path. getLocalPathToRead() is internally synchronized. Introducing a LRUCache for this lookup reduces the contention heavily (LRUCache with key =jobid +mapid and value=PATH to the file). Size of the LRUCache can be varied based on the environment and I observed a throughput improvement in the order of 4-7% with the introduction of LRUCache. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-1904) Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator
[ https://issues.apache.org/jira/browse/MAPREDUCE-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-1904: Attachment: profiler output after applying the patch.jpg Contention on LocalDirAllocator is now very low, close to 0%. Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator --- Key: MAPREDUCE-1904 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1904 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tasktracker Affects Versions: 0.20.1 Reporter: Rajesh Balamohan Attachments: MAPREDUCE-1904-RC10.patch, profiler output after applying the patch.jpg, TaskTracker- yourkit profiler output .jpg, Thread profiler output showing contention.jpg While profiling tasktracker with Sort benchmark, it was observed that threads block on LocalDirAllocator.getLocalPathToRead() in order to get the index file and temporary map output file. As LocalDirAllocator is tied up with ServetContext, only one instance would be available per tasktracker httpserver. Given the jobid mapid, LocalDirAllocator retrieves index file path and temporary map output file path. getLocalPathToRead() is internally synchronized. Introducing a LRUCache for this lookup reduces the contention heavily (LRUCache with key =jobid +mapid and value=PATH to the file). Size of the LRUCache can be varied based on the environment and I observed a throughput improvement in the order of 4-7% with the introduction of LRUCache. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-1904) Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator
[ https://issues.apache.org/jira/browse/MAPREDUCE-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-1904: Attachment: Thread profiler output showing contention.jpg Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator --- Key: MAPREDUCE-1904 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1904 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tasktracker Affects Versions: 0.20.1 Reporter: Rajesh Balamohan Attachments: MAPREDUCE-1904-RC10.patch, profiler output after applying the patch.jpg, TaskTracker- yourkit profiler output .jpg, Thread profiler output showing contention.jpg While profiling tasktracker with Sort benchmark, it was observed that threads block on LocalDirAllocator.getLocalPathToRead() in order to get the index file and temporary map output file. As LocalDirAllocator is tied up with ServetContext, only one instance would be available per tasktracker httpserver. Given the jobid mapid, LocalDirAllocator retrieves index file path and temporary map output file path. getLocalPathToRead() is internally synchronized. Introducing a LRUCache for this lookup reduces the contention heavily (LRUCache with key =jobid +mapid and value=PATH to the file). Size of the LRUCache can be varied based on the environment and I observed a throughput improvement in the order of 4-7% with the introduction of LRUCache. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-1904) Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator
[ https://issues.apache.org/jira/browse/MAPREDUCE-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-1904: Attachment: MAPREDUCE-1904-RC10.patch Patch for RC10 release is attached here. Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator --- Key: MAPREDUCE-1904 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1904 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tasktracker Affects Versions: 0.20.1 Reporter: Rajesh Balamohan Attachments: MAPREDUCE-1904-RC10.patch While profiling tasktracker with Sort benchmark, it was observed that threads block on LocalDirAllocator.getLocalPathToRead() in order to get the index file and temporary map output file. As LocalDirAllocator is tied up with ServetContext, only one instance would be available per tasktracker httpserver. Given the jobid mapid, LocalDirAllocator retrieves index file path and temporary map output file path. getLocalPathToRead() is internally synchronized. Introducing a LRUCache for this lookup reduces the contention heavily (LRUCache with key =jobid +mapid and value=PATH to the file). Size of the LRUCache can be varied based on the environment and I observed a throughput improvement in the order of 4-7% with the introduction of LRUCache. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAPREDUCE-1904) Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator
Reducing locking contention in TaskTracker.MapOutputServlet's LocalDirAllocator --- Key: MAPREDUCE-1904 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1904 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tasktracker Affects Versions: 0.20.1 Reporter: Rajesh Balamohan While profiling the tasktracker with the Sort benchmark, it was observed that threads block on LocalDirAllocator.getLocalPathToRead() in order to get the index file and the temporary map output file. As LocalDirAllocator is tied to the ServletContext, only one instance is available per tasktracker httpserver. Given the jobid and mapid, LocalDirAllocator retrieves the index file path and the temporary map output file path. getLocalPathToRead() is internally synchronized. Introducing an LRUCache for this lookup reduces the contention heavily (LRUCache with key = jobid + mapid and value = path to the file). The size of the LRUCache can be varied based on the environment, and I observed a throughput improvement in the order of 4-7% with the introduction of the LRUCache. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1533) Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and Counters.makeEscapedString()
[ https://issues.apache.org/jira/browse/MAPREDUCE-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861311#action_12861311 ] Rajesh Balamohan commented on MAPREDUCE-1533: - I am on vacation from 23-Apr to 17-May. I do not have internet access during this time. Please check with my manager sriguru@ for any urgent issues. ~Rajesh.B Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and Counters.makeEscapedString() -- Key: MAPREDUCE-1533 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1533 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Affects Versions: 0.20.1 Reporter: Rajesh Balamohan Assignee: Amar Kamat Attachments: MAPREDUCE-1533-and-others-20100413.1.txt, MAPREDUCE-1533-and-others-20100413.bugfix.txt, mapreduce-1533-v1.4.patch When short jobs are executed in hadoop with OutOfBandHeartbeat=true, the JT executes the heartBeat() method heavily. This internally makes a call to CapacityTaskScheduler.updateQSIObjects(). CapacityTaskScheduler.updateQSIObjects() internally calls String.format() for setting the job scheduling information. Based on the data structure sizes of jobQueuesManager and queueInfoMap, the number of times String.format() gets executed becomes very high. String.format() internally does pattern matching, which turns out to be very heavy. (This was revealed while profiling the JT: almost 57% of the time was spent in CapacityScheduler.assignTasks(), out of which String.format() took 46%.) Would it be possible to do String.format() only at the time of invoking JobInProgress.getSchedulingInfo? This might reduce the pressure on the JT while processing heartbeats. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
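To make the deferred-formatting idea above concrete, here is a minimal, hypothetical sketch: the heartbeat path only stores the raw counts, and the String.format() call happens lazily when the scheduling info is actually requested. The class and method names are illustrative and are not the actual CapacityTaskScheduler or JobInProgress code.
{noformat}
// Illustrative sketch only -- not the actual scheduler code; names are hypothetical.
public class SchedulingInfo {
  private volatile int runningMaps;
  private volatile int runningReduces;

  // Hot heartbeat path: just record numbers, no String.format().
  public void update(int runningMaps, int runningReduces) {
    this.runningMaps = runningMaps;
    this.runningReduces = runningReduces;
  }

  // Called only when a client actually asks for the scheduling info string.
  public String format() {
    return String.format("%d running map tasks, %d running reduce tasks",
        runningMaps, runningReduces);
  }
}
{noformat}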
[jira] Updated: (MAPREDUCE-1461) Feature to instruct rumen-folder utility to skip jobs worth of specific duration
[ https://issues.apache.org/jira/browse/MAPREDUCE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-1461: Attachment: mapreduce-1461--2010-03-04.patch I took the trunk version and generated the patch. Please refer to the attached file. Feature to instruct rumen-folder utility to skip jobs worth of specific duration Key: MAPREDUCE-1461 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1461 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Rajesh Balamohan Fix For: 0.22.0 Attachments: mapreduce-1461--2010-02-05.patch, mapreduce-1461--2010-03-04.patch JSON outputs of rumen on production logs can be huge, in the order of multiple GB. Rumen's folder utility helps in getting a smaller snapshot of this JSON data. It would be helpful to have an option in rumen-folder wherein the user can specify a duration from which rumen-folder should start processing data. Related JIRA link: https://issues.apache.org/jira/browse/MAPREDUCE-1295 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAPREDUCE-1533) reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects
reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects - Key: MAPREDUCE-1533 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1533 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.20.1 Reporter: Rajesh Balamohan When short jobs are executed in hadoop with OutOfBandHeartbeat=true, the JT executes the heartBeat() method heavily. This internally makes a call to CapacityTaskScheduler.updateQSIObjects(). CapacityTaskScheduler.updateQSIObjects() internally calls String.format() for setting the job scheduling information. Based on the data structure sizes of jobQueuesManager and queueInfoMap, the number of times String.format() gets executed becomes very high. String.format() internally does pattern matching, which turns out to be very heavy. (This was revealed while profiling the JT: almost 57% of the time was spent in CapacityScheduler.assignTasks(), out of which String.format() took 46%.) Would it be possible to do String.format() only at the time of invoking JobInProgress.getSchedulingInfo? This might reduce the pressure on the JT while processing heartbeats. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1354) Refactor JobTracker.submitJob to not lock the JobTracker during the HDFS accesses
[ https://issues.apache.org/jira/browse/MAPREDUCE-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834136#action_12834136 ] Rajesh Balamohan commented on MAPREDUCE-1354: - In the latest patch, getTaskCompletionEvents is using synchronized(this), I believe. It has to synchronize on jobs instead:
{noformat}
public TaskCompletionEvent[] getTaskCompletionEvents(
    JobID jobid, int fromEventId, int maxEvents) throws IOException {
  JobInProgress job = null;
  synchronized (jobs) {
    job = this.jobs.get(jobid);
  }
  if (null != job) {
    return isJobInited(job) ?
        job.getTaskCompletionEvents(fromEventId, maxEvents) :
        TaskCompletionEvent.EMPTY_ARRAY;
  }
  return completedJobStatusStore.readJobTaskCompletionEvents(jobid,
      fromEventId, maxEvents);
}
{noformat}
Refactor JobTracker.submitJob to not lock the JobTracker during the HDFS accesses - Key: MAPREDUCE-1354 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1354 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Reporter: Devaraj Das Assignee: Arun C Murthy Priority: Critical Attachments: MAPREDUCE-1354_yhadoop20.patch, MAPREDUCE-1354_yhadoop20.patch, MAPREDUCE-1354_yhadoop20.patch, MAPREDUCE-1354_yhadoop20.patch, MAPREDUCE-1354_yhadoop20.patch, MAPREDUCE-1354_yhadoop20.patch It'd be nice to have the JobTracker object not be locked while accessing the HDFS for reading the jobconf file and while writing the jobinfo file in the submitJob method. We should see if we can avoid taking the lock altogether. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1354) Refactor JobTracker.submitJob to not lock the JobTracker during the HDFS accesses
[ https://issues.apache.org/jira/browse/MAPREDUCE-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834149#action_12834149 ] Rajesh Balamohan commented on MAPREDUCE-1354: - Please ignore the previous comment. I had a discussion with Hemanth and will try out synchronized(jobs) on trunk. Refactor JobTracker.submitJob to not lock the JobTracker during the HDFS accesses - Key: MAPREDUCE-1354 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1354 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Reporter: Devaraj Das Assignee: Arun C Murthy Priority: Critical Attachments: MAPREDUCE-1354_yhadoop20.patch, MAPREDUCE-1354_yhadoop20.patch, MAPREDUCE-1354_yhadoop20.patch, MAPREDUCE-1354_yhadoop20.patch, MAPREDUCE-1354_yhadoop20.patch, MAPREDUCE-1354_yhadoop20.patch It'd be nice to have the JobTracker object not be locked while accessing the HDFS for reading the jobconf file and while writing the jobinfo file in the submitJob method. We should see if we can avoid taking the lock altogether. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAPREDUCE-1495) Reduce locking contention on JobTracker.getTaskCompletionEvents()
Reduce locking contention on JobTracker.getTaskCompletionEvents() - Key: MAPREDUCE-1495 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1495 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 0.20.1 Reporter: Rajesh Balamohan While profiling the JT for slow performance with small jobs, it was observed that JobTracker.getTaskCompletionEvents() is contributing to 40% of the lock contention on the JT. This JIRA ticket is created to explore the possibilities of reducing the synchronized code block in this method. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1495) Reduce locking contention on JobTracker.getTaskCompletionEvents()
[ https://issues.apache.org/jira/browse/MAPREDUCE-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833834#action_12833834 ] Rajesh Balamohan commented on MAPREDUCE-1495: - As of now, it's implemented as follows in JobTracker:
{noformat}
public synchronized TaskCompletionEvent[] getTaskCompletionEvents(
    JobID jobid, int fromEventId, int maxEvents) throws IOException {
  synchronized (this) {
    JobInProgress job = this.jobs.get(jobid);
    if (null != job) {
      if (job.inited()) {
        return job.getTaskCompletionEvents(fromEventId, maxEvents);
      } else {
        return EMPTY_EVENTS;
      }
    }
  }
  return completedJobStatusStore.readJobTaskCompletionEvents(jobid,
      fromEventId, maxEvents);
}
{noformat}
where jobs is a TreeMap<JobID, JobInProgress>. It is possible to reduce the contention in two ways.
1. Reduce the synchronized section to only JobInProgress job = this.jobs.get(jobid); the rest of the code is independent of the synchronized block (afaik).
2. Change the data structure of jobs to ConcurrentHashMap<JobID, JobInProgress>. This way jobs.get(jobid) automatically becomes thread-safe and the synchronization itself can be eliminated.
If it is mandatory to maintain the ordering, I have to try the 1st one. Reduce locking contention on JobTracker.getTaskCompletionEvents() - Key: MAPREDUCE-1495 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1495 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 0.20.1 Reporter: Rajesh Balamohan While profiling the JT for slow performance with small jobs, it was observed that JobTracker.getTaskCompletionEvents() is contributing to 40% of the lock contention on the JT. This JIRA ticket is created to explore the possibilities of reducing the synchronized code block in this method. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
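A minimal sketch of option 2 above, assuming the TreeMap ordering can be given up; it reuses the types from the method shown in the comment and is not the committed code.
{noformat}
// Illustrative sketch of option 2 -- not the actual JobTracker code.
// jobs as a concurrent map: get() is thread-safe without an explicit lock.
private final Map<JobID, JobInProgress> jobs =
    new ConcurrentHashMap<JobID, JobInProgress>();

public TaskCompletionEvent[] getTaskCompletionEvents(
    JobID jobid, int fromEventId, int maxEvents) throws IOException {
  JobInProgress job = jobs.get(jobid);  // no synchronized block needed for the lookup
  if (job != null) {
    return job.inited()
        ? job.getTaskCompletionEvents(fromEventId, maxEvents)
        : EMPTY_EVENTS;
  }
  return completedJobStatusStore.readJobTaskCompletionEvents(jobid,
      fromEventId, maxEvents);
}
{noformat}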
[jira] Commented: (MAPREDUCE-1495) Reduce locking contention on JobTracker.getTaskCompletionEvents()
[ https://issues.apache.org/jira/browse/MAPREDUCE-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834030#action_12834030 ] Rajesh Balamohan commented on MAPREDUCE-1495: - Made the following changes to reduce contention and profiled the JT. This virtually eliminated the contention caused by getTaskCompletionEvents.
In JobTracker.java (locking only jobs during get()):
{noformat}
public TaskCompletionEvent[] getTaskCompletionEvents(
    JobID jobid, int fromEventId, int maxEvents) throws IOException {
  JobInProgress job = null;
  synchronized (jobs) {
    job = this.jobs.get(jobid);
  }
  if (null != job) {
    if (job.inited()) {
      return job.getTaskCompletionEvents(fromEventId, maxEvents);
    } else {
      return EMPTY_EVENTS;
    }
  }
  return completedJobStatusStore.readJobTaskCompletionEvents(jobid,
      fromEventId, maxEvents);
}
{noformat}
In Configuration.java (eliminated synchronization at the method level; this might be required only when properties is null):
{noformat}
private Properties getProps() {
  if (properties == null) {
    synchronized (this) {
      properties = new Properties();
      loadResources(properties, resources, quietmode);
      if (overlay != null) {
        properties.putAll(overlay);
        if (storeResource) {
          for (Map.Entry<Object, Object> item : overlay.entrySet()) {
            updatingResource.put((String) item.getKey(), "Unknown");
          }
        }
      }
    }
  }
  return properties;
}
{noformat}
Reduce locking contention on JobTracker.getTaskCompletionEvents() - Key: MAPREDUCE-1495 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1495 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 0.20.1 Reporter: Rajesh Balamohan While profiling the JT for slow performance with small jobs, it was observed that JobTracker.getTaskCompletionEvents() is contributing to 40% of the lock contention on the JT. This JIRA ticket is created to explore the possibilities of reducing the synchronized code block in this method. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
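One caveat on the Configuration.getProps() snippet above: with the null check outside the lock, this is the double-checked locking pattern, and unless the properties field is volatile (and published only after it is fully populated), another thread could observe a partially initialized Properties object. A minimal, hypothetical variant that builds into a local and assigns last is sketched below; it is not part of the attached patch, the surrounding fields are assumed from the snippet above, and the storeResource/updatingResource bookkeeping is omitted for brevity.
{noformat}
// Hypothetical, safer variant -- not part of the attached patch.
// Assumes 'properties' is declared volatile so the unlocked read is safe.
private Properties getProps() {
  Properties result = properties;
  if (result == null) {
    synchronized (this) {
      if (properties == null) {          // re-check under the lock
        Properties p = new Properties();
        loadResources(p, resources, quietmode);
        if (overlay != null) {
          p.putAll(overlay);
        }
        properties = p;                  // publish only after fully populated
        result = p;
      } else {
        result = properties;
      }
    }
  }
  return result;
}
{noformat}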
[jira] Created: (MAPREDUCE-1461) Feature to instruct rumen-folder utility to skip jobs worth of specific duration
Feature to instruct rumen-folder utility to skip jobs worth of specific duration Key: MAPREDUCE-1461 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1461 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Rajesh Balamohan Fix For: 0.22.0 JSON outputs of rumen on production logs can be huge, in the order of multiple GB. Rumen's folder utility helps in getting a smaller snapshot of this JSON data. It would be helpful to have an option in rumen-folder wherein the user can specify a duration from which rumen-folder should start processing data. Related JIRA link: https://issues.apache.org/jira/browse/MAPREDUCE-1295 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-1461) Feature to instruct rumen-folder utility to skip jobs worth of specific duration
[ https://issues.apache.org/jira/browse/MAPREDUCE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated MAPREDUCE-1461: Attachment: mapreduce-1461--2010-02-05.patch The attached patch implements this feature. The user can specify the time duration to be skipped via the -starts-after command-line argument. Feature to instruct rumen-folder utility to skip jobs worth of specific duration Key: MAPREDUCE-1461 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1461 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Rajesh Balamohan Fix For: 0.22.0 Attachments: mapreduce-1461--2010-02-05.patch JSON outputs of rumen on production logs can be huge, in the order of multiple GB. Rumen's folder utility helps in getting a smaller snapshot of this JSON data. It would be helpful to have an option in rumen-folder wherein the user can specify a duration from which rumen-folder should start processing data. Related JIRA link: https://issues.apache.org/jira/browse/MAPREDUCE-1295 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
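As a rough illustration of the skip behaviour described above (this is not the attached patch; everything here except the -starts-after flag mentioned in the comment is a hypothetical name for this sketch), the folder would simply drop jobs whose submit time falls within the configured offset from the start of the trace:
{noformat}
// Illustrative sketch of the skip logic only -- not the actual rumen Folder code.
// 'startsAfterMs' is the duration given via -starts-after; 'traceStartTimeMs' is the
// submit time of the first job in the trace. Both names are assumptions of this sketch.
boolean shouldSkip(long jobSubmitTimeMs, long traceStartTimeMs, long startsAfterMs) {
  // Jobs submitted within the first 'startsAfterMs' of the trace are skipped;
  // folding starts only with jobs submitted after that point.
  return (jobSubmitTimeMs - traceStartTimeMs) < startsAfterMs;
}
{noformat}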