[jira] [Commented] (MAPREDUCE-6144) DefaultSpeculator always add both MAP and REDUCE Speculative task even MAP_SPECULATIVE or REDUCE_SPECULATIVE is disabled.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224217#comment-14224217 ] zhihai xu commented on MAPREDUCE-6144: -- Hi [~xgong], Yes, you are right. mapContainerNeeds will be empty if MAP_SPECULATIVE is disabled, and if mapContainerNeeds is empty, maybeScheduleAMapSpeculation will return 0. The same holds for reduceContainerNeeds and maybeScheduleAReduceSpeculation, because SpeculatorEventDispatcher in MRAppMaster handles the TASK_CONTAINER_NEED_UPDATE SpeculatorEvent based on the task type and the MAP_SPECULATIVE and REDUCE_SPECULATIVE configurations. Thanks for the review. zhihai DefaultSpeculator always add both MAP and REDUCE Speculative task even MAP_SPECULATIVE or REDUCE_SPECULATIVE is disabled. - Key: MAPREDUCE-6144 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6144 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 2.5.1 Reporter: zhihai xu Assignee: zhihai xu Attachments: MAPREDUCE-6144.000.patch, MAPREDUCE-6144.001.patch DefaultSpeculator always adds both MAP and REDUCE speculative tasks even when MAP_SPECULATIVE or REDUCE_SPECULATIVE is disabled. If both MAP_SPECULATIVE and REDUCE_SPECULATIVE are disabled, DefaultSpeculator won't start. The issue happens when only one of MAP_SPECULATIVE and REDUCE_SPECULATIVE is enabled: both MAP and REDUCE speculative tasks are generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
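The flow described in the comment can be condensed into a short sketch. Everything here is illustrative, not Hadoop code: the class SpeculationSketch, onContainerNeedUpdate, and the simplified counting are stand-ins; only the names maybeScheduleAMapSpeculation, mapContainerNeeds, SpeculatorEventDispatcher, and TASK_CONTAINER_NEED_UPDATE come from the comment above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical condensation of the DefaultSpeculator flow described above.
public class SpeculationSketch {
    private final Map<String, AtomicInteger> mapContainerNeeds = new ConcurrentHashMap<>();
    private final boolean mapSpeculationEnabled;

    public SpeculationSketch(boolean mapSpeculationEnabled) {
        this.mapSpeculationEnabled = mapSpeculationEnabled;
    }

    // Stand-in for SpeculatorEventDispatcher: when speculation is disabled
    // for this task type, the TASK_CONTAINER_NEED_UPDATE event is dropped,
    // so the needs map stays empty.
    public void onContainerNeedUpdate(String taskId, int containersNeeded) {
        if (!mapSpeculationEnabled) {
            return;
        }
        mapContainerNeeds
            .computeIfAbsent(taskId, k -> new AtomicInteger())
            .set(containersNeeded);
    }

    // Stand-in for maybeScheduleAMapSpeculation: with no entries in
    // mapContainerNeeds, nothing is scheduled and 0 is returned.
    public int maybeScheduleAMapSpeculation() {
        int speculations = 0;
        for (AtomicInteger need : mapContainerNeeds.values()) {
            if (need.get() > 0) {
                speculations++;
            }
        }
        return speculations;
    }
}
```

With speculation disabled, the container-need update never reaches the map, so the scheduler has nothing to speculate on — which is why no extra guard is needed in DefaultSpeculator itself.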
[jira] [Updated] (MAPREDUCE-6144) DefaultSpeculator always add both MAP and REDUCE Speculative task even MAP_SPECULATIVE or REDUCE_SPECULATIVE is disabled.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated MAPREDUCE-6144: - Resolution: Not a Problem Status: Resolved (was: Patch Available)
[jira] [Commented] (MAPREDUCE-6168) Old MR client is still broken when receiving new counters from MR job
[ https://issues.apache.org/jira/browse/MAPREDUCE-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224654#comment-14224654 ] Junping Du commented on MAPREDUCE-6168: --- Hi [~zjshen], Thanks for reporting this. The stack trace below: {noformat} org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182) at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154) {noformat} shows that the client is still on 2.2 (https://github.com/apache/hadoop/blob/branch-2.2/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/counters/FrameworkCounterGroup.java), and we fixed the problem in 2.6. That means clients from 2.6 onward can be compatible with new counters in the future (forward compatibility), but it does not mean the existing broken compatibility can be recovered (unless we also fix the code in branch-2.2). [~zjshen], I would like to resolve this JIRA as Won't Fix. Do you agree? Old MR client is still broken when receiving new counters from MR job - Key: MAPREDUCE-6168 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6168 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Zhijie Shen Assignee: Junping Du Priority: Blocker In the following scenario: 1. Either insecure or secure; 2. MR 2.2 with the new shuffle on the NM; 3. Submitting via the old client. 
We will see the following console exception: {code} 14/11/17 14:56:19 INFO mapreduce.Job: Job job_1416264695865_0003 completed successfully java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_REDUCES at java.lang.Enum.valueOf(Enum.java:236) at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148) at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182) at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154) at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240) at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370) at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511) at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756) at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753) at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) {code} The problem was supposed to be fixed by MAPREDUCE-5831; however, it seems that we haven't covered all the problematic code paths.
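The failure mode in the trace above can be illustrated in isolation: Enum.valueOf throws IllegalArgumentException for a name the client's enum does not know. This is a minimal sketch, not Hadoop code; OldJobCounter is a hypothetical stand-in for the 2.2 client's JobCounter enum, which predates MB_MILLIS_REDUCES, and the catch-and-skip pattern is the forward-compatible behavior the MAPREDUCE-5831 style fix aims for.

```java
// Minimal illustration (not Hadoop code) of why the 2.2 client breaks.
public class CounterCompat {
    // Hypothetical stand-in for the old client's JobCounter enum, which
    // does not include MB_MILLIS_REDUCES.
    public enum OldJobCounter { MILLIS_MAPS, MILLIS_REDUCES }

    // Enum.valueOf throws IllegalArgumentException for an unknown name;
    // catching it and skipping the counter keeps the client compatible
    // with counters added by newer servers.
    public static OldJobCounter lookup(String name) {
        try {
            return OldJobCounter.valueOf(name);
        } catch (IllegalArgumentException e) {
            return null; // counter from a newer server: ignore instead of failing
        }
    }
}
```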
[jira] [Commented] (MAPREDUCE-5831) Old MR client is not compatible with new MR application
[ https://issues.apache.org/jira/browse/MAPREDUCE-5831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224663#comment-14224663 ] Junping Du commented on MAPREDUCE-5831: --- Hi [~zjshen], I think the problem has been fixed since 2.6. For 2.2, because it doesn't have the fix, the problem could still be there. Please see my comments in MAPREDUCE-6168. Old MR client is not compatible with new MR application --- Key: MAPREDUCE-5831 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5831 Project: Hadoop Map/Reduce Issue Type: Bug Components: client, mr-am Affects Versions: 2.2.0, 2.3.0 Reporter: Zhijie Shen Assignee: Junping Du Priority: Blocker Fix For: 2.6.0 Attachments: MAPREDUCE-5831-v2.patch, MAPREDUCE-5831-v3.patch, MAPREDUCE-5831.patch Recently, we saw the following scenario: 1. The user set up a cluster of Hadoop 2.3, which contains YARN 2.3 and MR 2.3. 2. The user ran the client on a machine where MR 2.2 is installed and on the classpath. Then, when the user submitted a simple wordcount job, he saw the following message: {code} 16:00:41,027 INFO main mapreduce.Job:1345 - map 100% reduce 100% 16:00:41,036 INFO main mapreduce.Job:1356 - Job job_1396468045458_0006 completed successfully 16:02:20,535 WARN main mapreduce.JobRunner:212 - Cannot start job [wordcountJob] java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_REDUCES at java.lang.Enum.valueOf(Enum.java:236) at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148) at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182) at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154) at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240) at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370) at 
org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511) at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756) at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753) at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289) . . . {code} The problem is that the wordcount job was running on one or more nodes of the YARN cluster, where MR 2.3 libs were installed, and JobCounter.MB_MILLIS_REDUCES is available in the counters. On the other side, due to the classpath setting, the client was likely running with MR 2.2 libs. After the client retrieved the counters from the MR AM, it tried to construct the Counter object with the received counter name. Unfortunately, that enum constant didn't exist in the client's classpath, so the "No enum constant" exception was thrown. JobCounter.MB_MILLIS_REDUCES was brought to MR2 via MAPREDUCE-5464 in Hadoop 2.3.
[jira] [Created] (MAPREDUCE-6173) Document the configuration of deploying MR over distributed cache with enabling wired encryption at the same time
Junping Du created MAPREDUCE-6173: - Summary: Document the configuration of deploying MR over distributed cache with enabling wired encryption at the same time Key: MAPREDUCE-6173 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6173 Project: Hadoop Map/Reduce Issue Type: Improvement Components: distributed-cache, documentation Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Using the currently documented configuration (specified in http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html) with a cluster that enables shuffle encryption (http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html) will cause the job to fail with the exception below: {noformat} 2014-10-10 02:17:16,600 WARN [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to tassapol-centos5nano1-3.cs1cloud.internal:13562 with 1 map outputs javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:174) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1731) at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:241) at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:235) at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1206) at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:136) at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:593) at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Handshaker.java:529) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:925) at 
com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1170) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1197) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1181) at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:434) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:81) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:61) at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:584) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1193) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:318) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:427) {noformat} This is because ssl-client.xml is not included in the MR tarball when we deploy it over the distributed cache. Putting ssl-client.xml on the CLASSPATH of the MR job resolves the problem, and we should document it.
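The workaround described above can be illustrated with a configuration sketch. This fragment is hypothetical: mapreduce.application.classpath is the standard knob for the task classpath, but the exact value depends on how the MR tarball is laid out in your deployment; the only point is that a directory containing ssl-client.xml must appear on the job's classpath.

```xml
<!-- Hypothetical mapred-site.xml fragment. $HADOOP_CONF_DIR is assumed to
     be the directory that holds ssl-client.xml on each node; appending it
     makes the file visible to jobs that run the MR framework from the
     distributed cache. -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_CONF_DIR,$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*</value>
</property>
```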
[jira] [Updated] (MAPREDUCE-6173) Document the configuration of deploying MR over distributed cache with enabling wired encryption at the same time
[ https://issues.apache.org/jira/browse/MAPREDUCE-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated MAPREDUCE-6173: -- Attachment: Screen Shot for MAPREDUCE-6173.png
[jira] [Updated] (MAPREDUCE-6160) Potential NullPointerException in MRClientProtocol interface implementation.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated MAPREDUCE-6160: -- Attachment: MAPREDUCE-6160.3.patch Potential NullPointerException in MRClientProtocol interface implementation. Key: MAPREDUCE-6160 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6160 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Rohith Assignee: Rohith Attachments: MAPREDUCE-6160.1.patch, MAPREDUCE-6160.2.patch, MAPREDUCE-6160.3.patch, MAPREDUCE-6160.patch, MAPREDUCE-6160.patch In the implementation of MRClientProtocol, many methods can throw a NullPointerException. Instead of a NullPointerException, it is better to throw an IOException with a proper message. Both the HistoryClientService class and the MRClientService class have a #verifyAndGetJob() method that can return the job object as null, which affects the following methods: {code} getTaskReport(GetTaskReportRequest request) throws IOException; getTaskAttemptReport(GetTaskAttemptReportRequest request) throws IOException; getCounters(GetCountersRequest request) throws IOException; getTaskAttemptCompletionEvents(GetTaskAttemptCompletionEventsRequest request) throws IOException; getTaskReports(GetTaskReportsRequest request) throws IOException; getDiagnostics(GetDiagnosticsRequest request) throws IOException; {code}
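The direction discussed in this issue, failing fast with an IOException instead of returning null from #verifyAndGetJob(), can be sketched as follows. This is a minimal stand-in, not the HistoryClientService/MRClientService code; the JobLookup class, register() method, and Job placeholder are illustrative.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fail-fast pattern the patch proposes.
public class JobLookup {
    public static class Job { } // placeholder for the real Job interface

    private final Map<String, Job> jobs = new HashMap<>();

    public void register(String jobId, Job job) {
        jobs.put(jobId, job);
    }

    // Instead of returning null (and letting getCounters, getTaskReport,
    // etc. hit a NullPointerException), throw an IOException that names
    // the unknown job id.
    public Job verifyAndGetJob(String jobId) throws IOException {
        Job job = jobs.get(jobId);
        if (job == null) {
            throw new IOException("Unknown Job " + jobId);
        }
        return job;
    }
}
```

Callers already declare `throws IOException`, so no signature changes are needed; only the error surfaced to the client improves.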
[jira] [Updated] (MAPREDUCE-6173) Document the configuration of deploying MR over distributed cache with enabling wired encryption at the same time
[ https://issues.apache.org/jira/browse/MAPREDUCE-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated MAPREDUCE-6173: -- Attachment: MAPREDUCE-6173.patch
[jira] [Updated] (MAPREDUCE-6173) Document the configuration of deploying MR over distributed cache with enabling wired encryption at the same time
[ https://issues.apache.org/jira/browse/MAPREDUCE-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated MAPREDUCE-6173: -- Status: Patch Available (was: Open)
[jira] [Commented] (MAPREDUCE-6173) Document the configuration of deploying MR over distributed cache with enabling wired encryption at the same time
[ https://issues.apache.org/jira/browse/MAPREDUCE-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224762#comment-14224762 ] Junping Du commented on MAPREDUCE-6173: --- Attaching the first patch and a screenshot for this.
[jira] [Commented] (MAPREDUCE-6160) Potential NullPointerException in MRClientProtocol interface implementation.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224764#comment-14224764 ] Rohith commented on MAPREDUCE-6160: --- bq. just one nit. Should we just say Unknown Job + jobId for the error message? I changed the log message as suggested, and I updated the patch with the changed comment message for consistency. Please review.
[jira] [Updated] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated MAPREDUCE-6166: -- Attachment: MAPREDUCE-6166.v2.201411251627.txt Thank you very much [~jira.shegalov] and [~jlowe] for your comments and help in making this patch better. If it's okay, I would like to # update this patch with the {{final}} keyword on {{JobConf jobConf}} # create a separate JIRA for refactoring the code to inherit from a common class. Uploading a patch to cover #1. Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk --- Key: MAPREDUCE-6166 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6166 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 2.6.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: MAPREDUCE-6166.v1.201411221941.txt, MAPREDUCE-6166.v2.201411251627.txt In very large map/reduce jobs (5 maps, 2500 reducers), the intermediate map partition output gets corrupted on disk on the map side. If this corrupted map output is too large to shuffle in memory, the reducer streams it to disk without validating the checksum. In jobs this large, it could take hours before the reducer finally tries to read the corrupted file and fails. Since retries of the failed reduce attempt will also take hours, this delay in discovering the failure is multiplied greatly.
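The core idea behind the fix, validating the checksum while streaming to disk rather than when the reducer later reads the file back, can be shown in a self-contained sketch. This uses plain java.util.zip.CRC32 instead of Hadoop's IFile checksum machinery, and all names (ChecksummedCopy, shuffleToDisk) are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;

// Illustrative (non-Hadoop) sketch: verify a checksum as map output streams
// to disk, so corruption is caught at shuffle time instead of hours later.
public class ChecksummedCopy {
    // Update the checksum as bytes stream past, instead of writing the
    // whole file first and verifying only when it is read back.
    public static long copyAndChecksum(InputStream in, OutputStream out)
            throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            crc.update(buf, 0, n);
            out.write(buf, 0, n);
        }
        return crc.getValue();
    }

    // Fail the fetch immediately if the streamed bytes do not match the
    // checksum the map side advertised.
    public static void shuffleToDisk(byte[] mapOutput, long expectedCrc,
            OutputStream disk) throws IOException {
        long actual = copyAndChecksum(new ByteArrayInputStream(mapOutput), disk);
        if (actual != expectedCrc) {
            throw new IOException("Corrupt map output: crc " + actual
                    + " != expected " + expectedCrc);
        }
    }
}
```

Failing inside the fetch lets the reducer re-request that one map output instead of failing the whole reduce attempt hours later.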
[jira] [Updated] (MAPREDUCE-6160) Potential NullPointerException in MRClientProtocol interface implementation.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated MAPREDUCE-6160: -- Status: Patch Available (was: Open) Potential NullPointerException in MRClientProtocol interface implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6173) Document the configuration of deploying MR over distributed cache with enabling wired encryption at the same time
[ https://issues.apache.org/jira/browse/MAPREDUCE-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224828#comment-14224828 ] Hadoop QA commented on MAPREDUCE-6173: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683586/MAPREDUCE-6173.patch against trunk revision 61a2510. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5049//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5049//console This message is automatically generated. 
Document the configuration of deploying MR over distributed cache with enabling wired encryption at the same time - Key: MAPREDUCE-6173 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6173 Project: Hadoop Map/Reduce Issue Type: Improvement Components: distributed-cache, documentation Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Attachments: MAPREDUCE-6173.patch, Screen Shot for MAPREDUCE-6173.png Using the currently documented configuration (specified in http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html) with a cluster that enables shuffle encryption (http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html) will cause the job to fail with the exception below: {noformat} 2014-10-10 02:17:16,600 WARN [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to tassapol-centos5nano1-3.cs1cloud.internal:13562 with 1 map outputs javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:174) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1731) at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:241) at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:235) at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1206) at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:136) at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:593) at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Handshaker.java:529) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:925) at 
com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1170) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1197) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1181) at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:434) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:81) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:61) at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:584) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1193) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:318) at
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224834#comment-14224834 ] Hadoop QA commented on MAPREDUCE-6166: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683592/MAPREDUCE-6166.v2.201411251627.txt against trunk revision 61a2510. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5050//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5050//console This message is automatically generated. Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6160) Potential NullPointerException in MRClientProtocol interface implementation.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224858#comment-14224858 ] Hadoop QA commented on MAPREDUCE-6160: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683587/MAPREDUCE-6160.3.patch against trunk revision 61a2510. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5048//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5048//console This message is automatically generated. Potential NullPointerException in MRClientProtocol interface implementation. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6168) Old MR client is still broken when receiving new counters from MR job
[ https://issues.apache.org/jira/browse/MAPREDUCE-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224857#comment-14224857 ] Karthik Kambatla commented on MAPREDUCE-6168: - I would call the issue fixed in 2.x only if the 2.2 client is able to talk to a 2.x server and vice versa. This doesn't seem to be the case for 2.6. Let me know if I am missing something here. Old MR client is still broken when receiving new counters from MR job - Key: MAPREDUCE-6168 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6168 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Zhijie Shen Assignee: Junping Du Priority: Blocker In the following scenarios: 1. Either insecure or secure; 2. MR 2.2 with new shuffle on NM; 3. Submitting via old client. We will see the following console exception: {code} 14/11/17 14:56:19 INFO mapreduce.Job: Job job_1416264695865_0003 completed successfully java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_REDUCES at java.lang.Enum.valueOf(Enum.java:236) at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148) at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182) at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154) at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240) at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370) at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511) at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756) at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753) 
at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) {code} The problem was supposed to be fixed by MAPREDUCE-5831; however, it seems that we haven't covered all the problematic code paths. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
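The client-side fix direction referenced in this discussion (MAPREDUCE-5831) can be sketched generically: when a newer server sends counter names the local client's enum does not define, Enum.valueOf throws IllegalArgumentException exactly as in the stack trace above, so a tolerant lookup catches it and lets the caller skip the unknown counter. The enum below is a made-up stand-in, not the real JobCounter.

```java
// Illustrative sketch: a newer server may send counter names the older
// client's enum does not define. Enum.valueOf throws IllegalArgumentException
// for those, so a tolerant lookup returns null and the caller can skip the
// counter instead of failing the whole job-status call.
public class TolerantEnum {
    // Stand-in for an old client's limited view of the counters.
    enum JobCounter { NUM_KILLED_MAPS, NUM_KILLED_REDUCES }

    static JobCounter lookup(String name) {
        try {
            return JobCounter.valueOf(name);
        } catch (IllegalArgumentException e) {
            // Unknown counter from a newer release; ignore rather than fail.
            return null;
        }
    }
}
```

The back-compat limitation discussed above follows directly: only clients that carry this tolerant lookup benefit, which is why 2.2/2.3 clients remain broken without a backport.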
[jira] [Commented] (MAPREDUCE-6168) Old MR client is still broken when receiving new counters from MR job
[ https://issues.apache.org/jira/browse/MAPREDUCE-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224903#comment-14224903 ] Junping Du commented on MAPREDUCE-6168: --- Thanks for the comments [~kasha]! bq. I would call the issue fixed in 2.x only if the 2.2 client is able to talk to a 2.x server and vice versa. This doesn't seem to be the case for 2.6. Let me know if I am missing something here. If a 2.2 client cannot talk with 2.4 and 2.5 servers, how meaningful is it for a 2.2 client to talk with a 2.6 server? The problem exists on the client side: an exception gets thrown when receiving new counters from a server running a newer version. If we want to fix it on the server side, we need some mechanism to filter out the new counters after detecting the client's version. As far as I know, we have no version differentiation for the MR client, which makes a server-side fix harder. Even if we add a version detection mechanism after 2.6.x, we still cannot differentiate earlier release versions, so a server-side counter filter still cannot work. Thoughts? Old MR client is still broken when receiving new counters from MR job -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6168) Old MR client is still broken when receiving new counters from MR job
[ https://issues.apache.org/jira/browse/MAPREDUCE-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224974#comment-14224974 ] Zhijie Shen commented on MAPREDUCE-6168: Now I can recall the context: MR Client 2.2 and 2.3 are not able to talk to the MR 2.4+ runtime (including the recently released 2.6), because more counters were introduced in MR 2.4. In MAPREDUCE-5831, we finally chose to fix the client, such that even in later versions, if we introduce more counters, MR Client 2.6+ is able to stay compatible with them. However, for MR 2.2 and 2.3, the problem is not fixed unless we back-port the fix to these two versions, or we fix the runtime not to emit the new counters to the old client. As to the latter option, it's going to be hard because we don't have the client version information in the communication protocol. I'm okay if we resolve this issue as won't fix. Old MR client is still broken when receiving new counters from MR job -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-5568) JHS returns invalid string for reducer completion percentage if AM restarts with 0 reducer.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated MAPREDUCE-5568: --- Resolution: Fixed Fix Version/s: 2.7.0 Target Version/s: 2.7.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Committed to trunk and branch-2, thanks [~minjikim] ! JHS returns invalid string for reducer completion percentage if AM restarts with 0 reducer. --- Key: MAPREDUCE-5568 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5568 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.4.1, 2.5.1 Reporter: Jian He Assignee: MinJi Kim Fix For: 2.7.0 Attachments: 5568.patch01, 5568.patch02, 5568.patch03, 5568.patch04 JobClient shows: {code} 13/10/05 16:26:09 INFO mapreduce.Job: map 100% reduce NaN% 13/10/05 16:26:09 INFO mapreduce.Job: Job job_1381015536254_0001 completed successfully 13/10/05 16:26:09 INFO mapreduce.Job: Counters: 26 File System Counters FILE: Number of bytes read=0 FILE: Number of bytes written=76741 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=48 HDFS: Number of bytes written=0 HDFS: Number of read operations=1 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 {code} With the mapred job -status command, it shows: {code} Uber job : false Number of maps: 1 Number of reduces: 0 map() completion: 1.0 reduce() completion: NaN Job state: SUCCEEDED retired: false reason for failure: {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
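The NaN above comes from computing completed/total with zero reduce tasks (0/0 in floating point). The guard can be sketched as follows; this is a simplified illustration of the idea, not the actual CompletedJob code:

```java
// Simplified illustration of the fix: with zero tasks of a kind, report the
// phase as complete (1.0) instead of computing completed / total = 0/0 = NaN.
public class Progress {
    static float completion(int completedTasks, int totalTasks) {
        if (totalTasks == 0) {
            return 1.0f; // nothing to do counts as fully done
        }
        return (float) completedTasks / totalTasks;
    }
}
```

With this guard, a job with 0 reducers reports reduce() completion 1.0 rather than NaN in both the JHS and the CLI.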
[jira] [Commented] (MAPREDUCE-5568) JHS returns invalid string for reducer completion percentage if AM restarts with 0 reducer.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225213#comment-14225213 ] Hudson commented on MAPREDUCE-5568: --- FAILURE: Integrated in Hadoop-trunk-Commit #6603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6603/]) MAPREDUCE-5568. Fixed CompletedJob in JHS to show progress percentage correctly in case the number of mappers or reducers is zero. Contributed by MinJi Kim (jianhe: rev 78f7cdbfd6e2b9fac51c369c748ae93d12ef065a) * hadoop-mapreduce-project/CHANGES.txt * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/test/java/org/apache/hadoop/mapreduce/v2/hs/TestJobHistoryEntities.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedJob.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/test/resources/job_1416424547277_0002-1416424775281-root-TeraGen-1416424785433-2-0-SUCCEEDED-default-1416424779349.jhist JHS returns invalid string for reducer completion percentage if AM restarts with 0 reducer. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5568) JHS returns invalid string for reducer completion percentage if AM restarts with 0 reducer.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225293#comment-14225293 ] MinJi Kim commented on MAPREDUCE-5568: -- Awesome. Thanks, @jianhe! JHS returns invalid string for reducer completion percentage if AM restarts with 0 reducer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5785) Derive heap size or mapreduce.*.memory.mb automatically
[ https://issues.apache.org/jira/browse/MAPREDUCE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225462#comment-14225462 ] Jian He commented on MAPREDUCE-5785: After this patch, a job somehow fails because it is not able to launch the task container: {{Error: Could not find or load main class null}}. (might be my own setup problem) Derive heap size or mapreduce.*.memory.mb automatically --- Key: MAPREDUCE-5785 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5785 Project: Hadoop Map/Reduce Issue Type: New Feature Components: mr-am, task Reporter: Gera Shegalov Assignee: Gera Shegalov Fix For: 3.0.0 Attachments: MAPREDUCE-5785.v01.patch, MAPREDUCE-5785.v02.patch, MAPREDUCE-5785.v03.patch, mr-5785-4.patch, mr-5785-5.patch, mr-5785-6.patch Currently users have to set 2 memory-related configs per job / per task type. One first chooses some container size mapreduce.\*.memory.mb and then a corresponding maximum Java heap size -Xmx smaller than mapreduce.\*.memory.mb. This makes sure that the JVM's overall footprint (native memory + Java heap) does not exceed mapreduce.\*.memory.mb. If one forgets to tune -Xmx, the MR-AM might be - allocating big containers whereas the JVM will only use the default -Xmx200m. - allocating small containers that will OOM because -Xmx is too high. With this JIRA, we propose to set -Xmx automatically based on an empirical ratio that can be adjusted. -Xmx is not changed automatically if provided by the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
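The proposal in this issue can be sketched as below. The method name, the 0.8 ratio, and the use of a nullable Long for the user's -Xmx are illustrative assumptions for the sketch, not the committed defaults:

```java
// Illustrative sketch of deriving a heap size from the container size:
// if the user supplied -Xmx explicitly, keep it; otherwise apply a ratio
// so the Java heap leaves headroom for native memory inside the container.
public class HeapSizer {
    static long deriveHeapMb(long containerMb, double heapRatio, Long userXmxMb) {
        if (userXmxMb != null) {
            return userXmxMb; // never override an explicit user setting
        }
        return Math.max(1, (long) (containerMb * heapRatio));
    }
}
```

The design point is the one the description states: the derivation only kicks in when the user has not set -Xmx, so existing jobs keep their behavior.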
[jira] [Commented] (MAPREDUCE-5785) Derive heap size or mapreduce.*.memory.mb automatically
[ https://issues.apache.org/jira/browse/MAPREDUCE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225468#comment-14225468 ] Karthik Kambatla commented on MAPREDUCE-5785: - Actually, this is a bug with the patch itself. It might be best to revert it for now until we fix the issue. Reverting it. Derive heap size or mapreduce.*.memory.mb automatically -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5785) Derive heap size or mapreduce.*.memory.mb automatically
[ https://issues.apache.org/jira/browse/MAPREDUCE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225470#comment-14225470 ] Karthik Kambatla commented on MAPREDUCE-5785: - Reverted. Let me fix the bug and post another patch. Derive heap size or mapreduce.*.memory.mb automatically --- Key: MAPREDUCE-5785 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5785 Project: Hadoop Map/Reduce Issue Type: New Feature Components: mr-am, task Reporter: Gera Shegalov Assignee: Gera Shegalov Fix For: 3.0.0 Attachments: MAPREDUCE-5785.v01.patch, MAPREDUCE-5785.v02.patch, MAPREDUCE-5785.v03.patch, mr-5785-4.patch, mr-5785-5.patch, mr-5785-6.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5785) Derive heap size or mapreduce.*.memory.mb automatically
[ https://issues.apache.org/jira/browse/MAPREDUCE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225487#comment-14225487 ] Hudson commented on MAPREDUCE-5785: --- FAILURE: Integrated in Hadoop-trunk-Commit #6607 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6607/]) Revert MAPREDUCE-5785. Derive heap size or mapreduce.*.memory.mb automatically. (Gera Shegalov and Karthik Kambatla via kasha) (kasha: rev a655973e781caf662b360c96e0fa3f5a873cf676) * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestMapReduceChildJVM.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/JobConf.java * hadoop-mapreduce-project/CHANGES.txt * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/MapReduceChildJVM.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java Derive heap size or mapreduce.*.memory.mb automatically --- Key: MAPREDUCE-5785 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5785 Project: Hadoop Map/Reduce Issue Type: New Feature Components: mr-am, task Reporter: Gera Shegalov Assignee: Gera Shegalov Fix For: 3.0.0 Attachments: MAPREDUCE-5785.v01.patch, MAPREDUCE-5785.v02.patch, MAPREDUCE-5785.v03.patch, mr-5785-4.patch, mr-5785-5.patch, mr-5785-6.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6168) Old MR client is still broken when receiving new counters from MR job
[ https://issues.apache.org/jira/browse/MAPREDUCE-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225572#comment-14225572 ] Karthik Kambatla commented on MAPREDUCE-6168: - Thanks for the details, Junping and Zhijie. It is unfortunate we missed it in 2.2. I guess we can't do much at this point beyond calling it compatible for 2.6 onwards. I am fine with resolving it as Won't Fix too. Old MR client is still broken when receiving new counters from MR job - Key: MAPREDUCE-6168 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6168 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Zhijie Shen Assignee: Junping Du Priority: Blocker In the following scenarios: 1. Either insecure or secure; 2. MR 2.2 with new shuffle on NM; 3. Submitting via old client. We will see the following console exception: {code} 14/11/17 14:56:19 INFO mapreduce.Job: Job job_1416264695865_0003 completed successfully java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_REDUCES at java.lang.Enum.valueOf(Enum.java:236) at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148) at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182) at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154) at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240) at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370) at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511) at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756) at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at 
org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753) at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) {code} The problem was supposed to be fixed by MAPREDUCE-5831; however, it seems that we haven't covered all the problematic code paths. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
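The failure above comes from {{Enum.valueOf}} throwing {{IllegalArgumentException}} when the old client's {{JobCounter}} enum does not contain a counter name (here {{MB_MILLIS_REDUCES}}) that a newer server sends back. A defensive lookup, sketched below with a toy enum, would skip unknown constants instead of failing the whole counter fetch; this is only a model of the compatibility problem, not the actual MAPREDUCE-5831 fix.

```java
import java.util.Optional;

public class CounterLookup {
    // Toy stand-in for the old client's JobCounter enum, which lacks
    // counters added in later releases (e.g. MB_MILLIS_REDUCES).
    enum JobCounter { TOTAL_LAUNCHED_MAPS, TOTAL_LAUNCHED_REDUCES }

    /** Return the constant if known, empty otherwise, instead of letting
     *  IllegalArgumentException propagate up to the job client. */
    static Optional<JobCounter> safeValueOf(String name) {
        try {
            return Optional.of(JobCounter.valueOf(name));
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // unknown counter from a newer server
        }
    }

    public static void main(String[] args) {
        System.out.println(safeValueOf("TOTAL_LAUNCHED_MAPS").isPresent()); // true
        System.out.println(safeValueOf("MB_MILLIS_REDUCES").isPresent());   // false
    }
}
```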
[jira] [Updated] (MAPREDUCE-5932) Provide an option to use a dedicated reduce-side shuffle log
[ https://issues.apache.org/jira/browse/MAPREDUCE-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated MAPREDUCE-5932: - Attachment: MAPREDUCE-5932.v05.patch bq. There's a lot of ways to do this, including tacking on yet another boolean (which isn't great for readability given there's now one for the AppMaster), passing an enum that can differentiate AM/map/task (not sure one exists to reuse), pass the Task/TaskId object and null means AM, etc. Hi [~jlowe], I opted for passing {{task}} as the easiest option. Provide an option to use a dedicated reduce-side shuffle log Key: MAPREDUCE-5932 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5932 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv2 Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: MAPREDUCE-5932.v01.patch, MAPREDUCE-5932.v02.patch, MAPREDUCE-5932.v03.patch, MAPREDUCE-5932.v04.patch, MAPREDUCE-5932.v05.patch For reducers in large jobs, our users cannot easily spot the portions of the log associated with problems in their code. An example reducer with INFO-level logging generates ~3500 lines / ~700KiB of log per second. 95% of the log is the client side of the shuffle, {{org.apache.hadoop.mapreduce.task.reduce.*}}:
{code}
$ wc syslog
3642 48192 691013 syslog
$ grep task.reduce syslog | wc
3424 46534 659038
$ grep task.reduce.ShuffleScheduler syslog | wc
1521 17745 251458
$ grep task.reduce.Fetcher syslog | wc
1045 15340 223683
$ grep task.reduce.InMemoryMapOutput syslog | wc
400 4800 72060
$ grep task.reduce.MergeManagerImpl syslog | wc
432 8200 106555
{code}
Byte percentage breakdown:
{code}
Shuffle total:     95%
ShuffleScheduler:  36%
Fetcher:           32%
InMemoryMapOutput: 10%
MergeManagerImpl:  15%
{code}
While this information is actually often useful for devops debugging shuffle performance issues, the job users are often lost. We propose to have a dedicated syslog.shuffle file.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
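One way to route the noisy shuffle client loggers into their own file can be sketched as a log4j configuration fragment. This is only an illustration of the idea behind syslog.shuffle, not the patch's actual configuration; the appender name and properties here are hypothetical, while {{yarn.app.container.log.dir}} is the standard container log directory property.

```properties
# Route the reduce-side shuffle client's loggers to a dedicated file so
# that user-code messages remain readable in the main syslog.
log4j.logger.org.apache.hadoop.mapreduce.task.reduce=INFO, shuffle
# Do not also echo these events into the root logger's syslog appender.
log4j.additivity.org.apache.hadoop.mapreduce.task.reduce=false
log4j.appender.shuffle=org.apache.log4j.FileAppender
log4j.appender.shuffle.File=${yarn.app.container.log.dir}/syslog.shuffle
log4j.appender.shuffle.layout=org.apache.log4j.PatternLayout
log4j.appender.shuffle.layout.ConversionPattern=%d{ISO8601} %p [%t] %c: %m%n
```

With additivity disabled, the ~95% of log volume from {{task.reduce.*}} lands only in syslog.shuffle, which devops can still consult when debugging shuffle performance.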
[jira] [Commented] (MAPREDUCE-6171) The visibilities of the distributed cache files and archives should be determined by both their permissions and if they are located in HDFS encryption zone
[ https://issues.apache.org/jira/browse/MAPREDUCE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225672#comment-14225672 ] Arun Suresh commented on MAPREDUCE-6171: [~dian.fu], Any reason why the _yarn_ user is blacklisted from _DECRYPT_EEK_ calls? My understanding was that only the HDFS admin, i.e. the _hdfs_ user, needs to be blacklisted. The visibilities of the distributed cache files and archives should be determined by both their permissions and if they are located in HDFS encryption zone --- Key: MAPREDUCE-6171 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6171 Project: Hadoop Map/Reduce Issue Type: Bug Components: security Reporter: Dian Fu The visibilities of the distributed cache files and archives are currently determined by the permissions of these files or archives. The following is the logic of the method isPublic() in class ClientDistributedCacheManager:
{code}
static boolean isPublic(Configuration conf, URI uri,
    Map<URI, FileStatus> statCache) throws IOException {
  FileSystem fs = FileSystem.get(uri, conf);
  Path current = new Path(uri.getPath());
  // the leaf level file should be readable by others
  if (!checkPermissionOfOther(fs, current, FsAction.READ, statCache)) {
    return false;
  }
  return ancestorsHaveExecutePermissions(fs, current.getParent(), statCache);
}
{code}
At the NodeManager side, the yarn user is used to download public files, and the user who submits the job is used to download private files. In normal cases, there is no problem with this. However, if the files are located in an encryption zone (HDFS-6134) and the yarn user is configured in KMS to be disallowed from fetching the DataEncryptionKey (DEK) of this encryption zone, the download of these files will fail.
You can reproduce this issue with the following steps (assume you submit the job as user testUser): # create a clean cluster which has the HDFS cryptographic FileSystem feature # create directory /data/ in HDFS and make it an encryption zone with key name testKey # configure KMS so that only user testUser can decrypt the DEK of key testKey:
{code}
<property>
  <name>key.acl.testKey.DECRYPT_EEK</name>
  <value>testUser</value>
</property>
{code}
# execute the teragen job as user testUser:
{code}
su -s /bin/bash testUser -c "hadoop jar hadoop-mapreduce-examples*.jar teragen 1 /data/terasort-input"
{code}
# execute the terasort job as user testUser:
{code}
su -s /bin/bash testUser -c "hadoop jar hadoop-mapreduce-examples*.jar terasort /data/terasort-input /data/terasort-output"
{code}
You will see logs like this at the job submitter's console:
{code}
INFO mapreduce.Job: Job job_1416860917658_0002 failed with state FAILED due to: Application application_1416860917658_0002 failed 2 times due to AM Container for appattempt_1416860917658_0002_02 exited with exitCode: -1000 due to: org.apache.hadoop.security.authorize.AuthorizationException: User [yarn] is not authorized to perform [DECRYPT_EEK] on key with ACL name [testKey]!!
{code}
The initial idea to solve this issue is to modify the logic in ClientDistributedCacheManager.isPublic to also consider whether the file is in an encryption zone. If it is in an encryption zone, the file should be considered private. Then, at the NodeManager side, the user who submits the job will be used to fetch the file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
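The proposed isPublic change can be modeled in plain Java as below. This is only a sketch of the decision logic: the boolean parameters stand in for the real FileSystem/encryption-zone checks, whose API the eventual patch would use.

```java
public class CacheVisibility {
    /** Model of the proposed check: a file is public only if it is
     *  world-readable, its ancestors are world-executable, AND it lies
     *  outside any HDFS encryption zone. (Booleans replace the real
     *  FileSystem calls for illustration.) */
    static boolean isPublic(boolean inEncryptionZone,
                            boolean otherReadable,
                            boolean ancestorsExecutable) {
        if (inEncryptionZone) {
            // Force private: the NodeManager then localizes the file as the
            // submitting user, who is allowed to DECRYPT_EEK, not as yarn.
            return false;
        }
        return otherReadable && ancestorsExecutable;
    }

    public static void main(String[] args) {
        System.out.println(isPublic(false, true, true)); // true: public file
        System.out.println(isPublic(true, true, true));  // false: in an EZ
    }
}
```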
[jira] [Commented] (MAPREDUCE-5932) Provide an option to use a dedicated reduce-side shuffle log
[ https://issues.apache.org/jira/browse/MAPREDUCE-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225714#comment-14225714 ] Hadoop QA commented on MAPREDUCE-5932: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683732/MAPREDUCE-5932.v05.patch against trunk revision a655973. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.mapreduce.v2.TestMROldApiJobs org.apache.hadoop.mapred.TestJobCleanup The test build failed in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. 
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5051//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5051//console This message is automatically generated. Provide an option to use a dedicated reduce-side shuffle log Key: MAPREDUCE-5932 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5932 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv2 Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: MAPREDUCE-5932.v01.patch, MAPREDUCE-5932.v02.patch, MAPREDUCE-5932.v03.patch, MAPREDUCE-5932.v04.patch, MAPREDUCE-5932.v05.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6171) The visibilities of the distributed cache files and archives should be determined by both their permissions and if they are located in HDFS encryption zone
[ https://issues.apache.org/jira/browse/MAPREDUCE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225726#comment-14225726 ] Dian Fu commented on MAPREDUCE-6171: Hi [~asuresh], key-based ACLs in KMS are currently implemented as a whitelist. So if I configure the following in kms-acl.xml,
{code}
<property>
  <name>key.acl.testKey.DECRYPT_EEK</name>
  <value>testUser</value>
</property>
{code}
then only the {{testUser}} user can make {{DECRYPT_EEK}} calls on key {{testKey}}. If I want the {{yarn}} user to also be able to make {{DECRYPT_EEK}} calls on the {{testKey}} key, I need to add the {{yarn}} user to the above configuration value manually. This means that whenever I configure a key-based {{DECRYPT_EEK}} ACL for some key, I also need to add the {{yarn}} user for that key, since I don't know in advance whether the {{yarn}} user will later need to do {{DECRYPT_EEK}} for this key, as in the example described in this JIRA. This is inconvenient and tricky. On the other hand, fetching the files under an encryption zone as the user who submits the job is more straightforward. The visibilities of the distributed cache files and archives should be determined by both their permissions and if they are located in HDFS encryption zone --- Key: MAPREDUCE-6171 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6171 Project: Hadoop Map/Reduce Issue Type: Bug Components: security Reporter: Dian Fu -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225864#comment-14225864 ] Gera Shegalov commented on MAPREDUCE-6166: -- Sounds good, [~eepayne]. I have a few comments then. Some are in light of a follow-up JIRA. bq. update this patch with the final keyword on JobConf jobConf Let us make the instance variable type a more general {{Configuration}}, as we are not doing anything specific to {{JobConf}}. Instead of introducing a new local variable iFin in {{OnDiskMapOutput#shuffle}}, we can overwrite it as in {{InMemoryMapOutput#shuffle}}. We can either capture the shuffle size in an instance variable, as {{InMemoryMapOutput}} does implicitly via {{memory.length}}, or we can set
{code}
long bytesLeft = compressedLength - ((IFileInputStream) input).getSize();
{code}
Then we don't need to touch the {{input.read}} line to do {{readWithChecksum}}. Good call adding {{finally}} with {{close}}. I also have some comments for the test: {{ios.finish()}} should be removed because it's redundant: {{IFileOutputStream#close()}} will call it as well. We don't need the PrintStream wrapping, and we need to be careful not to leak file descriptors in case I/O fails.
{code}
new PrintStream(fout).print(bout.toString());
fout.close();
{code}
should be something like:
{code}
try {
  fout.write(bout.toByteArray());
} finally {
  fout.close();
}
{code}
Similarly, we need to make sure that {{fin.close()}} is in a try-finally block enclosing the header and shuffle read. Let us not do
{code}
catch (Exception e) {
  fail("OnDiskMapOutput.shuffle did not process the map partition file");
{code}
It's redundant because the exception is failing the test already. Same PrintStream and fout.close remarks for the code creating the corrupted file. {{dataSize/2}}: I believe the Sun Java Coding Style requires spaces around arithmetic operations. In the fragment where we expect the checksum to fail, {{fin.close()}} should be in some finally.
{{catch(Exception e)}} is too broad. Let us be more specific and maybe even log it:
{code}
} catch (ChecksumException e) {
  LOG.info("Expected checksum exception thrown.", e);
}
{code}
Thinking a bit more about the file.out, it does not seem to be cleaned up after the test has finished. But we probably don't even need to create files; we can simply use {{new ByteArrayInputStream(bout.toByteArray())}} and {{new ByteArrayInputStream(corrupted)}} as input. Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk --- Key: MAPREDUCE-6166 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6166 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 2.6.0 Reporter: Eric Payne Assignee: Eric Payne Attachments: MAPREDUCE-6166.v1.201411221941.txt, MAPREDUCE-6166.v2.201411251627.txt In very large map/reduce jobs (5 maps, 2500 reducers), the intermediate map partition output gets corrupted on disk on the map side. If this corrupted map output is too large to shuffle in memory, the reducer streams it to disk without validating the checksum. In jobs this large, it could take hours before the reducer finally tries to read the corrupted file and fails. Since retries of the failed reduce attempt will also take hours, this delay in discovering the failure is multiplied greatly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
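The fix under discussion validates the checksum while the reducer streams map output to disk rather than hours later when the file is read back. The idea can be modeled in plain Java as below; CRC32 stands in for the IFile checksum, and the real code would use {{IFileInputStream}}'s checksum-aware reads instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumOnWrite {
    /** Verify the payload's checksum while writing it out; fail fast on a
     *  mismatch instead of leaving a corrupt on-disk map output to be
     *  discovered much later. (CRC32 models the IFile checksum.) */
    static void copyWithChecksum(byte[] payload, long expectedCrc,
                                 OutputStream out) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if (crc.getValue() != expectedCrc) {
            throw new IOException("checksum error while shuffling to disk");
        }
        out.write(payload);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "map output".getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        copyWithChecksum(data, crc.getValue(), sink); // succeeds
        try {
            copyWithChecksum(data, crc.getValue() + 1, sink); // "corrupt"
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Failing during the shuffle lets the reducer refetch or blame the map output immediately, rather than after hours of reduce work.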
[jira] [Commented] (MAPREDUCE-5932) Provide an option to use a dedicated reduce-side shuffle log
[ https://issues.apache.org/jira/browse/MAPREDUCE-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225873#comment-14225873 ] Gera Shegalov commented on MAPREDUCE-5932: -- I believe the failures are unrelated. I reran the 2 tests on my laptop without any issues. [~jlowe], can you take a look at this version please? Provide an option to use a dedicated reduce-side shuffle log Key: MAPREDUCE-5932 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5932 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv2 Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: MAPREDUCE-5932.v01.patch, MAPREDUCE-5932.v02.patch, MAPREDUCE-5932.v03.patch, MAPREDUCE-5932.v04.patch, MAPREDUCE-5932.v05.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)