[jira] [Commented] (YARN-7747) YARN UI is broken in the minicluster
[ https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17729468#comment-17729468 ]

Gera Shegalov commented on YARN-7747:
-------------------------------------

Sorry, I dropped the ball on this JIRA. I have no bandwidth to work on it. I unassigned it from myself so someone else can pick it up.

> YARN UI is broken in the minicluster
> ------------------------------------
>
>                 Key: YARN-7747
>                 URL: https://issues.apache.org/jira/browse/YARN-7747
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Gera Shegalov
>            Priority: Major
>         Attachments: YARN-7747.001.patch, YARN-7747.002.patch
>
> YARN web apps use non-injected instances of GuiceFilter, i.e. instances
> created by Jetty rather than by Guice itself. This triggers the [call
> path|https://github.com/google/guice/blob/master/extensions/servlet/src/com/google/inject/servlet/GuiceFilter.java#L251]
> where the static field {{pipeline}} is used instead of the instance field
> {{injectedPipeline}}. However, besides the GuiceFilter instances created by
> Jetty, each Guice module generates one as well, and on the injection call
> path this static variable is updated by each instance. Thus, if there are
> multiple modules, as is the case in the minicluster, the one loaded last
> ends up defining the filter pipeline for all Jetty instances. In the
> minicluster case this is the nodemanager UI.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
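The failure mode described in the issue, an instance-scoped value leaking through a static field shared by every instance, can be illustrated with a minimal Java sketch (an analogy, not Guice's actual code):

```java
// Minimal analogy (not Guice's actual code): a static field shared by all
// filter instances means the module constructed last wins for every
// non-injected (Jetty-created) filter, mirroring the GuiceFilter behavior.
class StaticPipelineDemo {
    static String pipeline = "default";   // analogous to GuiceFilter.pipeline
    final String injectedPipeline;        // analogous to injectedPipeline

    StaticPipelineDemo(String p) {
        this.injectedPipeline = p;
        pipeline = p;  // each newly created module overwrites the shared static
    }

    // a non-injected instance can only fall back to the static field
    String effectivePipeline(boolean injected) {
        return injected ? injectedPipeline : pipeline;
    }

    public static void main(String[] args) {
        StaticPipelineDemo rm = new StaticPipelineDemo("rm-webapp");
        StaticPipelineDemo nm = new StaticPipelineDemo("nm-webapp"); // loaded last
        // the RM's Jetty-created (non-injected) filter now serves the NM pipeline
        System.out.println(rm.effectivePipeline(false));
        System.out.println(nm.effectivePipeline(true));
    }
}
```

The names "rm-webapp" and "nm-webapp" are placeholders standing in for the filter pipelines of the two daemons' web apps.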
[jira] [Assigned] (YARN-7747) YARN UI is broken in the minicluster
[ https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gera Shegalov reassigned YARN-7747:
-----------------------------------

    Assignee:     (was: Gera Shegalov)
[jira] [Assigned] (YARN-11055) In cgroups-operations.c some fprintf format strings don't end with "\n"
[ https://issues.apache.org/jira/browse/YARN-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gera Shegalov reassigned YARN-11055:
------------------------------------

    Assignee: Gera Shegalov

> In cgroups-operations.c some fprintf format strings don't end with "\n"
> -----------------------------------------------------------------------
>
>                 Key: YARN-11055
>                 URL: https://issues.apache.org/jira/browse/YARN-11055
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.3.1
>            Reporter: Gera Shegalov
>            Assignee: Gera Shegalov
>            Priority: Minor
>              Labels: cgroups, easyfix
>
> In cgroups-operations.c some {{fprintf}}s are missing a newline character at
> the end, leading to hard-to-parse error message output.
> Example:
> https://github.com/apache/hadoop/blame/b225287913ac366a531eacfa0266adbdf03d883e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/cgroups/cgroups-operations.c#L130
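The effect of the missing newline is easy to reproduce. The sketch below shows it in Java rather than in the container-executor's C code; the message text is invented for illustration, but the behavior of an unterminated format string is the same:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Demonstrates why error format strings need a trailing newline: without
// one, consecutive messages fuse into a single hard-to-parse line.
class NewlineDemo {
    static String emitTwoErrors(String fmt) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream err = new PrintStream(buf);
        err.printf(fmt, "/sys/fs/cgroup/devices");  // hypothetical message
        err.printf(fmt, "/sys/fs/cgroup/cpu");
        err.flush();
        return buf.toString();
    }

    public static void main(String[] args) {
        // without "\n": both messages run together on one line
        System.out.println(emitTwoErrors("failed to open %s"));
        // with "\n": one message per line
        System.out.print(emitTwoErrors("failed to open %s\n"));
    }
}
```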
[jira] [Updated] (YARN-11055) In cgroups-operations.c some fprintf format strings don't end with "\n"
[ https://issues.apache.org/jira/browse/YARN-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gera Shegalov updated YARN-11055:
---------------------------------

    Summary: In cgroups-operations.c some fprintf format strings don't end with "\n"  (was: In cgroups-operations.c some fprintf format strings lack "\n")
[jira] [Updated] (YARN-11055) In cgroups-operations.c some fprintf format strings lack "\n"
[ https://issues.apache.org/jira/browse/YARN-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gera Shegalov updated YARN-11055:
---------------------------------

    Summary: In cgroups-operations.c some fprintf format strings lack "\n"  (was: cgroups-operations.c some fprintf format strings lack "\n")
[jira] [Updated] (YARN-11055) cgroups-operations.c some fprintf format strings lack "\n"
[ https://issues.apache.org/jira/browse/YARN-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gera Shegalov updated YARN-11055:
---------------------------------

    Priority: Minor  (was: Major)
[jira] [Created] (YARN-11056) Incorrect capitalization of NVIDIA in the docs
Gera Shegalov created YARN-11056:
------------------------------------

             Summary: Incorrect capitalization of NVIDIA in the docs
                 Key: YARN-11056
                 URL: https://issues.apache.org/jira/browse/YARN-11056
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Gera Shegalov

According to [https://www.nvidia.com/en-us/about-nvidia/legal-info/] the spelling should be all-caps NVIDIA.

Examples of differing capitalization:
https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/UsingGpus.md
[jira] [Created] (YARN-11055) cgroups-operations.c some fprintf format strings lack "\n"
Gera Shegalov created YARN-11055:
------------------------------------

             Summary: cgroups-operations.c some fprintf format strings lack "\n"
                 Key: YARN-11055
                 URL: https://issues.apache.org/jira/browse/YARN-11055
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 3.3.1, 3.3.0, 3.2.0, 3.1.0, 3.0.0
            Reporter: Gera Shegalov

In cgroups-operations.c some {{fprintf}}s are missing a newline character at the end, leading to hard-to-parse error message output.

Example:
https://github.com/apache/hadoop/blame/b225287913ac366a531eacfa0266adbdf03d883e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/cgroups/cgroups-operations.c#L130
[jira] [Commented] (YARN-1529) Add Localization overhead metrics to NM
[ https://issues.apache.org/jira/browse/YARN-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169137#comment-17169137 ]

Gera Shegalov commented on YARN-1529:
-------------------------------------

I am glad this is still useful. Thanks for committing, [~Jim_Brennan] [~epayne]!

> Add Localization overhead metrics to NM
> ---------------------------------------
>
>                 Key: YARN-1529
>                 URL: https://issues.apache.org/jira/browse/YARN-1529
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: Gera Shegalov
>            Assignee: Jim Brennan
>            Priority: Major
>             Fix For: 3.2.2, 2.10.1, 3.4.0, 3.3.1, 3.1.5
>
>         Attachments: YARN-1529-branch-2.10.001.patch, YARN-1529.005.patch,
> YARN-1529.006.patch, YARN-1529.v01.patch, YARN-1529.v02.patch,
> YARN-1529.v03.patch, YARN-1529.v04.patch
>
> Users are often unaware of the localization cost that their jobs incur. To
> measure the effectiveness of localization caches it is necessary to expose
> the overhead in the form of metrics.
> We propose adding the following metrics to NodeManagerMetrics.
> When a container is about to launch, its set of LocalResources has to be
> fetched from a central location, typically on HDFS, which results in a number
> of download requests for the files missing from caches.
> LocalizedFilesMissed: total files (requests) downloaded from DFS. Cache misses.
> LocalizedFilesCached: total localization requests that were served from local caches. Cache hits.
> LocalizedBytesMissed: total bytes downloaded from DFS due to cache misses.
> LocalizedBytesCached: total bytes satisfied from local caches.
> Localized(Files|Bytes)CachedRatio: percentage of localized (files|bytes) that
> were served out of cache: ratio = 100 * cached / (cached + missed)
> LocalizationDownloadNanos: total elapsed time in nanoseconds for a container
> to go from ResourceRequestTransition to LocalizedTransition
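The proposed ratio metric is a simple derived value. A sketch of the computation (the method name is illustrative, not the patch's API), with a guard for the empty-cache case:

```java
// Sketch of the proposed cached-ratio metric:
// ratio = 100 * cached / (cached + missed), guarding against division by zero.
class LocalizationRatio {
    static long cachedRatioPercent(long cached, long missed) {
        long total = cached + missed;
        return total == 0 ? 0 : 100 * cached / total;
    }

    public static void main(String[] args) {
        // e.g. 3 cache hits and 1 miss -> 75% served from cache
        System.out.println(cachedRatioPercent(3, 1));
    }
}
```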
[jira] [Commented] (YARN-7747) YARN UI is broken in the minicluster
[ https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637609#comment-16637609 ]

Gera Shegalov commented on YARN-7747:
-------------------------------------

[~ste...@apache.org] we definitely need tests to prevent this kind of regression in the future. We could make sure that all web/http address keys are properly reflected in the MiniYARNCluster#getConfig implementation, and then probe all of them through an easy-to-validate REST API: the RM URI should respond to the RM-specific REST resources, and so on.
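The probing idea sketched above could look like the following: derive each daemon's probe URI from its configured web address and hit a daemon-specific endpoint such as the RM's {{/ws/v1/cluster/info}}. The helper below is hypothetical, not MiniYARNCluster code:

```java
import java.net.URI;

// Hypothetical helper for the suggested test: build a daemon-specific REST
// probe URI from the web address found via MiniYARNCluster#getConfig; a test
// would then GET it and require a daemon-specific response body.
class RestProbe {
    static URI probeUri(String webAddress, String daemonSpecificPath) {
        return URI.create("http://" + webAddress + daemonSpecificPath);
    }

    public static void main(String[] args) {
        // the RM should answer on its cluster-info resource
        System.out.println(probeUri("localhost:8088", "/ws/v1/cluster/info"));
    }
}
```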
[jira] [Updated] (YARN-7747) YARN UI is broken in the minicluster
[ https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gera Shegalov updated YARN-7747:
--------------------------------

    Attachment: YARN-7747.002.patch
[jira] [Created] (YARN-7847) Provide permalinks for container logs
Gera Shegalov created YARN-7847:
-----------------------------------

             Summary: Provide permalinks for container logs
                 Key: YARN-7847
                 URL: https://issues.apache.org/jira/browse/YARN-7847
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: amrmproxy
            Reporter: Gera Shegalov

YARN doesn't offer a service similar to the AM proxy URL for container logs, even when log aggregation is enabled. The current mechanism of having the NM redirect to yarn.log.server.url fails once the node is down. Workarounds like the on-the-fly URI rewriting in the MR JobHistory server are possible, but do not represent a good long-term solution for onboarding new apps.
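For illustration, one possible shape of such a permalink service: a stable, cluster-level URL is resolved to the aggregated-log location so it keeps working after the node goes away. The URL layout below is an assumption made for this sketch, not an existing YARN API:

```java
// Hypothetical permalink resolution: a stable front-end URL is rewritten to
// the aggregated log server (yarn.log.server.url) so links survive node loss.
// The path layout below is assumed for illustration only.
class LogPermalink {
    static String resolve(String logServerUrl, String nodeId,
                          String containerId, String user) {
        return logServerUrl + "/" + nodeId + "/" + containerId
            + "/" + containerId + "/" + user;
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://history:19888/jobhistory/logs",
            "node1:8041", "container_1391466602060_0011_01_000001", "admin"));
    }
}
```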
[jira] [Commented] (YARN-7747) YARN UI is broken in the minicluster
[ https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326684#comment-16326684 ]

Gera Shegalov commented on YARN-7747:
-------------------------------------

The TestContainerLogsPage failure is tracked in YARN-7734. The asflicense -1 is not caused by this patch. I can write a test for test4tests if the approach is accepted.
[jira] [Updated] (YARN-7747) YARN UI is broken in the minicluster
[ https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gera Shegalov updated YARN-7747:
--------------------------------

    Attachment: YARN-7747.001.patch

001 patch proposal
[jira] [Created] (YARN-7747) YARN UI is broken in the minicluster
Gera Shegalov created YARN-7747:
-----------------------------------

             Summary: YARN UI is broken in the minicluster
                 Key: YARN-7747
                 URL: https://issues.apache.org/jira/browse/YARN-7747
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Gera Shegalov
            Assignee: Gera Shegalov

YARN web apps use non-injected instances of GuiceFilter, i.e. instances created by Jetty rather than by Guice itself. This triggers the [call path|https://github.com/google/guice/blob/master/extensions/servlet/src/com/google/inject/servlet/GuiceFilter.java#L251] where the static field {{pipeline}} is used instead of the instance field {{injectedPipeline}}. However, besides the GuiceFilter instances created by Jetty, each Guice module generates one as well, and on the injection call path this static variable is updated by each instance. Thus, if there are multiple modules, as is the case in the minicluster, the one loaded last ends up defining the filter pipeline for all Jetty instances. In the minicluster case this is the nodemanager UI.
[jira] [Created] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml
Gera Shegalov created YARN-7592:
-----------------------------------

             Summary: yarn.federation.failover.enabled missing in yarn-default.xml
                 Key: YARN-7592
                 URL: https://issues.apache.org/jira/browse/YARN-7592
             Project: Hadoop YARN
          Issue Type: Bug
          Components: federation
    Affects Versions: 3.0.0-beta1
            Reporter: Gera Shegalov

yarn.federation.failover.enabled should be documented in yarn-default.xml. I am also not sure why it should be true by default and force the HA retry policy in {{RMProxy#createRMProxy}}.
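A sketch of the missing yarn-default.xml entry in the file's usual format; the description wording here is illustrative, not the text of any committed patch:

```xml
<!-- Illustrative yarn-default.xml entry; the description text is a sketch,
     not committed wording. The default of true matches the behavior the
     issue questions. -->
<property>
  <description>Whether RMProxy should use the HA failover retry policy
    when YARN federation is enabled.</description>
  <name>yarn.federation.failover.enabled</name>
  <value>true</value>
</property>
```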
[jira] [Updated] (YARN-1728) Workaround guice3x-undecoded pathInfo in YARN WebApp
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gera Shegalov updated YARN-1728:
--------------------------------

    Summary: Workaround guice3x-undecoded pathInfo in YARN WebApp  (was: History server doesn't understand percent encoded paths)

> Workaround guice3x-undecoded pathInfo in YARN WebApp
> ----------------------------------------------------
>
>                 Key: YARN-1728
>                 URL: https://issues.apache.org/jira/browse/YARN-1728
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Abraham Elmahrek
>            Assignee: Yuanbo Liu
>             Fix For: 2.8.0, 2.7.4, 3.0.0-alpha3
>
>         Attachments: test-case-for-trunk.patch, YARN-1728-branch-2.001.patch,
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch,
> YARN-1728-branch-2.004.patch, YARN-1728-branch-2.005.patch
>
> For example, going to the job history server page
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId:
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Whereas the url-decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported, as the former is simply percent
> encoding.
[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888529#comment-15888529 ]

Gera Shegalov commented on YARN-1728:
-------------------------------------

+1, committing.
[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886729#comment-15886729 ]

Gera Shegalov commented on YARN-1728:
-------------------------------------

Minor thing: since we have this catch clause, can we add the pathInfo value and stack trace to the log message?
{code}
    } catch (URISyntaxException ex) {
      LOG.error(pathInfo + ": Failed to decode path.", ex);
    }
{code}
[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885826#comment-15885826 ]

Gera Shegalov commented on YARN-1728:
-------------------------------------

[~yuanbo], thanks for the latest patch. I suggested URI.create because we are guaranteed to get a valid pathInfo from the servlet container, but it's indeed good to be defensive since we are already dealing with a servlet bug. I am generally +1.

In trunk the issue is fixed thanks to guice 4.0/HADOOP-12064 cc: [~ozawa]. And as the quote from the spec says, we must not decode twice. Therefore I suggest we split this patch: the test-only patch should go into both trunk and branch-2 so that we catch the issue in all releases, and the actual fix should go into branch-2.
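The "must not decode twice" point is concrete: a literal percent sign in a path survives one decode but is corrupted by a second. A small sketch (the path is invented for illustration):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Shows why a percent-encoded path must be decoded exactly once: the escaped
// percent sign "%25" survives one decode, but a second decode corrupts it.
class DoubleDecode {
    static String decodeOnce(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e);  // UTF-8 is always present
        }
    }

    public static void main(String[] args) {
        String raw = "/logs/file%2541";           // encodes "/logs/file%41"
        String once = decodeOnce(raw);            // correct: "/logs/file%41"
        String twice = decodeOnce(once);          // corrupted: "/logs/fileA"
        System.out.println(once);
        System.out.println(twice);
    }
}
```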
[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883241#comment-15883241 ]

Gera Shegalov commented on YARN-1728:
-------------------------------------

Hi [~yuanbo], thanks for addressing the issue. I see that Guice itself [fixed it|https://github.com/google/guice/pull/860/files] using {{java.net.URI#getPath}}. Let us use it here so the behavior is consistent with newer Guice. I suggest we use:
{code}
decodedPathInfo = URI.create(pathInfo).getPath();
{code}
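The suggested call decodes the percent-encoding exactly once. A self-contained sketch using the path from the issue description:

```java
import java.net.URI;

// java.net.URI#getPath returns the path with percent-encoding decoded once,
// the behavior newer Guice adopted and what this comment suggests reusing.
class UriDecodeDemo {
    static String decodedPath(String pathInfo) {
        return URI.create(pathInfo).getPath();
    }

    public static void main(String[] args) {
        String pathInfo = "/jobhistory/logs/localhost%3A8041"
            + "/container_1391466602060_0011_01_01/job_1391466602060_0011"
            + "/admin/stderr";
        // "%3A" is decoded back to ":" in the nodeId component
        System.out.println(decodedPath(pathInfo));
    }
}
```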
[jira] [Commented] (YARN-4958) The file localization process should allow for wildcards to reduce the application footprint in the state store
[ https://issues.apache.org/jira/browse/YARN-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302955#comment-15302955 ]

Gera Shegalov commented on YARN-4958:
-------------------------------------

Hi [~templedf], no particular comment other than that there is a workaround that can achieve this with what I was suggesting in HADOOP-12747, or programmatically, but it would be nice if it could be done in a more obvious way.

> The file localization process should allow for wildcards to reduce the
> application footprint in the state store
> ----------------------------------------------------------------------
>
>                 Key: YARN-4958
>                 URL: https://issues.apache.org/jira/browse/YARN-4958
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-4958.001.patch, YARN-4958.002.patch,
> YARN-4958.003.patch
>
> When using the -libjars option to add classes to the classpath, every library
> so added is explicitly listed in the {{ContainerLaunchContext}}'s local
> resources even though they're all uploaded to the same directory in HDFS.
> When using tools like Crunch without an uber JAR, or when trying to take
> advantage of the shared cache, the number of libraries can be quite large.
> We've seen many cases where we had to turn down the max number of
> applications to prevent ZK from running out of heap because of the size of
> the state store entries.
> Rather than listing all files independently, this JIRA proposes to have the
> NM allow wildcards in the resource localization paths. Specifically, we
> propose to allow a path to have its final component (name) set to "*", which
> is interpreted by the NM as "download the full directory and link to every
> file in it from the job's working directory." This behavior is the same as
> the current behavior when using -libjars, but avoids explicitly listing every
> file.
> This JIRA does not attempt to provide more general purpose wildcards, such as
> "\*.jar" or "file\*", as having multiple entries for a single directory
> presents numerous logistical issues.
> This JIRA also does not attempt to integrate with the shared cache. That
> work will be left to a future JIRA. Specifically, this JIRA only applies
> when a full directory is uploaded. Currently the shared cache does not
> handle directory uploads.
> This JIRA proposes to allow for wildcards both in the internal processing of
> the -libjars switch and in paths added through the {{Job}} and
> {{DistributedCache}} classes.
> The proposed approach is to treat a path, "dir/\*", as "dir" for purposes of
> all file verification and localization. In the final step, the NM will query
> the localized directory to get a list of the files in "dir" such that each
> can be linked from the job's working directory. Since $PWD/\* is always
> included on the classpath, all JAR files in "dir" will be in the classpath.
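The proposed rule, treating "dir/\*" as "dir" for verification and localization, can be sketched as follows (the helper names are illustrative, not the patch's API):

```java
// Illustrative sketch of the proposed wildcard rule: a resource path whose
// final component is "*" is verified and localized as its parent directory;
// the NM then links every file found in that directory from the job's
// working directory.
class WildcardPath {
    static boolean isWildcard(String resourcePath) {
        return resourcePath.endsWith("/*");
    }

    // path actually verified/downloaded: "dir/*" -> "dir"
    static String localizedPath(String resourcePath) {
        return isWildcard(resourcePath)
            ? resourcePath.substring(0, resourcePath.length() - 2)
            : resourcePath;
    }

    public static void main(String[] args) {
        System.out.println(localizedPath("hdfs://nn/user/libjars/*"));
        System.out.println(localizedPath("hdfs://nn/user/libjars/a.jar"));
    }
}
```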
[jira] [Commented] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env
[ https://issues.apache.org/jira/browse/YARN-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191386#comment-15191386 ] Gera Shegalov commented on YARN-4789: - Thanks Jason, this is indeed the case. The question is whether we can do a small change fast before a more involved MAPREDUCE-6491 is committed? > Provide helpful exception for non-PATH-like conflict with admin.user.env > > > Key: YARN-4789 > URL: https://issues.apache.org/jira/browse/YARN-4789 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: YARN-4789.001.patch > > > Environment variables specified in mapreduce.admin.user.env are supposed to > be paths (class, shell, library) and they can be merged with the > user-provided values. However, it's also possible that the cluster admins > specify some non-PATH-like variable such as JAVA_HOME. In this case if there > is the same variable provided by the user, we'll get a concatenation that > does not make sense and is difficult to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env
[ https://issues.apache.org/jira/browse/YARN-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191384#comment-15191384 ] Gera Shegalov commented on YARN-4789: - The patch throws an exception only when both the user and the admin specify an environment variable that cannot be reconciled via concatenation, as happens with the various *PATHs; this is the intent of option 2. Following option 1 would replace the user env, but that seemed to me to violate the spirit of this conf, which is designed to preserve the admin settings while allowing the user to override them. That's why I thought warning both sides about the misconfiguration is the best course of action here. > Provide helpful exception for non-PATH-like conflict with admin.user.env > > > Key: YARN-4789 > URL: https://issues.apache.org/jira/browse/YARN-4789 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: YARN-4789.001.patch > > > Environment variables specified in mapreduce.admin.user.env are supposed to > be paths (class, shell, library) and they can be merged with the > user-provided values. However, it's also possible that the cluster admins > specify some non-PATH-like variable such as JAVA_HOME. In this case if there > is the same variable provided by the user, we'll get a concatenation that > does not make sense and is difficult to debug.
[jira] [Commented] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env
[ https://issues.apache.org/jira/browse/YARN-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190773#comment-15190773 ] Gera Shegalov commented on YARN-4789: - I see the following options to deal with it: # silently ignore/replace the user-provided value by the one in admin.env # inform the user that the variable is provided by the cluster admins. 001 patch for the latter > Provide helpful exception for non-PATH-like conflict with admin.user.env > > > Key: YARN-4789 > URL: https://issues.apache.org/jira/browse/YARN-4789 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: YARN-4789.001.patch > > > Environment variables specified in mapreduce.admin.user.env are supposed to > be paths (class, shell, library) and they can be merged with the > user-provided values. However, it's also possible that the cluster admins > specify some non-PATH-like variable such as JAVA_HOME. In this case if there > is the same variable provided by the user, we'll get a concatenation that > does not make sense and is difficult to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
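The two options can be contrasted with a small sketch of option 2. The helper below is hypothetical (the actual patch lives in the YARN/MR launch-context code), and the set of mergeable PATH-like variables is an assumption for illustration:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: PATH-like variables from admin.user.env are merged
// with the user's value by concatenation; for anything else a conflicting
// user value raises a descriptive error instead of silently producing a
// nonsense value like JAVA_HOME=/admin/jdk:/user/jdk.
public class AdminEnvMerge {
  // assumed list of variables that are safe to concatenate
  static final Set<String> MERGEABLE =
      new HashSet<>(Arrays.asList("PATH", "CLASSPATH", "LD_LIBRARY_PATH"));

  public static Map<String, String> merge(Map<String, String> adminEnv,
                                          Map<String, String> userEnv) {
    Map<String, String> out = new HashMap<>(adminEnv);
    for (Map.Entry<String, String> e : userEnv.entrySet()) {
      String key = e.getKey();
      String adminVal = out.get(key);
      if (adminVal == null) {
        out.put(key, e.getValue());                      // no conflict
      } else if (MERGEABLE.contains(key)) {
        out.put(key, adminVal + ":" + e.getValue());     // paths concatenate
      } else {
        throw new IllegalArgumentException(
            key + " is set by the cluster admins to '" + adminVal
            + "'; remove it from the user environment or ask the admins"
            + " to unset it");
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> admin = new HashMap<>();
    admin.put("JAVA_HOME", "/admin/jdk");
    Map<String, String> user = new HashMap<>();
    user.put("JAVA_HOME", "/user/jdk");
    try {
      merge(admin, user);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());  // the helpful exception of option 2
    }
  }
}
```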
[jira] [Updated] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env
[ https://issues.apache.org/jira/browse/YARN-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-4789: Attachment: YARN-4789.001.patch > Provide helpful exception for non-PATH-like conflict with admin.user.env > > > Key: YARN-4789 > URL: https://issues.apache.org/jira/browse/YARN-4789 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: YARN-4789.001.patch > > > Environment variables specified in mapreduce.admin.user.env are supposed to > be paths (class, shell, library) and they can be merged with the > user-provided values. However, it's also possible that the cluster admins > specify some non-PATH-like variable such as JAVA_HOME. In this case if there > is the same variable provided by the user, we'll get a concatenation that > does not make sense and is difficult to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env
Gera Shegalov created YARN-4789: --- Summary: Provide helpful exception for non-PATH-like conflict with admin.user.env Key: YARN-4789 URL: https://issues.apache.org/jira/browse/YARN-4789 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.2 Reporter: Gera Shegalov Assignee: Gera Shegalov Environment variables specified in mapreduce.admin.user.env are supposed to be paths (class, shell, library) and they can be merged with the user-provided values. However, it's also possible that the cluster admins specify some non-PATH-like variable such as JAVA_HOME. In this case if there is the same variable provided by the user, we'll get a concatenation that does not make sense and is difficult to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071392#comment-15071392 ] Gera Shegalov commented on YARN-2934: - +1 for YARN-2934.v2.004.patch. There is an extra space in "Error files : "; I took the liberty of fixing it myself. > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, > YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, > YARN-2934.v2.001.patch, YARN-2934.v2.002.patch, YARN-2934.v2.003.patch, > YARN-2934.v2.004.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty.
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070337#comment-15070337 ] Gera Shegalov commented on YARN-2934: - Hi [~Naganarasimha]. Thanks for updating the patch. One thing we have not addressed from my previous comments is capping the buffer size, but I now think it's good enough because we have a good small default for the tail NM_CONTAINER_STDERR_BYTES. Still, please rename: {code} - FileStatus[] listStatus = fileSystem + FileStatus[] errorStatuses = fileSystem {code} or similar; it's an array of statuses, not the status of a list. Let us have a space after ',' and a new line in: {code} - .append(StringUtils.arrayToString(errorFileNames)).append(". "); + .append(StringUtils.join(", ", errorFileNames)).append(".\n"); {code} and fix the test code accordingly. The method verifyTailErrorLogOnContainerExit can/should be private; same for the ContainerExitHandler class. Assume.assumeTrue(Shell.LINUX); should be Assume.assumeFalse(Shell.WINDOWS || Shell.OTHER); but actually, why do we need this? The test seems to be platform-independent. Assert.assertNotNull(exitEvent.getDiagnosticInfo()); seems redundant because the other asserts already imply it. I suggest LOG.info-ing the diagnostics instead to make the test log more useful. > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, > YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, > YARN-2934.v2.001.patch, YARN-2934.v2.002.patch, YARN-2934.v2.003.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty. 
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061768#comment-15061768 ] Gera Shegalov commented on YARN-2934: - -1 on manual regexes in favor of code reuse. 99.9% of YARN users will never change this conf, and the simple globs I was suggesting already cover even AppMaster.stderr. In ContainerLaunch#getErrorLogTail, get rid of {code} if (containerLogDir == null) { return null; } {code} since it cannot be null. > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, > YARN-2934.v1.006.patch, YARN-2934.v1.007.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty.
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061847#comment-15061847 ] Gera Shegalov commented on YARN-2934: - Use RawLocalFileSystem, we don't need the checksumming version: FileSystem fileSystem = FileSystem.getLocal(conf).getRaw() > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, > YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063138#comment-15063138 ] Gera Shegalov commented on YARN-2934: - Thanks for the latest patch. Good to see the patch lose 3 kB; most of all, there are no more changes to the common Configuration class. One checkstyle issue, an 80-column warning, comes from the patch around: {code} long tailSizeInBytes = conf.getLong( YarnConfiguration.NM_CONTAINER_ERROR_FILE_TAIL_SIZE_IN_BYTES, YarnConfiguration.DEFAULT_NM_CONTAINER_ERROR_FILE_TAIL_SIZE_IN_BYTES); {code} Those are pretty long names. Can we do container.stderr.tail.bytes and NM_CONTAINER_STDERR_BYTES, plus the corresponding default? Having stderr in the name also helps users understand which error file is meant in 99% of the cases. The same goes for container.stderr.pattern. I still don't see any value in this, please drop it: {code} if (listStatus.length > 1) { LOG.error("Multiple files in " + containerLogDir + ", seems to match the error file name pattern configured. " + Arrays.toString(listStatus)); } {code} Let us not guard the tail read with {code} if (fileSize != 0) { {code} since there is value in seeing already on the client side that the file is empty. Instead of {code} IOUtils.closeStream(errorFileIS) {code} call cleanup so we can pass the logger: {code} IOUtils.cleanup(LOG, errorFileIS) {code} Since trunk is on JDK7 minimum, we can drop the constant UTF_8 and use {code} new String(tailBytes, StandardCharsets.UTF_8) {code} listStatus is not an intuitive variable name; maybe use errFileStatus. Obviously I meant tailSizeInBytes, thanks for paying attention. Agree that the RLFS file status toString might look too ugly; we can use FileUtil.stat2Paths or add a loop here to extract just the last path component. 
Also realizing that we should have a low cap on the tail size to prevent a misconfiguration from knocking out the NM with an OOM on container failures, since we do: {code} byte[] tailBytes = new byte[bufferSize]; {code} One can easily see why I initially mistook tailBytes for an int; it should be named along the lines of {code}tailBuffer{code}. > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, > YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, > YARN-2934.v2.001.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty.
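The capped tail read discussed above can be sketched with plain java.io instead of the Hadoop FileSystem API. The class name and the hard-cap value are assumptions for the example, not part of the patch:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of tailing an error file: read at most tailSize bytes from the end,
// with a hard cap so a misconfigured tail size cannot allocate an
// NM-killing buffer.
public class StderrTail {
  static final long HARD_CAP = 1024 * 1024;  // assumed 1 MiB upper bound

  public static String tail(Path file, long tailSize) throws IOException {
    long capped = Math.min(tailSize, HARD_CAP);
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      long len = raf.length();
      int toRead = (int) Math.min(capped, len);   // never larger than the file
      byte[] tailBuffer = new byte[toRead];        // named per the review comment
      raf.seek(len - toRead);                      // position at the tail
      raf.readFully(tailBuffer);
      return new String(tailBuffer, StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws IOException {
    Path p = Files.createTempFile("stderr", ".log");
    Files.write(p, "/bin/bash: no such file or directory\n"
        .getBytes(StandardCharsets.UTF_8));
    System.out.print(tail(p, 4096));
  }
}
```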
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061908#comment-15061908 ] Gera Shegalov commented on YARN-2934: - That should go to the exception message {code} 422 } else if (listStatus.length > 1) { 423 LOG.warn("Multiple files in " + containerLogDir 424 + ", seems to match the error file name pattern configured "); 425 } {code} Don't do branching, pass the string builder diagnosticInfo and do something like {code} diagnosticInfo .append("Error files: ") .append(Arrays.toString(listStatus)) .append("\n") .append("Last ").append(tailBytes).append(" bytes of ").append(listStatus[0]) .append(new String(tailBytes, UTF_8)); {code} > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, > YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063217#comment-15063217 ] Gera Shegalov commented on YARN-2934: - the message looks good to me. > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, > YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, > YARN-2934.v2.001.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063227#comment-15063227 ] Gera Shegalov commented on YARN-2934: - Regarding my comment about the user, I meant the YARN app user. Users don't look at the NM logs; they look at the exceptions in the webUI and on the client side. If the exception says {code} Container exited with a non-zero exit code 127. Error file(s): [error.log, stderr.1, stderr.2] Last 4096 bytes of error.log : /bin/bash: /no/jvm/here/bin/java: No such file or directory {code} the user will know to also check those files. > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, > YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, > YARN-2934.v2.001.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty.
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063486#comment-15063486 ] Gera Shegalov commented on YARN-2934: - Minor repetition is not big of a deal, IMO. The reason I thought of printing file statuses is that you see the file size. Which brings us to the following point in the fanciness area. Right now we are blindly grabbing file 0. It would however make much more sense to grab the most recent (highest mtime) non-empty file. > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, > YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, > YARN-2934.v2.001.patch, YARN-2934.v2.002.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
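The selection rule proposed above (prefer the most recently modified non-empty file over blindly taking element 0) can be sketched with plain java.nio; the actual patch would operate on Hadoop FileStatus objects instead:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Illustrative sketch: among the candidate error files, pick the
// non-empty one with the highest modification time.
public class NewestNonEmpty {
  public static Path pick(Iterable<Path> candidates) throws IOException {
    Path best = null;
    FileTime bestTime = null;
    for (Path p : candidates) {
      if (Files.size(p) == 0) {
        continue;                      // skip empty files entirely
      }
      FileTime t = Files.getLastModifiedTime(p);
      if (bestTime == null || t.compareTo(bestTime) > 0) {
        best = p;
        bestTime = t;
      }
    }
    return best;                       // null when every candidate is empty
  }
}
```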
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059943#comment-15059943 ] Gera Shegalov commented on YARN-2934: - Hi [~Naganarasimha], please make sure that the patch does not introduce new problems: both checkstyle and findbugs report issues related to the patch; check the Hadoop QA comment above. Keep addressing the newly introduced issues without waiting for review to simplify the review process. I suggest using globs instead of regexes, so you can simply call FileSystem#globStatus. The path pattern could be something like {code}{*stderr*,*STDERR*}{code} or maybe {code}{*err,*ERR,*out,*OUT}{code}. I'd rather have a longer config value than add more code to make patterns case-insensitive; in practice we mostly need stderr. Not sure how fancy we need to be when multiple log files match the pattern, but we should at least mention to the user that there are more files to look at. In general, don't try to optimize for the failure case. Things like {code} private static long tailSizeInBytes = -1; {code} look like a bug; simply get it from conf exactly when it's needed. > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, > YARN-2934.v1.006.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty.
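The suggested glob can be sanity-checked locally with java.nio's PathMatcher, which accepts a similar {a,b} alternation syntax to Hadoop's FileSystem#globStatus. This is only an illustration of which file names the pattern would select:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Check which file names the proposed error-file glob matches.
public class ErrGlob {
  public static boolean matches(String glob, String fileName) {
    PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    return m.matches(Paths.get(fileName));
  }

  public static void main(String[] args) {
    String pattern = "{*stderr*,*STDERR*}";
    for (String name : new String[] {"stderr", "AppMaster.stderr", "stderr.1", "syslog"}) {
      System.out.println(name + " -> " + matches(pattern, name));
    }
  }
}
```

Note that `*stderr*` already covers names like AppMaster.stderr and stderr.1, which is the point made in the comment above about globs covering the common cases without extra case-insensitivity code.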
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048280#comment-15048280 ] Gera Shegalov commented on YARN-2934: - Thanks [~Naganarasimha]! I skimmed the patch; it is in pretty good shape. I aim to give you more detailed feedback over the next few days. > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty.
[jira] [Resolved] (YARN-683) Class MiniYARNCluster not found when starting the minicluster
[ https://issues.apache.org/jira/browse/YARN-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov resolved YARN-683. Resolution: Duplicate Closing as a dup because HADOOP-9891 now documents this workaround Class MiniYARNCluster not found when starting the minicluster - Key: YARN-683 URL: https://issues.apache.org/jira/browse/YARN-683 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.0.4-alpha Environment: MacOSX 10.8.3 - Java 1.6.0_45 Reporter: Rémy SAISSY Starting the minicluster with the following command line: bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.4-alpha-tests.jar minicluster -format Fails for MiniYARNCluster with the following error: 13/05/14 17:06:58 INFO hdfs.MiniDFSCluster: Cluster is active 13/05/14 17:06:58 INFO mapreduce.MiniHadoopClusterManager: Started MiniDFSCluster -- namenode on port 55205 java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/server/MiniYARNCluster at org.apache.hadoop.mapreduce.MiniHadoopClusterManager.start(MiniHadoopClusterManager.java:170) at org.apache.hadoop.mapreduce.MiniHadoopClusterManager.run(MiniHadoopClusterManager.java:129) at org.apache.hadoop.mapreduce.MiniHadoopClusterManager.main(MiniHadoopClusterManager.java:314) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144) at org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:115) at org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:123) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.server.MiniYARNCluster at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) ... 16 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625060#comment-14625060 ] Gera Shegalov commented on YARN-2934: - Hi [~Naganarasimha], yes, I was thinking the same: we should try to do it in Java land. I'd prefer using RawLocalFileSystem#read(buf, off, len) in order not to mix in the java.io API. Since the NM webUI can read logs, we should have no problems accessing them from the NM JVM. Improve handling of container's stderr --- Key: YARN-2934 URL: https://issues.apache.org/jira/browse/YARN-2934 Project: Hadoop YARN Issue Type: Improvement Reporter: Gera Shegalov Assignee: Naganarasimha G R Priority: Critical Most YARN applications redirect stderr to some file. That's why when container launch fails with {{ExitCodeException}} the message is empty.
[jira] [Moved] (YARN-3917) getResourceCalculatorPlugin for the default should intercept all excpetions
[ https://issues.apache.org/jira/browse/YARN-3917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov moved HADOOP-1 to YARN-3917: -- Affects Version/s: (was: 2.8.0) 2.8.0 Target Version/s: 2.8.0 (was: 2.8.0) Key: YARN-3917 (was: HADOOP-1) Project: Hadoop YARN (was: Hadoop Common) getResourceCalculatorPlugin for the default should intercept all excpetions --- Key: YARN-3917 URL: https://issues.apache.org/jira/browse/YARN-3917 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: HADOOP-1.001.patch Since the user has not configured a specific plugin, any problems with the default resource calculator instantiation should be ignored. {code} 2015-07-10 08:16:18,445 INFO org.apache.hadoop.service.AbstractService: Service containers-monitor failed in state INITED; cause: java.lang.UnsupportedOperationException: Could not determine OS java.lang.UnsupportedOperationException: Could not determine OS at org.apache.hadoop.util.SysInfo.newInstance(SysInfo.java:43) at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.init(ResourceCalculatorPlugin.java:37) at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getResourceCalculatorPlugin(ResourceCalculatorPlugin.java:160) at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.serviceInit(ContainersMonitorImpl.java:108) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:249) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:312) at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:547) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:595) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3917) getResourceCalculatorPlugin for the default should intercept all excpetions
[ https://issues.apache.org/jira/browse/YARN-3917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623663#comment-14623663 ] Gera Shegalov commented on YARN-3917: - Thanks [~chris.douglas] for the review. Moved the JIRA to YARN because {{ResourceCalculatorPlugin.java}} is in hadoop-yarn-common. getResourceCalculatorPlugin for the default should intercept all excpetions --- Key: YARN-3917 URL: https://issues.apache.org/jira/browse/YARN-3917 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: HADOOP-1.001.patch Since the user has not configured a specific plugin, any problems with the default resource calculator instantiation should be ignored. {code} 2015-07-10 08:16:18,445 INFO org.apache.hadoop.service.AbstractService: Service containers-monitor failed in state INITED; cause: java.lang.UnsupportedOperationException: Could not determine OS java.lang.UnsupportedOperationException: Could not determine OS at org.apache.hadoop.util.SysInfo.newInstance(SysInfo.java:43) at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.init(ResourceCalculatorPlugin.java:37) at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getResourceCalculatorPlugin(ResourceCalculatorPlugin.java:160) at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.serviceInit(ContainersMonitorImpl.java:108) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:249) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:312) at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:547) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:595) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3917) getResourceCalculatorPlugin for the default should intercept all exceptions
[ https://issues.apache.org/jira/browse/YARN-3917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-3917: Summary: getResourceCalculatorPlugin for the default should intercept all exceptions (was: getResourceCalculatorPlugin for the default should intercept all excpetions) getResourceCalculatorPlugin for the default should intercept all exceptions --- Key: YARN-3917 URL: https://issues.apache.org/jira/browse/YARN-3917 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: HADOOP-1.001.patch Since the user has not configured a specific plugin, any problems with the default resource calculator instantiation should be ignored. {code} 2015-07-10 08:16:18,445 INFO org.apache.hadoop.service.AbstractService: Service containers-monitor failed in state INITED; cause: java.lang.UnsupportedOperationException: Could not determine OS java.lang.UnsupportedOperationException: Could not determine OS at org.apache.hadoop.util.SysInfo.newInstance(SysInfo.java:43) at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.init(ResourceCalculatorPlugin.java:37) at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getResourceCalculatorPlugin(ResourceCalculatorPlugin.java:160) at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.serviceInit(ContainersMonitorImpl.java:108) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:249) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:312) at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:547) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:595) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
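The fix direction described above can be sketched in a few lines. This is an illustrative sketch, not the actual Hadoop patch: the class and method names ({{DefaultPluginSketch}}, {{tryCreateDefault}}) are hypothetical; only the idea comes from the JIRA — when the user has not configured a plugin explicitly, intercept every exception from the default instantiation and fall back to null instead of failing service init.

```java
import java.util.function.Supplier;

/**
 * Hypothetical sketch: only a user-configured plugin should fail init
 * loudly; the default one degrades gracefully to null.
 */
public class DefaultPluginSketch {
  public static <T> T tryCreateDefault(Supplier<T> ctor) {
    try {
      return ctor.get();
    } catch (Throwable t) {
      // Intercept ALL exceptions, e.g. "Could not determine OS".
      System.err.println("Default resource calculator unavailable: " + t);
      return null; // callers treat null as "no resource calculation"
    }
  }
}
```

With this shape, an unsupported OS merely disables resource monitoring rather than aborting the whole NodeManager init chain shown in the stack trace above.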
[jira] [Updated] (YARN-3768) ArrayIndexOutOfBoundsException with empty environment variables
[ https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-3768: Description: Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range exception occurs if an environment variable is encountered without a value. {code} java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.yarn.util.Apps.setEnvFromInputString(Apps.java:80) {code} I believe this occurs because java will not return empty strings from the split method. Similar to this http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values was: Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range exception occurs if an environment variable is encountered without a value. I believe this occurs because java will not return empty strings from the split method. Similar to this http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values ArrayIndexOutOfBoundsException with empty environment variables --- Key: YARN-3768 URL: https://issues.apache.org/jira/browse/YARN-3768 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.5.0 Reporter: Joe Ferner Assignee: zhihai xu Attachments: YARN-3768.000.patch, YARN-3768.001.patch, YARN-3768.002.patch, YARN-3768.003.patch, YARN-3768.004.patch Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range exception occurs if an environment variable is encountered without a value. {code} java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hadoop.yarn.util.Apps.setEnvFromInputString(Apps.java:80) {code} I believe this occurs because java will not return empty strings from the split method. Similar to this http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values -- This message was sent by Atlassian JIRA (v6.3.4#6332)
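The split behavior blamed in the description is easy to reproduce; a minimal demonstration (variable names are illustrative, not from {{Apps.java}}):

```java
public class SplitDemo {
  public static void main(String[] args) {
    // "A=B" splits into ["A", "B"] as expected.
    System.out.println("A=B".split("=").length);     // prints 2

    // "A=" splits into ["A"] only: Java's split() drops trailing empty
    // strings, so indexing parts[1] throws ArrayIndexOutOfBoundsException.
    System.out.println("A=".split("=").length);      // prints 1

    // A negative limit keeps the trailing empty string.
    System.out.println("A=".split("=", -1).length);  // prints 2
  }
}
```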
[jira] [Updated] (YARN-3768) ArrayIndexOutOfBoundsException with empty environment variables
[ https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-3768: Summary: ArrayIndexOutOfBoundsException with empty environment variables (was: Index out of range exception with environment variables without values) ArrayIndexOutOfBoundsException with empty environment variables --- Key: YARN-3768 URL: https://issues.apache.org/jira/browse/YARN-3768 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.5.0 Reporter: Joe Ferner Assignee: zhihai xu Attachments: YARN-3768.000.patch, YARN-3768.001.patch, YARN-3768.002.patch, YARN-3768.003.patch, YARN-3768.004.patch Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range exception occurs if an environment variable is encountered without a value. I believe this occurs because java will not return empty strings from the split method. Similar to this http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3768) Index out of range exception with environment variables without values
[ https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-3768: Attachment: YARN-3768.003.patch Thanks for reviewing the patch, [~zxu]! bq. If the input is a=b=c, it saves Env variable a with value b. Is it correct? Correct, and I agree it does not look like the behavior we want. I think the right behavior is to accept any value between the first {{=}} and the next {{,}}. The value should be {{b=c}} in your example. bq. I also noticed the patch will discard Env Variable with empty string value. I am ok with it. I think it might be desirable sometimes to clear a variable that is set globally. Let us allow it. 003 patch attached! Index out of range exception with environment variables without values -- Key: YARN-3768 URL: https://issues.apache.org/jira/browse/YARN-3768 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.5.0 Reporter: Joe Ferner Assignee: zhihai xu Attachments: YARN-3768.000.patch, YARN-3768.001.patch, YARN-3768.002.patch, YARN-3768.003.patch Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range exception occurs if an environment variable is encountered without a value. I believe this occurs because java will not return empty strings from the split method. Similar to this http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values -- This message was sent by Atlassian JIRA (v6.3.4#6332)
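The semantics settled on above — the value is everything between the first {{=}} and the next {{,}}, and empty values are allowed so a globally set variable can be cleared — can be sketched as follows. This is a simplified stand-in, not the code in {{Apps.setEnvFromInputString}}; values containing commas are out of scope for this toy parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EnvParser {
  /** Parses "k1=v1,k2=v2" where a value may itself contain '='. */
  public static Map<String, String> parse(String input) {
    Map<String, String> env = new LinkedHashMap<>();
    for (String pair : input.split(",")) {
      // Limit 2: split only on the FIRST '=', so "a=b=c" -> ("a", "b=c").
      // A positive limit also keeps the trailing empty string of "x=".
      String[] kv = pair.split("=", 2);
      // Empty value ("x=") is allowed, e.g. to clear a global variable.
      env.put(kv[0], kv.length > 1 ? kv[1] : "");
    }
    return env;
  }
}
```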
[jira] [Commented] (YARN-3768) Index out of range exception with environment variables without values
[ https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606960#comment-14606960 ] Gera Shegalov commented on YARN-3768: - Good catch, [~zxu]. I rushed on the way home and forgot to regenerate the patch with the {{*}} change after making it locally. +1 pending Jenkins Index out of range exception with environment variables without values -- Key: YARN-3768 URL: https://issues.apache.org/jira/browse/YARN-3768 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.5.0 Reporter: Joe Ferner Assignee: zhihai xu Attachments: YARN-3768.000.patch, YARN-3768.001.patch, YARN-3768.002.patch, YARN-3768.003.patch, YARN-3768.004.patch Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range exception occurs if an environment variable is encountered without a value. I believe this occurs because java will not return empty strings from the split method. Similar to this http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3768) Index out of range exception with environment variables without values
[ https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-3768: Attachment: YARN-3768.002.patch You are right [~zxu], and I actually meant to combine matching k=v pairs and capturing k and v in one shot. Index out of range exception with environment variables without values -- Key: YARN-3768 URL: https://issues.apache.org/jira/browse/YARN-3768 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.5.0 Reporter: Joe Ferner Assignee: zhihai xu Attachments: YARN-3768.000.patch, YARN-3768.001.patch, YARN-3768.002.patch Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range exception occurs if an environment variable is encountered without a value. I believe this occurs because java will not return empty strings from the split method. Similar to this http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3768) Index out of range exception with environment variables without values
[ https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604372#comment-14604372 ] Gera Shegalov commented on YARN-3768: - 002 attached, with this idea and proper name validation. Index out of range exception with environment variables without values -- Key: YARN-3768 URL: https://issues.apache.org/jira/browse/YARN-3768 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.5.0 Reporter: Joe Ferner Assignee: zhihai xu Attachments: YARN-3768.000.patch, YARN-3768.001.patch, YARN-3768.002.patch Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range exception occurs if an environment variable is encountered without a value. I believe this occurs because java will not return empty strings from the split method. Similar to this http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3768) Index out of range exception with environment variables without values
[ https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596443#comment-14596443 ] Gera Shegalov commented on YARN-3768: - Instead of executing two regexes: first directly via Pattern p = Pattern.compile(Shell.getEnvironmentVariableRegex()) and then via split can we simply match via a single regex? we can use a capture group to get the value. Index out of range exception with environment variables without values -- Key: YARN-3768 URL: https://issues.apache.org/jira/browse/YARN-3768 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.5.0 Reporter: Joe Ferner Assignee: zhihai xu Attachments: YARN-3768.000.patch, YARN-3768.001.patch Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range exception occurs if an environment variable is encountered without a value. I believe this occurs because java will not return empty strings from the split method. Similar to this http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values -- This message was sent by Atlassian JIRA (v6.3.4#6332)
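The single-regex idea proposed above can be illustrated like this. The pattern below is a simplified stand-in for {{Shell.getEnvironmentVariableRegex()}}: it validates the variable name and captures its (possibly empty) value in one pass, with no second split.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OnePassEnvMatcher {
  // Group 1: variable name (validated); group 2: value, possibly empty.
  private static final Pattern KV =
      Pattern.compile("([A-Za-z_][A-Za-z0-9_]*)=([^,]*)");

  /** Extracts all k=v pairs from a comma-separated env string in one scan. */
  public static Map<String, String> parse(String s) {
    Map<String, String> env = new LinkedHashMap<>();
    Matcher m = KV.matcher(s);
    while (m.find()) {
      env.put(m.group(1), m.group(2));
    }
    return env;
  }
}
```

Because the value group is `[^,]*`, an entry such as `JAVA_HOME=/no/jvm/here` is captured whole and `EMPTY=` yields an empty string instead of an out-of-range index.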
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524159#comment-14524159 ] Gera Shegalov commented on YARN-2893: - Thanks for updating the patch [~zxu]. I verified with HADOOP-11889, which ignores imports, that the long import line is the only remaining violation, and it is a non-issue. +1 for 005 AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch, YARN-2893.003.patch, YARN-2893.004.patch, YARN-2893.005.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3568) TestAMRMTokens should use some random port
Gera Shegalov created YARN-3568: --- Summary: TestAMRMTokens should use some random port Key: YARN-3568 URL: https://issues.apache.org/jira/browse/YARN-3568 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.0 Reporter: Gera Shegalov Since the default port is used for yarn.resourcemanager.scheduler.address, if we already run a pseudo-distributed cluster on the same development machine, the test fails like this: {code} testMasterKeyRollOver[0](org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens) Time elapsed: 1.511 sec ERROR! org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8030] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:444) at sun.nio.ch.Net.bind(Net.java:436) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.apache.hadoop.ipc.Server.bind(Server.java:413) at org.apache.hadoop.ipc.Server$Listener.init(Server.java:590) at org.apache.hadoop.ipc.Server.init(Server.java:2340) at org.apache.hadoop.ipc.RPC$Server.init(RPC.java:945) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.init(ProtobufRpcEngine.java:534) at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:509) at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:787) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.serviceStart(ApplicationMasterService.java:140) at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:586) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:996) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1037) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1033) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1033) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1073) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens.testMasterKeyRollOver(TestAMRMTokens.java:235) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
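The standard trick the summary asks for — let the OS pick a free ephemeral port instead of hard-coding the default 8030 — looks roughly like this. A minimal sketch, not the actual TestAMRMTokens change; the conf key string is the one quoted in the report above.

```java
import java.io.IOException;
import java.net.ServerSocket;

public class FreePortFinder {
  /** Binds port 0 so the kernel assigns an unused ephemeral port. */
  public static int findFreePort() throws IOException {
    try (ServerSocket probe = new ServerSocket(0)) {
      probe.setReuseAddress(true);
      return probe.getLocalPort();
    }
  }

  public static void main(String[] args) throws IOException {
    int port = findFreePort();
    // e.g. conf.set("yarn.resourcemanager.scheduler.address",
    //               "0.0.0.0:" + port);
    System.out.println("picked port " + port);
  }
}
```

There is a small race between closing the probe socket and the server rebinding the port, which is usually acceptable in tests and avoids the BindException against a pseudo-distributed cluster on the same machine.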
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520879#comment-14520879 ] Gera Shegalov commented on YARN-2893: - Hi [~zxu], thanks for updating the patch. I believe the remaining checkstyle violation comes from double indentation in the catch block:
{code}
+    } catch (Exception e) {
        LOG.warn("Unable to parse credentials.", e);
        // Sending APP_REJECTED is fine, since we assume that the
        // RMApp is in NEW state and thus we haven't yet informed the
        // scheduler about the existence of the application
        assert application.getState() == RMAppState.NEW;
{code}
It will go away once you make it 2-space instead of the 4-space indentation that arose because you moved code around. AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch, YARN-2893.003.patch, YARN-2893.004.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513583#comment-14513583 ] Gera Shegalov commented on YARN-2893: - Thanks for the 003 patch, [~zxu]! I agree that validating credentials in either case is a good idea. LGTM. Nits: can you take care of the 80-column violations in your test methods. AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch, YARN-2893.003.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514536#comment-14514536 ] Gera Shegalov commented on YARN-3491: - Agreed, reducing the number of system calls is a good idea. Using JNI instead of ls can be handled in a separate JIRA. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner kills localizer before localizing all resources
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513247#comment-14513247 ] Gera Shegalov commented on YARN-3464: - We might need to tweak checkstyle rules. There are a bunch of 80-column-limit violations that seem to come from the import statements. Race condition in LocalizerRunner kills localizer before localizing all resources - Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Fix For: 2.8.0 Attachments: YARN-3464.000.patch, YARN-3464.001.patch Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when pending list for LocalizerResourceRequestEvent is empty. {code} } else if (pending.isEmpty()) { action = LocalizerAction.DIE; } {code} If a LocalizerResourceRequestEvent is added after LocalizerRunner kill the ContainerLocalizer due to empty pending list, this LocalizerResourceRequestEvent will never be handled. Without ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay at LOCALIZING state, until the container is killed by AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513509#comment-14513509 ] Gera Shegalov commented on YARN-3491: - We should switch to {{io.nativeio.NativeIO.POSIX#getFstat}} as implementation in {{RawLocalFileSystem}} to get rid of shell-based implementation for FileStatus. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
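The gist of the comment above is replacing a fork/exec of a shell per file-status check with a direct in-process stat call. {{NativeIO.POSIX#getFstat}} is Hadoop-internal; as a portable stand-in for illustration only, plain {{java.nio}} fetches the same metadata with a single system call and no subprocess (note: {{PosixFileAttributes}} is unavailable on Windows).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFileAttributes;

public class StatWithoutShell {
  /** One in-process stat-style call; no ~10 ms fork/exec per check. */
  public static PosixFileAttributes stat(Path p) throws IOException {
    return Files.readAttributes(p, PosixFileAttributes.class);
  }

  public static void main(String[] args) throws IOException {
    Path p = Files.createTempFile("stat-demo", ".tmp");
    PosixFileAttributes a = stat(p);
    System.out.println(a.owner().getName() + " " + a.permissions());
    Files.delete(p);
  }
}
```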
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510163#comment-14510163 ] Gera Shegalov commented on YARN-2893: - Hi [~zxu], for me personally it's easier to review if you simply make the change, and upload a new patch. The additional benefit is that we'll see hopefully if our assumptions are validated by unit tests. AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395255#comment-14395255 ] Gera Shegalov commented on YARN-2893: - Thanks [~zxu] for the patch, and apologies for the delay. I skimmed over the patch, and it looks good overall. Can you keep your logic in {{RMAppManager#submitApplication}} with parseCredentials but put it back under {{if (UserGroupInformation.isSecurityEnabled()) {}} AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345950#comment-14345950 ] Gera Shegalov commented on YARN-2893: - Hi [~zxu], it's great that you make progress on this JIRA. Any chance you can capture the failure scenarios in some unit test so we can relate it better to the real failures we are seeing. AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346226#comment-14346226 ] Gera Shegalov commented on YARN-2893: - bq. Also I find out a cascading patch to fix the credentials corruption at the jobClient. https://github.com/Cascading/cascading/commit/45b33bb864172486ac43782a4d13329312d01c0e I scanned all reports collected over last months, and the current cluster logs. I can confirm all affected jobs were the ones that still had Cascading 2.5.4-based dependency. Thanks a lot for pointing it out [~zxu]! AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268767#comment-14268767 ] Gera Shegalov commented on YARN-2934: - bq. Given this, even the tailed stderr is not useful in such a situation. If the app-page ages out, where will the user see this additional diagnostic message that we tail out of logs? It will be in the client output that I showed in the above comments. In our infrastructure, a failed job will generate an alert email containing the client log (or link to it). Improve handling of container's stderr --- Key: YARN-2934 URL: https://issues.apache.org/jira/browse/YARN-2934 Project: Hadoop YARN Issue Type: Improvement Reporter: Gera Shegalov Assignee: Naganarasimha G R Priority: Critical Most YARN applications redirect stderr to some file. That's why when container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268356#comment-14268356 ] Gera Shegalov commented on YARN-2893: - Is there a significant fraction of other type of jobs on your clusters ? AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268334#comment-14268334 ] Gera Shegalov commented on YARN-2893: - Hi [~ajsquared], what type of jobs are you seeing this with? I think almost all failures for us are Scalding/Cascading jobs, which made me think that it has to do with their multithreaded job submission code. AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268712#comment-14268712 ] Gera Shegalov commented on YARN-2934: - Yes it's related, but not exclusive to AM (try -Dmapreduce.map.env=JAVA_HOME=/no/jvm/here). It's just more severe with AM. cat is not the point. Getting the real diagnostics with something is, +1 for using tail. The pointer to the tracking page can be of little value for a busy cluster. The RMApp is likely to age out by the time the user gets to look at it, and there is no JHS entry because the AM crashed. It would be better to mention the nodeAddress as well, in addition to containerId to be used with 'yarn logs' Improve handling of container's stderr --- Key: YARN-2934 URL: https://issues.apache.org/jira/browse/YARN-2934 Project: Hadoop YARN Issue Type: Improvement Reporter: Gera Shegalov Assignee: Naganarasimha G R Priority: Critical Most YARN applications redirect stderr to some file. That's why when container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
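The "tail, not cat" suggestion above only needs a bounded buffer so a huge stderr file cannot flood the diagnostics. A hedged sketch; the class and method names are illustrative, not the eventual YARN-2934 patch:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;

public class LogTailer {
  /** Returns at most the last maxLines lines read from the reader. */
  public static Deque<String> tail(Reader in, int maxLines) throws IOException {
    Deque<String> last = new ArrayDeque<>(maxLines);
    try (BufferedReader br = new BufferedReader(in)) {
      String line;
      while ((line = br.readLine()) != null) {
        if (last.size() == maxLines) {
          last.removeFirst(); // drop the oldest line: sliding window
        }
        last.addLast(line);
      }
    }
    return last;
  }
}
```

The resulting window would be appended to the container diagnostics, ideally alongside the nodeAddress and containerId so the user can run 'yarn logs' even after the app page ages out.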
[jira] [Updated] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-2934: Priority: Critical (was: Major) Improve handling of container's stderr --- Key: YARN-2934 URL: https://issues.apache.org/jira/browse/YARN-2934 Project: Hadoop YARN Issue Type: Improvement Reporter: Gera Shegalov Assignee: Naganarasimha G R Priority: Critical Most YARN applications redirect stderr to some file. That's why when container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267377#comment-14267377 ] Gera Shegalov commented on YARN-2934: - ContainerLaunchContext is meant for a ContainerExecutor in general. In the LCE case the logs may only be readable by the app user. To make it robust we can simply append the catting to the supplied command that runs in the executor. This is a hacky version of it, disregarding OS diversity. It presumes that the stderr log has stderr in the file name. {code} $ git diff diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java index a87238d..8ea2560 100644 --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java @@ -190,6 +190,11 @@ public Integer call() { // TODO: Should we instead work via symlinks without this grammar? 
newCmds.add(expandEnvironment(str, containerLogDir)); } + newCmds.add("||"); + newCmds.add("(cat " + containerLogDir + "/*stderr* 1>&2"); + newCmds.add(";"); + newCmds.add("exit -1)"); + launchContext.setCommands(newCmds); Map<String, String> environment = launchContext.getEnvironment(); {code} Then we get the desired effect: {code} ]$ hadoop org.apache.hadoop.mapreduce.SleepJob -Dyarn.app.mapreduce.am.env=JAVA_HOME=/no/jvm/here -m 10 15/01/06 23:36:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/01/06 23:36:14 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032 15/01/06 23:36:15 INFO mapreduce.JobSubmitter: number of splits:10 15/01/06 23:36:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1420533216615_0013 15/01/06 23:36:16 INFO impl.YarnClientImpl: Submitted application application_1420533216615_0013 15/01/06 23:36:16 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1420533216615_0013/ 15/01/06 23:36:16 INFO mapreduce.Job: Running job: job_1420533216615_0013 15/01/06 23:36:21 INFO mapreduce.Job: Job job_1420533216615_0013 running in uber mode : false 15/01/06 23:36:21 INFO mapreduce.Job: map 0% reduce 0% 15/01/06 23:36:21 INFO mapreduce.Job: Job job_1420533216615_0013 failed with state FAILED due to: Application application_1420533216615_0013 failed 2 times due to AM Container for appattempt_1420533216615_0013_02 exited with exitCode: 255 For more detailed output, check application tracking page:http://localhost:8088/proxy/application_1420533216615_0013/Then, click on links to logs of each attempt. Diagnostics: Exception from container-launch. 
Container id: container_1420533216615_0013_02_01 Exit code: 255 Exception message: /bin/bash: /no/jvm/here/bin/java: No such file or directory Stack trace: ExitCodeException exitCode=255: /bin/bash: /no/jvm/here/bin/java: No such file or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:307) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} Improve handling of container's stderr --- Key: YARN-2934 URL: https://issues.apache.org/jira/browse/YARN-2934 Project: Hadoop YARN Issue Type: Improvement Reporter: Gera Shegalov Assignee: Naganarasimha G R Most YARN applications redirect stderr to some file. That's why when container launch fails with {{ExitCodeException}} the message is empty. -- This message
[jira] [Commented] (YARN-2745) Extend YARN to support multi-resource packing of tasks
[ https://issues.apache.org/jira/browse/YARN-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247515#comment-14247515 ] Gera Shegalov commented on YARN-2745: - Thanks for filing this JIRA, [~rgrandl]! We have a number of use cases where we need to schedule by network bandwidth instead of memory/cores. Extend YARN to support multi-resource packing of tasks -- Key: YARN-2745 URL: https://issues.apache.org/jira/browse/YARN-2745 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, scheduler Reporter: Robert Grandl Attachments: sigcomm_14_tetris_talk.pptx, tetris_design_doc.docx, tetris_paper.pdf In this umbrella JIRA we propose an extension to existing scheduling techniques, which accounts for all resources used by a task (CPU, memory, disk, network) and is able to achieve three competing objectives: fairness, improved cluster utilization, and reduced average job completion time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2934) Improve handling of container's stderr
Gera Shegalov created YARN-2934: --- Summary: Improve handling of container's stderr Key: YARN-2934 URL: https://issues.apache.org/jira/browse/YARN-2934 Project: Hadoop YARN Issue Type: Improvement Reporter: Gera Shegalov Most YARN applications redirect stderr to some file. That's why when container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239112#comment-14239112 ] Gera Shegalov commented on YARN-2934: - We need to make sure that the stderr location is made known in the container launch context such that the wrapper script can cat it to its stderr, where it can be consumed by {{Shell}} Improve handling of container's stderr --- Key: YARN-2934 URL: https://issues.apache.org/jira/browse/YARN-2934 Project: Hadoop YARN Issue Type: Improvement Reporter: Gera Shegalov Most YARN applications redirect stderr to some file. That's why when container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
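The idea in this comment — have the launcher surface the container's stderr when the launch fails — can be sketched at the command-assembly level. The helper below is purely illustrative (withStderrFallback, the tail size, and the path glob are assumptions, not ContainerLaunch API); it appends a shell-level fallback so a failed launch copies the tail of any *stderr* file to the launcher's own fd 2, where {{Shell}} can capture it:

```java
import java.util.ArrayList;
import java.util.List;

public class StderrFallbackSketch {
    // Hypothetical helper, NOT the real ContainerLaunch code: wrap the
    // container command with "|| (tail ... 1>&2 ; exit 1)" so that on
    // failure the stderr file's tail lands on the launcher's fd 2.
    static List<String> withStderrFallback(List<String> cmds, String containerLogDir) {
        List<String> wrapped = new ArrayList<>(cmds);
        wrapped.add("||");
        // Assumes the stderr log has "stderr" in its file name, as the
        // comment above presumes.
        wrapped.add("(tail -c 4096 " + containerLogDir + "/*stderr* 1>&2");
        wrapped.add(";");
        wrapped.add("exit 1)");
        return wrapped;
    }

    public static void main(String[] args) {
        List<String> cmds = List.of("$JAVA_HOME/bin/java MyAM 2>/logdir/stderr");
        System.out.println(String.join(" ", withStderrFallback(cmds, "/logdir")));
    }
}
```

Using tail rather than cat bounds how much diagnostic text is pushed through the exit path, which matters when an app writes a large stderr file.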
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239779#comment-14239779 ] Gera Shegalov commented on YARN-2934: - Hi [~Naganarasimha], yes that's what I meant. Maybe this is specific to the {{DefaultContainerExecutor}}. When testing on my macbook: {code} $ hadoop org.apache.hadoop.mapreduce.SleepJob -Dyarn.app.mapreduce.am.env=JAVA_HOME=/no/jvm/here -m 1 {code} All you get: {code} 2014-12-09 09:15:00,252 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1418144997824_0001_01_01 and exit code: 127 ExitCodeException exitCode=127: at org.apache.hadoop.util.Shell.runCommand(Shell.java:544) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:721) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} In the stderr log of the container, you can see the real deal: {code} Log Type: stderr Log Upload Time: Tue Dec 09 09:15:05 -0800 2014 Log Length: 60 /bin/bash: /no/jvm/here/bin/java: No such file or directory {code} Improve handling of container's stderr --- Key: YARN-2934 URL: https://issues.apache.org/jira/browse/YARN-2934 Project: Hadoop YARN Issue Type: Improvement Reporter: Gera Shegalov Assignee: Naganarasimha G R Most YARN applications redirect stderr to some 
file. That's why when container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239785#comment-14239785 ] Gera Shegalov commented on YARN-2934: - Hi [~xgong], bq. we already have a Environment.LOG_DIRS env which returns a comma separated list of log-dirs. I am aware of this. However, each app is free to redirect its standard error (fd 2) to any file under this dir. Improve handling of container's stderr --- Key: YARN-2934 URL: https://issues.apache.org/jira/browse/YARN-2934 Project: Hadoop YARN Issue Type: Improvement Reporter: Gera Shegalov Assignee: Naganarasimha G R Most YARN applications redirect stderr to some file. That's why when container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2899) Run TestDockerContainerExecutorWithMocks on Linux only
[ https://issues.apache.org/jira/browse/YARN-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224095#comment-14224095 ] Gera Shegalov commented on YARN-2899: - Thanks for the patch, [~mingma]. Explicit listing of supported OS is a better approach. +1 (non-binding) Run TestDockerContainerExecutorWithMocks on Linux only -- Key: YARN-2899 URL: https://issues.apache.org/jira/browse/YARN-2899 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma Priority: Minor Attachments: YARN-2899.patch It seems the test should strictly check for Linux, otherwise, it will fail when the OS isn't Linux. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
Gera Shegalov created YARN-2893: --- Summary: AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221662#comment-14221662 ] Gera Shegalov commented on YARN-2893: - Here is the stack trace: {code} Got exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.security.Credentials.readTokenStorageStream(Credentials.java:189) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.setupTokens(AMLauncher.java:225) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.createAMContainerLaunchContext(AMLauncher.java:196) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:107) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:250) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Since the launch context is corrupt, all subsequent app attempts (up to the max) fail as well. This is a non-deterministic Heisenbug that does not reproduce on job re-submission. AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
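The EOFException in that trace comes from DataInputStream.readFully, which throws rather than returning a short read when the stream ends early — exactly what a truncated or corrupt token blob in the launch context produces inside Credentials.readTokenStorageStream. A stand-alone illustration of that contract (not RM code):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TruncatedTokensDemo {
    // readFully demands exactly expectedLen bytes; a truncated stream
    // yields EOFException instead of a short count, mirroring the
    // readTokenStorageStream failure mode.
    public static boolean readFullyFails(byte[] data, int expectedLen) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readFully(new byte[expectedLen]);
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readFullyFails(new byte[4], 8)); // truncated stream: true
        System.out.println(readFullyFails(new byte[8], 8)); // complete stream: false
    }
}
```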
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212671#comment-14212671 ] Gera Shegalov commented on YARN-2862: - [~mingma], It's potentially already fixed by YARN-2010. We can try it for our scenario. RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used --- Key: YARN-2862 URL: https://issues.apache.org/jira/browse/YARN-2862 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma This might be a known issue. Given FileSystemRMStateStore isn't used for HA scenario, it might not be that important, unless there is something we need to fix at RM layer to make it more tolerant to RMStore issue. When RM was hard shutdown, OS might not get a chance to persist blocks. Some of the stored application data end up with size zero after reboot. And RM didn't like that. {noformat} ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351 total 156 drwxr-xr-x.2 x y 4096 Nov 13 16:45 . drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 .. -rw-r--r--.1 x y 0 Nov 13 16:45 appattempt_1412702189634_324351_01 -rw-r--r--.1 x y 0 Nov 13 16:45 .appattempt_1412702189634_324351_01.crc -rw-r--r--.1 x y 0 Nov 13 16:45 application_1412702189634_324351 -rw-r--r--.1 x y 0 Nov 13 16:45 .application_1412702189634_324351.crc {noformat} When RM starts up {noformat} 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. 
Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:146) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501) ... 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212935#comment-14212935 ] Gera Shegalov commented on YARN-2862: - [~jianhe], to add more details: we use 2.4+patches, YARN-1185 is in 2.3. RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used --- Key: YARN-2862 URL: https://issues.apache.org/jira/browse/YARN-2862 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma This might be a known issue. Given FileSystemRMStateStore isn't used for HA scenario, it might not be that important, unless there is something we need to fix at RM layer to make it more tolerant to RMStore issue. When RM was hard shutdown, OS might not get a chance to persist blocks. Some of the stored application data end up with size zero after reboot. And RM didn't like that. {noformat} ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351 total 156 drwxr-xr-x.2 x y 4096 Nov 13 16:45 . drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 .. -rw-r--r--.1 x y 0 Nov 13 16:45 appattempt_1412702189634_324351_01 -rw-r--r--.1 x y 0 Nov 13 16:45 .appattempt_1412702189634_324351_01.crc -rw-r--r--.1 x y 0 Nov 13 16:45 application_1412702189634_324351 -rw-r--r--.1 x y 0 Nov 13 16:45 .application_1412702189634_324351.crc {noformat} When RM starts up {noformat} 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. 
Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:146) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501) ... 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
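One way to make the RM layer more tolerant here, sketched under the assumption that a zero-length application state file can never be valid, is to skip (or quarantine) such files during recovery instead of letting the loader hit the NullPointerException above. isRecoverable is a hypothetical guard, not FileSystemRMStateStore API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StateFileGuard {
    // Hypothetical recovery guard: a hard shutdown can leave zero-length
    // state files behind, and deserializing them yields null app state.
    // Skipping them lets recovery proceed with the remaining apps.
    public static boolean isRecoverable(Path stateFile) throws IOException {
        return Files.isRegularFile(stateFile) && Files.size(stateFile) > 0;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("application_state", ".bin");
        System.out.println(isRecoverable(f)); // empty file: false
        Files.write(f, new byte[] {1});
        System.out.println(isRecoverable(f)); // non-empty: true
        Files.delete(f);
    }
}
```

Whether to drop or fail on such files is a policy choice; the sketch only shows the detection side.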
[jira] [Commented] (YARN-2857) ConcurrentModificationException in ContainerLogAppender
[ https://issues.apache.org/jira/browse/YARN-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14211519#comment-14211519 ] Gera Shegalov commented on YARN-2857: - Clean Jenkins in the last build demonstrates that the patch fixes the reproducer in the previous build: {code} testAppendInClose(org.apache.hadoop.yarn.TestContainerLogAppender) Time elapsed: 0.066 sec ERROR! java.util.ConcurrentModificationException: null at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:761) at java.util.LinkedList$ListItr.next(LinkedList.java:696) at org.apache.hadoop.yarn.ContainerLogAppender.close(ContainerLogAppender.java:81) at org.apache.hadoop.yarn.TestContainerLogAppender.testAppendInClose(TestContainerLogAppender.java:44) {code} . +1(non-binding) ConcurrentModificationException in ContainerLogAppender --- Key: YARN-2857 URL: https://issues.apache.org/jira/browse/YARN-2857 Project: Hadoop YARN Issue Type: Bug Reporter: Mohammad Kamrul Islam Assignee: Mohammad Kamrul Islam Priority: Critical Attachments: ContainerLogAppender.java, MAPREDUCE-6139-test.01.patch, MAPREDUCE-6139.1.patch, MAPREDUCE-6139.2.patch, MAPREDUCE-6139.3.patch, YARN-2857.3.patch Context: * Hadoop-2.3.0 * Using Oozie 4.0.1 * Pig version 0.11.x The job is submitted by Oozie to launch Pig script. 
The following exception traces were found on MR task log: In syslog: {noformat} 2014-10-24 20:37:29,317 WARN [Thread-5] org.apache.hadoop.util.ShutdownHookManager: ShutdownHook '' failed, java.util.ConcurrentModificationException java.util.ConcurrentModificationException at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966) at java.util.LinkedList$ListItr.next(LinkedList.java:888) at org.apache.hadoop.yarn.ContainerLogAppender.close(ContainerLogAppender.java:94) at org.apache.log4j.helpers.AppenderAttachableImpl.removeAllAppenders(AppenderAttachableImpl.java:141) at org.apache.log4j.Category.removeAllAppenders(Category.java:891) at org.apache.log4j.Hierarchy.shutdown(Hierarchy.java:471) at org.apache.log4j.LogManager.shutdown(LogManager.java:267) at org.apache.hadoop.mapred.TaskLog.syncLogsShutdown(TaskLog.java:286) at org.apache.hadoop.mapred.TaskLog$2.run(TaskLog.java:339) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) 2014-10-24 20:37:29,395 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system... 
{noformat} in stderr: {noformat} java.util.ConcurrentModificationException at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966) at java.util.LinkedList$ListItr.next(LinkedList.java:888) at org.apache.hadoop.yarn.ContainerLogAppender.close(ContainerLogAppender.java:94) at org.apache.log4j.helpers.AppenderAttachableImpl.removeAllAppenders(AppenderAttachableImpl.java:141) at org.apache.log4j.Category.removeAllAppenders(Category.java:891) at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:759) at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:648) at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:514) at org.apache.log4j.PropertyConfigurator.configure(PropertyConfigurator.java:440) at org.apache.pig.Main.configureLog4J(Main.java:740) at org.apache.pig.Main.run(Main.java:384) at org.apache.pig.PigRunner.run(PigRunner.java:49) at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:283) at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:223) at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:37) at org.apache.oozie.action.hadoop.PigMain.main(PigMain.java:76) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:226) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at
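The underlying mechanism in those traces is generic Java behavior: LinkedList's iterator is fail-fast, so any structural modification during iteration — here an append racing with close() — throws ConcurrentModificationException. A self-contained illustration of the failure and of a snapshot-before-flush style of fix (method names are illustrative, not the patch's):

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.LinkedList;
import java.util.List;

public class AppenderCloseDemo {
    // Simulates close() iterating the buffered-event list while an
    // append sneaks in: the fail-fast iterator throws CME.
    static boolean closeWithIterator(LinkedList<String> tail) {
        try {
            for (String ev : tail) {
                tail.add("appended-during-close"); // structural modification mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // One common remedy: snapshot the list first, then flush the copy,
    // so late appends cannot invalidate the iterator.
    static List<String> closeWithSnapshot(LinkedList<String> tail) {
        return new ArrayList<>(tail);
    }

    public static void main(String[] args) {
        System.out.println(closeWithIterator(new LinkedList<>(List.of("e1", "e2")))); // true
        System.out.println(closeWithSnapshot(new LinkedList<>(List.of("e1"))).size()); // 1
    }
}
```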
[jira] [Assigned] (YARN-2707) Potential null dereference in FSDownload
[ https://issues.apache.org/jira/browse/YARN-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov reassigned YARN-2707: --- Assignee: Gera Shegalov Potential null dereference in FSDownload Key: YARN-2707 URL: https://issues.apache.org/jira/browse/YARN-2707 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Gera Shegalov Priority: Minor Here is related code in call(): {code} Pattern pattern = null; String p = resource.getPattern(); if (p != null) { pattern = Pattern.compile(p); } unpack(new File(dTmp.toUri()), new File(dFinal.toUri()), pattern); {code} In unpack(): {code} RunJar.unJar(localrsrc, dst, pattern); {code} unJar() would dereference the pattern without checking whether it is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2707) Potential null dereference in FSDownload
[ https://issues.apache.org/jira/browse/YARN-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-2707: Attachment: YARN-2707.v01.patch Thanks for reporting this bug, [~yuzhih...@gmail.com]. It turns out that there is a test for this but it was not checking Future.get. Future.isDone returns true for failed Callables as well. Potential null dereference in FSDownload Key: YARN-2707 URL: https://issues.apache.org/jira/browse/YARN-2707 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Gera Shegalov Priority: Minor Attachments: YARN-2707.v01.patch Here is related code in call(): {code} Pattern pattern = null; String p = resource.getPattern(); if (p != null) { pattern = Pattern.compile(p); } unpack(new File(dTmp.toUri()), new File(dFinal.toUri()), pattern); {code} In unpack(): {code} RunJar.unJar(localrsrc, dst, pattern); {code} unJar() would dereference the pattern without checking whether it is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
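The shape of the fix is a plain null guard: a null pattern from LocalResource.getPattern() should mean "no filtering", not an NPE inside unJar. A hedged sketch — shouldExtract is illustrative, not the actual FSDownload/RunJar API:

```java
import java.util.regex.Pattern;

public class PatternGuardSketch {
    // Illustrative guard: when no pattern is supplied, extract every
    // entry instead of dereferencing a null Pattern.
    static boolean shouldExtract(String entryName, Pattern pattern) {
        return pattern == null || pattern.matcher(entryName).matches();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(?:classes/|lib/).*");
        System.out.println(shouldExtract("lib/foo.jar", null)); // true: no pattern, take all
        System.out.println(shouldExtract("lib/foo.jar", p));    // true: matches lib/
        System.out.println(shouldExtract("README", p));         // false: filtered out
    }
}
```

The test-gap mentioned in the comment is also worth noting: Future.isDone returns true for failed Callables too, so only Future.get surfaces the NPE in a unit test.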
[jira] [Commented] (YARN-2688) Better diagnostics on Container Launch failures
[ https://issues.apache.org/jira/browse/YARN-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171426#comment-14171426 ] Gera Shegalov commented on YARN-2688: - Localizer diagnostics was improved by YARN-2377. Better diagnostics on Container Launch failures --- Key: YARN-2688 URL: https://issues.apache.org/jira/browse/YARN-2688 Project: Hadoop YARN Issue Type: Bug Reporter: Arun C Murthy We need better diagnostics on container launch failures due to errors like localizations issues, wrong command for container launch etc. Currently, if the container doesn't launch, we get nothing - not even container logs since there are no logs to aggregate either. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1542) Add unit test for public resource on viewfs
[ https://issues.apache.org/jira/browse/YARN-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-1542: Attachment: YARN-1542.v05.patch v05: rebasing the patch again. Add unit test for public resource on viewfs --- Key: YARN-1542 URL: https://issues.apache.org/jira/browse/YARN-1542 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1542.v01.patch, YARN-1542.v02.patch, YARN-1542.v03.patch, YARN-1542.v04.patch, YARN-1542.v05.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-2377: Attachment: YARN-2377.v02.patch Thanks for review, [~jlowe]! Your points are valid, uploading v02 to accommodate them. Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-2377.v01.patch, YARN-2377.v02.patch In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132994#comment-14132994 ] Gera Shegalov commented on YARN-2377: - Hi [~Naganarasimha], there is no deserialization in the sense of converting bytes to the original exception class. These fields are already strings in yarn_protos.proto: {code} 33 message SerializedExceptionProto { 34 optional string message = 1; 35 optional string trace = 2; 36 optional string class_name = 3; 37 optional SerializedExceptionProto cause = 4; 38 } {code} Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-2377.v01.patch In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
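Since the proto fields are already strings, assembling diagnostics from a SerializedException chain is just string concatenation down the cause chain — no class loading involved. A minimal analogue over plain Throwables (illustrative only, not the YARN SerializedException API):

```java
public class DiagnosticsSketch {
    // Walk the cause chain and join class name + message, the same way a
    // string-only SerializedExceptionProto chain can be flattened into
    // diagnostic text without reconstructing the exception classes.
    static String toDiagnostics(Throwable t) {
        StringBuilder sb = new StringBuilder();
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            sb.append(cur.getClass().getName()).append(": ")
              .append(cur.getMessage()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Throwable t = new RuntimeException("localization failed",
            new java.net.UnknownHostException("ha-nn-uri-0"));
        System.out.print(toDiagnostics(t));
    }
}
```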
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125846#comment-14125846 ] Gera Shegalov commented on YARN-2377: - [~kasha], do you agree with the points above? Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-2377.v01.patch In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113398#comment-14113398 ] Gera Shegalov commented on YARN-2405: - YARN-2405.2.patch LGTM. We should let junit catch the original exception and it will properly fail the test. NPE in FairSchedulerAppsBlock (scheduler page) -- Key: YARN-2405 URL: https://issues.apache.org/jira/browse/YARN-2405 Project: Hadoop YARN Issue Type: Bug Reporter: Maysam Yabandeh Assignee: Tsuyoshi OZAWA Attachments: YARN-2405.1.patch, YARN-2405.2.patch, YARN-2405.3.patch FairSchedulerAppsBlock#render throws NPE at this line {code} int fairShare = fsinfo.getAppFairShare(attemptId); {code} This causes the scheduler page to not show the app since it lacks the definition of appsTableData {code} Uncaught ReferenceError: appsTableData is not defined {code} The problem is temporary, meaning that it is usually resolved by itself either after a retry or after a few hours. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114738#comment-14114738 ] Gera Shegalov commented on YARN-2405: - bq. ... I think both designs are acceptable in this case. True, but one has more code for no reason. NPE in FairSchedulerAppsBlock (scheduler page) -- Key: YARN-2405 URL: https://issues.apache.org/jira/browse/YARN-2405 Project: Hadoop YARN Issue Type: Bug Reporter: Maysam Yabandeh Assignee: Tsuyoshi OZAWA Attachments: YARN-2405.1.patch, YARN-2405.2.patch, YARN-2405.3.patch, YARN-2405.4.patch FairSchedulerAppsBlock#render throws NPE at this line {code} int fairShare = fsinfo.getAppFairShare(attemptId); {code} This causes the scheduler page to not show the app, since it lacks the definition of appsTableData {code} Uncaught ReferenceError: appsTableData is not defined {code} The problem is transient, usually resolving itself either after a retry or after a few hours.
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112028#comment-14112028 ] Gera Shegalov commented on YARN-2405: - +1 for not masking NPE bugs. It's a performance problem, and would catch irrelevant NPEs as well. NPE in FairSchedulerAppsBlock (scheduler page) -- Key: YARN-2405 URL: https://issues.apache.org/jira/browse/YARN-2405 Project: Hadoop YARN Issue Type: Bug Reporter: Maysam Yabandeh Assignee: Tsuyoshi OZAWA FairSchedulerAppsBlock#render throws NPE at this line {code} int fairShare = fsinfo.getAppFairShare(attemptId); {code} This causes the scheduler page to not show the app, since it lacks the definition of appsTableData {code} Uncaught ReferenceError: appsTableData is not defined {code} The problem is transient, usually resolving itself either after a retry or after a few hours.
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110221#comment-14110221 ] Gera Shegalov commented on YARN-2377: - Hi [~kasha], I considered {{StringUtils#stringifyException}} but discarded it due to the following disadvantages: # redundant deserialization of the exception object just for the sake of serializing it right away # as a consequence, hypothetically, when the localization service runs as a separate process with a dedicated classpath, we can encounter a {{ClassNotFoundException}} during deserialization Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-2377.v01.patch In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics.
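A minimal sketch of the alternative Gera describes: render the stack trace to a String on the side that caught the exception, so the receiving process only ever handles text and never needs the exception class on its classpath (the class and method names here are hypothetical, not Hadoop code):

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.net.UnknownHostException;

public class DiagnosticsSketch {
    // Serialize the Throwable once, as text, at the point of failure.
    // Because the receiver deals only with a String, no redundant
    // deserialization happens and no ClassNotFoundException can occur
    // in a process with a different classpath.
    static String stringifyOnSender(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    public static void main(String[] args) {
        // Mimic the localization failure from the JIRA description.
        String diagnostics = stringifyOnSender(new IllegalArgumentException(
            new UnknownHostException("ha-nn-uri-0")));
        System.out.println(diagnostics);
    }
}
```

The printed diagnostics include the full cause chain ("Caused by: java.net.UnknownHostException: ha-nn-uri-0"), which is exactly the actionable detail the issue says gets lost today.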
[jira] [Created] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
Gera Shegalov created YARN-2377: --- Summary: Localization exception stack traces are not passed as diagnostic info Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then onlt {{ java.net.UnknownHos tException: ha-nn-uri-0}} message is propagated as diagnostics.
[jira] [Updated] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-2377: Description: In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHos tException: ha-nn-uri-0}} message is propagated as diagnostics. was: In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then onlt {{ java.net.UnknownHos tException: ha-nn-uri-0}} message is propagated as diagnostics. Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHos tException: ha-nn-uri-0}} message is propagated as diagnostics.
[jira] [Updated] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-2377: Description: In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. was: In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHos tException: ha-nn-uri-0}} message is propagated as diagnostics. Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics.
[jira] [Updated] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-2377: Attachment: YARN-2377.v01.patch v01 for review. With this you get a more actionable stack trace: {code} 14/07/31 17:46:39 INFO mapreduce.Job: Job job_1406853387336_0001 failed with state FAILED due to: Application application_1406853387336_0001 failed 2 times due to AM Container for appattempt_1406853387336_0001_02 exited with exitCode: -1000 For more detailed output, check application tracking page:http://tw-mbp-gshegalov:8088/proxy/application_1406853387336_0001/Then, click on links to logs of each attempt. Diagnostics: java.net.UnknownHostException: ha-nn-uri-0 java.lang.IllegalArgumentException: java.net.UnknownHostException: ha-nn-uri-0 at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:260) at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:607) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:552) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:139) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2590) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2624) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2606) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:248) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:60) at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:356) at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:354) at java.security.AccessController.doPrivileged(Native
Method) at javax.security.auth.Subject.doAs(Subject.java:394) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1626) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:353) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:695) Caused by: java.net.UnknownHostException: ha-nn-uri-0 ... 29 more Caused by: ha-nn-uri-0 java.lang.IllegalArgumentException: java.net.UnknownHostException: ha-nn-uri-0 at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:260) at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:607) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:552) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:139) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2590) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2624) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2606) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:248) at
org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:60) at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:356) at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:354) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:394) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1626) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:353) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) at
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077453#comment-14077453 ] Gera Shegalov commented on YARN-796: Hi [~yufeldman], thanks for posting the patch. Please rebase it since it no longer applies. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels.
[jira] [Commented] (YARN-1741) XInclude support broken for YARN ResourceManager
[ https://issues.apache.org/jira/browse/YARN-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14052744#comment-14052744 ] Gera Shegalov commented on YARN-1741: - I made a modification to my patch (v04) for HADOOP-10623 to highlight how the xi:include issue can easily be resolved by using the existing FsUrlStreamHandler I stumbled upon. XInclude support broken for YARN ResourceManager Key: YARN-1741 URL: https://issues.apache.org/jira/browse/YARN-1741 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Eric Sirianni Assignee: Xuan Gong Priority: Critical Labels: regression The XInclude support in Hadoop configuration files (introduced via HADOOP-4944) was broken by the recent {{ConfigurationProvider}} changes to YARN ResourceManager. Specifically, YARN-1459 and, more generally, the YARN-1611 family of JIRAs for ResourceManager HA. The issue is that {{ConfigurationProvider}} provides a raw {{InputStream}} as a {{Configuration}} resource for what was previously a {{Path}}-based resource. For {{Path}} resources, the absolute file path is used as the {{systemId}} for the {{DocumentBuilder.parse()}} call: {code} } else if (resource instanceof Path) { // a file resource ... doc = parse(builder, new BufferedInputStream( new FileInputStream(file)), ((Path)resource).toString()); } {code} The {{systemId}} is used to resolve XIncludes (among other things): {code} /** * Parse the content of the given <code>InputStream</code> as an * XML document and return a new DOM Document object. ... * @param systemId Provide a base for resolving relative URIs. ... */ public Document parse(InputStream is, String systemId) {code} However, for loading raw {{InputStream}} resources, the {{systemId}} is set to {{null}}: {code} } else if (resource instanceof InputStream) { doc = parse(builder, (InputStream) resource, null); {code} causing XInclude resolution to fail.
In our particular environment, we make extensive use of XIncludes to standardize common configuration parameters across multiple Hadoop clusters.
[jira] [Commented] (YARN-1741) XInclude support broken for YARN ResourceManager
[ https://issues.apache.org/jira/browse/YARN-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043758#comment-14043758 ] Gera Shegalov commented on YARN-1741: - Since there is a general problem of loading conf via InputStream, to support these cases we need to enable users to pass a custom EntityResolver. We should implement this kind of method: {code} Configuration#addResource(InputStream is, EntityResolver er) {code} XInclude support broken for YARN ResourceManager Key: YARN-1741 URL: https://issues.apache.org/jira/browse/YARN-1741 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Eric Sirianni Priority: Critical Labels: regression The XInclude support in Hadoop configuration files (introduced via HADOOP-4944) was broken by the recent {{ConfigurationProvider}} changes to YARN ResourceManager. Specifically, YARN-1459 and, more generally, the YARN-1611 family of JIRAs for ResourceManager HA. The issue is that {{ConfigurationProvider}} provides a raw {{InputStream}} as a {{Configuration}} resource for what was previously a {{Path}}-based resource. For {{Path}} resources, the absolute file path is used as the {{systemId}} for the {{DocumentBuilder.parse()}} call: {code} } else if (resource instanceof Path) { // a file resource ... doc = parse(builder, new BufferedInputStream( new FileInputStream(file)), ((Path)resource).toString()); } {code} The {{systemId}} is used to resolve XIncludes (among other things): {code} /** * Parse the content of the given <code>InputStream</code> as an * XML document and return a new DOM Document object. ... * @param systemId Provide a base for resolving relative URIs. ...
*/ public Document parse(InputStream is, String systemId) {code} However, for loading raw {{InputStream}} resources, the {{systemId}} is set to {{null}}: {code} } else if (resource instanceof InputStream) { doc = parse(builder, (InputStream) resource, null); {code} causing XInclude resolution to fail. In our particular environment, we make extensive use of XIncludes to standardize common configuration parameters across multiple Hadoop clusters.
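The systemId mechanics described above can be demonstrated with a small self-contained sketch (the class, helper names, and file contents here are hypothetical, not Hadoop's {{Configuration}} code): parsing an InputStream with a base URI as the systemId lets a relative xi:include resolve, whereas passing null reproduces the reported failure.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XIncludeSketch {
    // Parse an InputStream, passing a base URI as the systemId so that
    // relative xi:include hrefs can be resolved. Passing null here
    // is what breaks XInclude resolution in the bug report.
    static Document parseWithBase(InputStream is, String systemId) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        f.setXIncludeAware(true); // enable XInclude processing
        DocumentBuilder b = f.newDocumentBuilder();
        return b.parse(is, systemId);
    }

    // Write a tiny include target to disk, then parse a document that
    // references it with a relative href; returns how many <name>
    // elements made it into the merged DOM.
    static int includedNameCount() throws Exception {
        Path dir = Files.createTempDirectory("xinclude-demo");
        Files.write(dir.resolve("common.xml"),
            "<property><name>shared.setting</name></property>"
                .getBytes(StandardCharsets.UTF_8));
        String main = "<configuration xmlns:xi='http://www.w3.org/2001/XInclude'>"
            + "<xi:include href='common.xml'/></configuration>";
        Document doc = parseWithBase(
            new ByteArrayInputStream(main.getBytes(StandardCharsets.UTF_8)),
            dir.toUri().toString()); // base URI for resolving common.xml
        return doc.getElementsByTagName("name").getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(includedNameCount());
    }
}
```

With the directory's URI as the base, the relative href resolves and the included element appears in the DOM; with a null systemId the same parse has no base to resolve against.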
[jira] [Commented] (YARN-1897) Define SignalContainerRequest and SignalContainerResponse
[ https://issues.apache.org/jira/browse/YARN-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002875#comment-14002875 ] Gera Shegalov commented on YARN-1897: - I am confused, [~mingma]. I thought we agreed to do it as YARN-1515. Define SignalContainerRequest and SignalContainerResponse - Key: YARN-1897 URL: https://issues.apache.org/jira/browse/YARN-1897 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1897-2.patch, YARN-1897-3.patch, YARN-1897-4.patch, YARN-1897.1.patch We need to define SignalContainerRequest and SignalContainerResponse first as they are needed by other subtasks. SignalContainerRequest should use OS-independent commands and provide a way for the application to specify a reason for diagnosis. SignalContainerResponse might be empty.
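A sketch of what "OS-independent commands" might look like, mapped to POSIX signals on a Unix NodeManager (the enum constants and the mapping below are illustrative assumptions, not the committed YARN API):

```java
public class SignalSketch {
    // Hypothetical OS-independent commands an application could send.
    enum SignalContainerCommand {
        OUTPUT_THREAD_DUMP,   // e.g. for diagnosing timed-out task attempts
        GRACEFUL_SHUTDOWN,
        FORCEFUL_SHUTDOWN
    }

    // On a POSIX NodeManager these could translate to signals;
    // a Windows container executor would map them differently.
    static String toPosixSignal(SignalContainerCommand cmd) {
        switch (cmd) {
            case OUTPUT_THREAD_DUMP: return "SIGQUIT"; // JVM dumps threads to stdout
            case GRACEFUL_SHUTDOWN:  return "SIGTERM";
            case FORCEFUL_SHUTDOWN:  return "SIGKILL";
            default: throw new IllegalArgumentException(String.valueOf(cmd));
        }
    }

    public static void main(String[] args) {
        System.out.println(toPosixSignal(SignalContainerCommand.OUTPUT_THREAD_DUMP));
    }
}
```

Keeping the wire protocol in terms of abstract commands rather than raw signal numbers is what makes the request portable across container executors.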
[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997266#comment-13997266 ] Gera Shegalov commented on YARN-1515: - Ok, I can work on CMP.signalContainer and replace stopContainers with signalContainer Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts.
[jira] [Updated] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-1515: Attachment: YARN-1515.v07.patch v07 addressing Jason's review. Thanks! Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: New Feature Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts.