[jira] [Commented] (YARN-7747) YARN UI is broken in the minicluster

2023-06-05 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17729468#comment-17729468
 ] 

Gera Shegalov commented on YARN-7747:
-

Sorry I dropped the ball on this JIRA. I have no bandwidth to work on it. I 
unassigned it from myself so that someone else can pick it up.

> YARN UI is broken in the minicluster 
> -
>
> Key: YARN-7747
> URL: https://issues.apache.org/jira/browse/YARN-7747
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: YARN-7747.001.patch, YARN-7747.002.patch
>
>
> YARN web apps use non-injected instances of GuiceFilter, i.e. instances 
> created by Jetty as opposed to instances created by Guice itself. This 
> triggers the [call 
> path|https://github.com/google/guice/blob/master/extensions/servlet/src/com/google/inject/servlet/GuiceFilter.java#L251]
>  where the static field {{pipeline}} is used instead of the instance field 
> {{injectedPipeline}}. However, besides the GuiceFilter instances created by 
> Jetty, each Guice module generates them as well. On the injection call path 
> this static variable is updated by each instance. Thus, if there are multiple 
> modules, as happens to be the case in the minicluster, the one loaded last 
> ends up defining the filter pipeline for all Jetty instances. In the 
> minicluster case this is the nodemanager UI.
>  
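A minimal, self-contained analogue of the clobbering described above; it mimics, rather than uses, the real GuiceFilter, and every name in it is invented for illustration:
{code}
// Simplified analogue of GuiceFilter's two fields. A static pipeline shared by
// ALL instances is overwritten whenever a module creates an instance, so a
// container-created (non-injected) instance falls back to whatever was set last.
class PipelineFilter {
  static volatile String staticPipeline = "default"; // shared across instances
  final String injectedPipeline;                     // set only via injection

  PipelineFilter() { this.injectedPipeline = null; }                     // Jetty path
  PipelineFilter(String p) { injectedPipeline = p; staticPipeline = p; } // Guice path

  String effectivePipeline() {
    return injectedPipeline != null ? injectedPipeline : staticPipeline;
  }
}

public class MiniClusterFilterDemo {
  public static void main(String[] args) {
    new PipelineFilter("rm-webapp");                 // RM module creates its filter
    new PipelineFilter("nm-webapp");                 // NM module loads last
    PipelineFilter jettyMade = new PipelineFilter(); // what Jetty registers
    System.out.println(jettyMade.effectivePipeline()); // prints "nm-webapp"
  }
}
{code}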






[jira] [Assigned] (YARN-7747) YARN UI is broken in the minicluster

2023-06-05 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov reassigned YARN-7747:
---

Assignee: (was: Gera Shegalov)

> YARN UI is broken in the minicluster 
> -
>
> Key: YARN-7747
> URL: https://issues.apache.org/jira/browse/YARN-7747
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: YARN-7747.001.patch, YARN-7747.002.patch
>
>
> YARN web apps use non-injected instances of GuiceFilter, i.e. instances 
> created by Jetty as opposed to instances created by Guice itself. This 
> triggers the [call 
> path|https://github.com/google/guice/blob/master/extensions/servlet/src/com/google/inject/servlet/GuiceFilter.java#L251]
>  where the static field {{pipeline}} is used instead of the instance field 
> {{injectedPipeline}}. However, besides the GuiceFilter instances created by 
> Jetty, each Guice module generates them as well. On the injection call path 
> this static variable is updated by each instance. Thus, if there are multiple 
> modules, as happens to be the case in the minicluster, the one loaded last 
> ends up defining the filter pipeline for all Jetty instances. In the 
> minicluster case this is the nodemanager UI.
>  






[jira] [Assigned] (YARN-11055) In cgroups-operations.c some fprintf format strings don't end with "\n"

2022-01-03 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov reassigned YARN-11055:


Assignee: Gera Shegalov

> In cgroups-operations.c some fprintf format strings don't end with "\n" 
> 
>
> Key: YARN-11055
> URL: https://issues.apache.org/jira/browse/YARN-11055
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.3.1
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Minor
>  Labels: cgroups, easyfix
>
> In cgroups-operations.c some {{fprintf}}s are missing a newline character at 
> the end, leading to hard-to-parse error message output. 
> Example: 
> https://github.com/apache/hadoop/blame/b225287913ac366a531eacfa0266adbdf03d883e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/cgroups/cgroups-operations.c#L130
>  






[jira] [Updated] (YARN-11055) In cgroups-operations.c some fprintf format strings don't end with "\n"

2021-12-31 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-11055:
-
Summary: In cgroups-operations.c some fprintf format strings don't end with 
"\n"   (was: In cgroups-operations.c some fprintf format strings lack "\n" )

> In cgroups-operations.c some fprintf format strings don't end with "\n" 
> 
>
> Key: YARN-11055
> URL: https://issues.apache.org/jira/browse/YARN-11055
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.3.1
>Reporter: Gera Shegalov
>Priority: Minor
>  Labels: cgroups, easyfix
>
> In cgroups-operations.c some {{fprintf}}s are missing a newline character at 
> the end, leading to hard-to-parse error message output. 
> Example: 
> https://github.com/apache/hadoop/blame/b225287913ac366a531eacfa0266adbdf03d883e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/cgroups/cgroups-operations.c#L130
>  






[jira] [Updated] (YARN-11055) In cgroups-operations.c some fprintf format strings lack "\n"

2021-12-31 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-11055:
-
Summary: In cgroups-operations.c some fprintf format strings lack "\n"   
(was: cgroups-operations.c some fprintf format strings lack "\n" )

> In cgroups-operations.c some fprintf format strings lack "\n" 
> --
>
> Key: YARN-11055
> URL: https://issues.apache.org/jira/browse/YARN-11055
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.3.1
>Reporter: Gera Shegalov
>Priority: Minor
>  Labels: cgroups, easyfix
>
> In cgroups-operations.c some {{fprintf}}s are missing a newline character at 
> the end, leading to hard-to-parse error message output. 
> Example: 
> https://github.com/apache/hadoop/blame/b225287913ac366a531eacfa0266adbdf03d883e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/cgroups/cgroups-operations.c#L130
>  






[jira] [Updated] (YARN-11055) cgroups-operations.c some fprintf format strings lack "\n"

2021-12-30 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-11055:
-
Priority: Minor  (was: Major)

> cgroups-operations.c some fprintf format strings lack "\n" 
> ---
>
> Key: YARN-11055
> URL: https://issues.apache.org/jira/browse/YARN-11055
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.3.1
>Reporter: Gera Shegalov
>Priority: Minor
>  Labels: cgroups, easyfix
>
> In cgroups-operations.c some {{fprintf}}s are missing a newline character at 
> the end, leading to hard-to-parse error message output. 
> Example: 
> https://github.com/apache/hadoop/blame/b225287913ac366a531eacfa0266adbdf03d883e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/cgroups/cgroups-operations.c#L130
>  






[jira] [Created] (YARN-11056) Incorrect capitalization of NVIDIA in the docs

2021-12-30 Thread Gera Shegalov (Jira)
Gera Shegalov created YARN-11056:


 Summary: Incorrect capitalization of NVIDIA in the docs 
 Key: YARN-11056
 URL: https://issues.apache.org/jira/browse/YARN-11056
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Gera Shegalov


According to [https://www.nvidia.com/en-us/about-nvidia/legal-info/], the 
spelling should be all-caps NVIDIA.

Examples of differing capitalization: 
https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/UsingGpus.md

 






[jira] [Created] (YARN-11055) cgroups-operations.c some fprintf format strings lack "\n"

2021-12-30 Thread Gera Shegalov (Jira)
Gera Shegalov created YARN-11055:


 Summary: cgroups-operations.c some fprintf format strings lack 
"\n" 
 Key: YARN-11055
 URL: https://issues.apache.org/jira/browse/YARN-11055
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.3.1, 3.3.0, 3.2.0, 3.1.0, 3.0.0
Reporter: Gera Shegalov


In cgroups-operations.c some {{fprintf}}s are missing a newline character at the 
end, leading to hard-to-parse error message output. 

Example: 
https://github.com/apache/hadoop/blame/b225287913ac366a531eacfa0266adbdf03d883e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/cgroups/cgroups-operations.c#L130
 






[jira] [Commented] (YARN-1529) Add Localization overhead metrics to NM

2020-07-31 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169137#comment-17169137
 ] 

Gera Shegalov commented on YARN-1529:
-

I am glad this is still useful. Thanks for committing, [~Jim_Brennan] [~epayne]!

> Add Localization overhead metrics to NM
> ---
>
> Key: YARN-1529
> URL: https://issues.apache.org/jira/browse/YARN-1529
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Gera Shegalov
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.2.2, 2.10.1, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-1529-branch-2.10.001.patch, YARN-1529.005.patch, 
> YARN-1529.006.patch, YARN-1529.v01.patch, YARN-1529.v02.patch, 
> YARN-1529.v03.patch, YARN-1529.v04.patch
>
>
> Users are often unaware of the localization cost that their jobs incur. To 
> measure the effectiveness of localization caches it is necessary to expose 
> the overhead in the form of metrics.
> We propose the addition of the following metrics to NodeManagerMetrics.
> When a container is about to launch, its set of LocalResources has to be 
> fetched from a central location, typically on HDFS, which results in a number 
> of download requests for the files missing from the caches.
> LocalizedFilesMissed: total files (requests) downloaded from DFS. Cache 
> misses.
> LocalizedFilesCached: total localization requests that were served from local 
> caches. Cache hits.
> LocalizedBytesMissed: total bytes downloaded from DFS due to cache misses.
> LocalizedBytesCached: total bytes satisfied from local caches.
> Localized(Files|Bytes)CachedRatio: percentage of localized (files|bytes) that 
> were served out of cache: ratio = 100 * caches / (caches + misses).
> LocalizationDownloadNanos: total elapsed time in nanoseconds for a container 
> to go from ResourceRequestTransition to LocalizedTransition.
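To make the proposed ratio formula concrete, a toy computation with made-up counter values (the variable names follow the metric names proposed above):
{code}
public class CachedRatioExample {
  public static void main(String[] args) {
    long localizedFilesMissed = 25; // cache misses: files downloaded from DFS
    long localizedFilesCached = 75; // cache hits: served from local caches
    long ratio = 100 * localizedFilesCached
        / (localizedFilesCached + localizedFilesMissed);
    System.out.println("LocalizedFilesCachedRatio = " + ratio + "%"); // 75%
  }
}
{code}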






[jira] [Commented] (YARN-7747) YARN UI is broken in the minicluster

2018-10-03 Thread Gera Shegalov (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637609#comment-16637609
 ] 

Gera Shegalov commented on YARN-7747:
-

[~ste...@apache.org] we definitely need tests to prevent this kind of 
regression in the future. We could make sure that all web/http address keys are 
properly reflected in the MiniYARNCluster#getConfig implementation and then 
probe all of them through an easy-to-validate REST API: the RM URI should 
respond to the RM-specific REST endpoints, and so on.
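A hedged sketch of what such a probe could look like; MiniYARNCluster, YarnConfiguration, and the RM's /ws/v1/cluster/info endpoint are real, but the test shape below is just one possible way to do it:
{code}
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.MiniYARNCluster;

public class MiniClusterUiProbe {
  public static void main(String[] args) throws Exception {
    MiniYARNCluster cluster = new MiniYARNCluster("uiProbe", 1, 1, 1);
    cluster.init(new YarnConfiguration());
    cluster.start();
    try {
      // MiniYARNCluster#getConfig reflects the actual (possibly ephemeral) ports.
      String rmWebApp =
          cluster.getConfig().get(YarnConfiguration.RM_WEBAPP_ADDRESS);
      // A 200 from an RM-specific endpoint implies the RM web app is serving
      // the RM filter pipeline rather than, say, the NM one.
      URL url = new URL("http://" + rmWebApp + "/ws/v1/cluster/info");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      System.out.println("RM /ws/v1/cluster/info -> HTTP " + conn.getResponseCode());
    } finally {
      cluster.stop();
    }
  }
}
{code}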

> YARN UI is broken in the minicluster 
> -
>
> Key: YARN-7747
> URL: https://issues.apache.org/jira/browse/YARN-7747
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Major
> Attachments: YARN-7747.001.patch, YARN-7747.002.patch
>
>
> YARN web apps use non-injected instances of GuiceFilter, i.e. instances 
> created by Jetty as opposed to instances created by Guice itself. This 
> triggers the [call 
> path|https://github.com/google/guice/blob/master/extensions/servlet/src/com/google/inject/servlet/GuiceFilter.java#L251]
>  where the static field {{pipeline}} is used instead of the instance field 
> {{injectedPipeline}}. However, besides the GuiceFilter instances created by 
> Jetty, each Guice module generates them as well. On the injection call path 
> this static variable is updated by each instance. Thus, if there are multiple 
> modules, as happens to be the case in the minicluster, the one loaded last 
> ends up defining the filter pipeline for all Jetty instances. In the 
> minicluster case this is the nodemanager UI.
>  






[jira] [Updated] (YARN-7747) YARN UI is broken in the minicluster

2018-03-14 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-7747:

Attachment: YARN-7747.002.patch

> YARN UI is broken in the minicluster 
> -
>
> Key: YARN-7747
> URL: https://issues.apache.org/jira/browse/YARN-7747
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Major
> Attachments: YARN-7747.001.patch, YARN-7747.002.patch
>
>
> YARN web apps use non-injected instances of GuiceFilter, i.e. instances 
> created by Jetty as opposed to instances created by Guice itself. This 
> triggers the [call 
> path|https://github.com/google/guice/blob/master/extensions/servlet/src/com/google/inject/servlet/GuiceFilter.java#L251]
>  where the static field {{pipeline}} is used instead of the instance field 
> {{injectedPipeline}}. However, besides the GuiceFilter instances created by 
> Jetty, each Guice module generates them as well. On the injection call path 
> this static variable is updated by each instance. Thus, if there are multiple 
> modules, as happens to be the case in the minicluster, the one loaded last 
> ends up defining the filter pipeline for all Jetty instances. In the 
> minicluster case this is the nodemanager UI.
>  






[jira] [Created] (YARN-7847) Provide permalinks for container logs

2018-01-29 Thread Gera Shegalov (JIRA)
Gera Shegalov created YARN-7847:
---

 Summary: Provide permalinks for container logs
 Key: YARN-7847
 URL: https://issues.apache.org/jira/browse/YARN-7847
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: amrmproxy
Reporter: Gera Shegalov


YARN doesn't offer a service similar to the AM proxy URL for container logs, even 
if log aggregation is enabled. The current mechanism of having the NM redirect to 
yarn.log.server.url fails once the node is down. Workarounds like rewriting URIs 
on the fly, as MR JobHistory does, are possible, but they do not represent a good 
long-term solution for onboarding new apps.






[jira] [Commented] (YARN-7747) YARN UI is broken in the minicluster

2018-01-15 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326684#comment-16326684
 ] 

Gera Shegalov commented on YARN-7747:
-

The TestContainerLogsPage failure is tracked in YARN-7734. The asflicense -1 is 
not caused by this patch. I can write a test for test4tests if the approach is 
accepted.

> YARN UI is broken in the minicluster 
> -
>
> Key: YARN-7747
> URL: https://issues.apache.org/jira/browse/YARN-7747
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Major
> Attachments: YARN-7747.001.patch
>
>
> YARN web apps use non-injected instances of GuiceFilter, i.e. instances 
> created by Jetty as opposed to instances created by Guice itself. This 
> triggers the [call 
> path|https://github.com/google/guice/blob/master/extensions/servlet/src/com/google/inject/servlet/GuiceFilter.java#L251]
>  where the static field {{pipeline}} is used instead of the instance field 
> {{injectedPipeline}}. However, besides the GuiceFilter instances created by 
> Jetty, each Guice module generates them as well. On the injection call path 
> this static variable is updated by each instance. Thus, if there are multiple 
> modules, as happens to be the case in the minicluster, the one loaded last 
> ends up defining the filter pipeline for all Jetty instances. In the 
> minicluster case this is the nodemanager UI.
>  






[jira] [Updated] (YARN-7747) YARN UI is broken in the minicluster

2018-01-13 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-7747:

Attachment: YARN-7747.001.patch

001 patch proposal

> YARN UI is broken in the minicluster 
> -
>
> Key: YARN-7747
> URL: https://issues.apache.org/jira/browse/YARN-7747
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-7747.001.patch
>
>
> YARN web apps use non-injected instances of GuiceFilter, i.e. instances 
> created by Jetty as opposed to instances created by Guice itself. This 
> triggers the [call 
> path|https://github.com/google/guice/blob/master/extensions/servlet/src/com/google/inject/servlet/GuiceFilter.java#L251]
>  where the static field {{pipeline}} is used instead of the instance field 
> {{injectedPipeline}}. However, besides the GuiceFilter instances created by 
> Jetty, each Guice module generates them as well. On the injection call path 
> this static variable is updated by each instance. Thus, if there are multiple 
> modules, as happens to be the case in the minicluster, the one loaded last 
> ends up defining the filter pipeline for all Jetty instances. In the 
> minicluster case this is the nodemanager UI.
>  






[jira] [Created] (YARN-7747) YARN UI is broken in the minicluster

2018-01-13 Thread Gera Shegalov (JIRA)
Gera Shegalov created YARN-7747:
---

 Summary: YARN UI is broken in the minicluster 
 Key: YARN-7747
 URL: https://issues.apache.org/jira/browse/YARN-7747
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov


YARN web apps use non-injected instances of GuiceFilter, i.e. instances created 
by Jetty as opposed to instances created by Guice itself. This triggers the [call 
path|https://github.com/google/guice/blob/master/extensions/servlet/src/com/google/inject/servlet/GuiceFilter.java#L251]
 where the static field {{pipeline}} is used instead of the instance field 
{{injectedPipeline}}. However, besides the GuiceFilter instances created by Jetty, 
each Guice module generates them as well. On the injection call path this 
static variable is updated by each instance. Thus, if there are multiple modules, 
as happens to be the case in the minicluster, the one loaded last ends up 
defining the filter pipeline for all Jetty instances. In the minicluster case 
this is the nodemanager UI.
 






[jira] [Created] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2017-12-01 Thread Gera Shegalov (JIRA)
Gera Shegalov created YARN-7592:
---

 Summary: yarn.federation.failover.enabled missing in 
yarn-default.xml
 Key: YARN-7592
 URL: https://issues.apache.org/jira/browse/YARN-7592
 Project: Hadoop YARN
  Issue Type: Bug
  Components: federation
Affects Versions: 3.0.0-beta1
Reporter: Gera Shegalov


yarn.federation.failover.enabled should be documented in yarn-default.xml. I am 
also not sure why it should be true by default and force the HA retry policy in 
{{RMProxy#createRMProxy}}.






[jira] [Updated] (YARN-1728) Workaround guice3x-undecoded pathInfo in YARN WebApp

2017-02-28 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-1728:

Summary: Workaround guice3x-undecoded pathInfo in YARN WebApp  (was: 
History server doesn't understand percent encoded paths)

> Workaround guice3x-undecoded pathInfo in YARN WebApp
> 
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha3
>
> Attachments: test-case-for-trunk.patch, YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch, 
> YARN-1728-branch-2.004.patch, YARN-1728-branch-2.005.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Whereas the URL-decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.






[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-28 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888529#comment-15888529
 ] 

Gera Shegalov commented on YARN-1728:
-

+1, committing

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: test-case-for-trunk.patch, YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch, 
> YARN-1728-branch-2.004.patch, YARN-1728-branch-2.005.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Whereas the URL-decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.






[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-27 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886729#comment-15886729
 ] 

Gera Shegalov commented on YARN-1728:
-

Minor thing: since we have this catch clause, can we add the pathInfo value and 
the stack trace to the log message?
{code}
} catch (URISyntaxException ex) {
  LOG.error(pathInfo + ": Failed to decode path.", ex);
}
{code}

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch, 
> YARN-1728-branch-2.004.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Whereas the URL-decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.






[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-27 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885826#comment-15885826
 ] 

Gera Shegalov commented on YARN-1728:
-

[~yuanbo], thanks for the latest patch. I suggested URI.create because we are 
guaranteed to get a valid pathInfo from the servlet container, but it's indeed 
good to be defensive since we are already dealing with a servlet bug.

I am generally +1.

In trunk the issue is fixed thanks to Guice 4.0/HADOOP-12064, cc: [~ozawa]. And, 
as the quote from the spec says, we must not decode twice. Therefore I suggest 
we split this patch: the test-only patch should go into both trunk and branch-2, 
so that we catch the issue in all releases, and the actual fix should go into 
branch-2.


> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch, 
> YARN-1728-branch-2.004.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Whereas the URL-decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.






[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-24 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883241#comment-15883241
 ] 

Gera Shegalov commented on YARN-1728:
-

Hi [~yuanbo], thanks for addressing the issue. I see that Guice itself [fixed 
it|https://github.com/google/guice/pull/860/files] using 
{{java.net.URI#getPath}}. Let us use it here so the behavior is consistent with 
newer Guice.

I suggest we use:

{code}
decodedPathInfo = URI.create(pathInfo).getPath();
{code}
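A quick demonstration of why {{URI#getPath}} works here; the percent-encoded pathInfo below is a made-up example in the shape of the URLs from this issue:
{code}
import java.net.URI;

public class PathInfoDecodeDemo {
  public static void main(String[] args) {
    String pathInfo =
        "/jobhistory/logs/localhost%3A8041/container_1_0001_01_01/job_1_0001/admin/stderr";
    // URI#getPath returns the percent-decoded path, matching newer Guice:
    // /jobhistory/logs/localhost:8041/container_1_0001_01_01/job_1_0001/admin/stderr
    System.out.println(URI.create(pathInfo).getPath());
  }
}
{code}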


> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Whereas the URL-decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.






[jira] [Commented] (YARN-4958) The file localization process should allow for wildcards to reduce the application footprint in the state store

2016-05-26 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302955#comment-15302955
 ] 

Gera Shegalov commented on YARN-4958:
-

Hi [~templedf], no particular comment other than that there is a workaround that 
can achieve it with what I was suggesting in HADOOP-12747, or programmatically, 
but it would be nice if it could be done in a more obvious way.

> The file localization process should allow for wildcards to reduce the 
> application footprint in the state store
> ---
>
> Key: YARN-4958
> URL: https://issues.apache.org/jira/browse/YARN-4958
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-4958.001.patch, YARN-4958.002.patch, 
> YARN-4958.003.patch
>
>
> When using the -libjars option to add classes to the classpath, every library 
> so added is explicitly listed in the {{ContainerLaunchContext}}'s local 
> resources even though they're all uploaded to the same directory in HDFS.  
> When using tools like Crunch without an uber JAR or when trying to take 
> advantage of the shared cache, the number of libraries can be quite large.  
> We've seen many cases where we had to turn down the max number of 
> applications to prevent ZK from running out of heap because of the size of 
> the state store entries.
> Rather than listing all files independently, this JIRA proposes to have the 
> NM allow wildcards in the resource localization paths.  Specifically, we 
> propose to allow a path to have a final component (name) set to "*", which is 
> interpreted by the NM as "download the full directory and link to every file 
> in it from the job's working directory."  This behavior is the same as the 
> current behavior when using -libjars, but avoids explicitly listing every 
> file.
> This JIRA does not attempt to provide more general purpose wildcards, such as 
> "\*.jar" or "file\*", as having multiple entries for a single directory 
> presents numerous logistical issues.
> This JIRA also does not attempt to integrate with the shared cache.  That 
> work will be left to a future JIRA.  Specifically, this JIRA only applies 
> when a full directory is uploaded.  Currently the shared cache does not 
> handle directory uploads.
> This JIRA proposes to allow for wildcards both in the internal processing of 
> the -libjars switch and in paths added through the {{Job}} and 
> {{DistributedCache}} classes.
> The proposed approach is to treat a path, "dir/\*", as "dir" for purposes of 
> all file verification and localization.  In the final step, the NM will query 
> the localized directory to get a list of the files in "dir" such that each 
> can be linked from the job's working directory.  Since $PWD/\* is always 
> included on the classpath, all JAR files in "dir" will be in the classpath.
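A schematic sketch of the proposed trailing-"*" interpretation, using plain java.nio rather than the NM's actual localizer code; the method name and paths are illustrative only:
{code}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class WildcardLinkSketch {
  // Treat "dir/*" as "dir": after localizing the whole directory, link every
  // file in it from the container's working directory.
  static void linkAll(Path localizedDir, Path containerWorkDir) throws IOException {
    try (DirectoryStream<Path> files = Files.newDirectoryStream(localizedDir)) {
      for (Path f : files) {
        Files.createSymbolicLink(containerWorkDir.resolve(f.getFileName()), f);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("localized");
    Files.createFile(dir.resolve("a.jar"));
    Files.createFile(dir.resolve("b.jar"));
    Path work = Files.createTempDirectory("container");
    linkAll(dir, work);
    try (DirectoryStream<Path> links = Files.newDirectoryStream(work)) {
      for (Path p : links) {
        System.out.println(p); // symlinks to a.jar and b.jar
      }
    }
  }
}
{code}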






[jira] [Commented] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env

2016-03-11 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191386#comment-15191386
 ] 

Gera Shegalov commented on YARN-4789:
-

Thanks Jason, this is indeed the case. The question is whether we can make a 
small change quickly before the more involved MAPREDUCE-6491 is committed.

> Provide helpful exception for non-PATH-like conflict with admin.user.env
> 
>
> Key: YARN-4789
> URL: https://issues.apache.org/jira/browse/YARN-4789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-4789.001.patch
>
>
> Environment variables specified in mapreduce.admin.user.env are supposed to 
> be paths (class, shell, library) and they can be merged with the 
> user-provided values. However, it's also possible that the cluster admins 
> specify some non-PATH-like variable such as JAVA_HOME. In this case, if the 
> user provides the same variable, we'll get a concatenation that does not 
> make sense and is difficult to debug.





[jira] [Commented] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env

2016-03-11 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191384#comment-15191384
 ] 

Gera Shegalov commented on YARN-4789:
-

The patch throws an exception only when both the user and the admin specify an 
environment variable that cannot be reconciled via concatenation, as happens 
with the various *PATH variables; this is the intent of option 2.

Following option 1 would replace the user env, but that seemed to me to violate 
the spirit of this conf: it is designed to preserve the admin settings while 
allowing them to be overridden by the user. That's why I thought warning both 
sides about the misconfig is the best course of action here.

> Provide helpful exception for non-PATH-like conflict with admin.user.env
> 
>
> Key: YARN-4789
> URL: https://issues.apache.org/jira/browse/YARN-4789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-4789.001.patch
>
>
> Environment variables specified in mapreduce.admin.user.env are supposed to 
> be paths (class, shell, library) and they can be merged with the 
> user-provided values. However, it's also possible that the cluster admins 
> specify some non-PATH-like variable such as JAVA_HOME. In this case, if the 
> user provides the same variable, we'll get a concatenation that does not 
> make sense and is difficult to debug.





[jira] [Commented] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env

2016-03-11 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190773#comment-15190773
 ] 

Gera Shegalov commented on YARN-4789:
-

I see the following options to deal with it:
# silently ignore/replace the user-provided value with the one in admin.env
# inform the user that the variable is provided by the cluster admins.

The 001 patch implements the latter.

> Provide helpful exception for non-PATH-like conflict with admin.user.env
> 
>
> Key: YARN-4789
> URL: https://issues.apache.org/jira/browse/YARN-4789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-4789.001.patch
>
>
> Environment variables specified in mapreduce.admin.user.env are supposed to 
> be paths (class, shell, library) and they can be merged with the 
> user-provided values. However, it's also possible that the cluster admins 
> specify some non-PATH-like variable such as JAVA_HOME. In this case, if the 
> user provides the same variable, we'll get a concatenation that does not 
> make sense and is difficult to debug.





[jira] [Updated] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env

2016-03-11 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-4789:

Attachment: YARN-4789.001.patch

> Provide helpful exception for non-PATH-like conflict with admin.user.env
> 
>
> Key: YARN-4789
> URL: https://issues.apache.org/jira/browse/YARN-4789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-4789.001.patch
>
>
> Environment variables specified in mapreduce.admin.user.env are supposed to 
> be paths (class, shell, library) and they can be merged with the 
> user-provided values. However, it's also possible that the cluster admins 
> specify some non-PATH-like variable such as JAVA_HOME. In this case, if the 
> user provides the same variable, we'll get a concatenation that does not 
> make sense and is difficult to debug.





[jira] [Created] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env

2016-03-11 Thread Gera Shegalov (JIRA)
Gera Shegalov created YARN-4789:
---

 Summary: Provide helpful exception for non-PATH-like conflict with 
admin.user.env
 Key: YARN-4789
 URL: https://issues.apache.org/jira/browse/YARN-4789
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.2
Reporter: Gera Shegalov
Assignee: Gera Shegalov


Environment variables specified in mapreduce.admin.user.env are supposed to be 
paths (class, shell, library) and they can be merged with the user-provided 
values. However, it's also possible that the cluster admins specify some 
non-PATH-like variable such as JAVA_HOME. In this case, if the user provides the 
same variable, we'll get a concatenation that does not make sense and is 
difficult to debug.
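To illustrate the failure mode, a tiny sketch with made-up values: concatenation is sensible for PATH-like variables but produces garbage for a scalar one like JAVA_HOME:
{code}
import java.io.File;

public class AdminEnvMergeSketch {
  public static void main(String[] args) {
    String adminJavaHome = "/opt/jdk-admin"; // hypothetical admin.user.env value
    String userJavaHome = "/opt/jdk-user";   // hypothetical user-provided value
    // PATH-style merging of a non-PATH variable yields a value that points
    // nowhere and is hard to debug:
    System.out.println(
        "JAVA_HOME=" + userJavaHome + File.pathSeparator + adminJavaHome);
  }
}
{code}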






[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-12-24 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071392#comment-15071392
 ] 

Gera Shegalov commented on YARN-2934:
-

+1 for YARN-2934.v2.004.patch. There is an extra space in "Error files : "; I 
took the liberty of fixing it myself.

> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, 
> YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, 
> YARN-2934.v2.001.patch, YARN-2934.v2.002.patch, YARN-2934.v2.003.patch, 
> YARN-2934.v2.004.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.





[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-12-23 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070337#comment-15070337
 ] 

Gera Shegalov commented on YARN-2934:
-

Hi [~Naganarasimha]. Thanks for updating the patch. 

The one thing we have not addressed from my previous comments is capping the 
buffer size. But I now think it's good enough, because we have a good small 
default for the tail, NM_CONTAINER_STDERR_BYTES.

Still, please rename:
{code}
-  FileStatus[] listStatus = fileSystem
+  FileStatus[] errorStatuses = fileSystem
{code}
or similar. It's an array of statuses, not the status of a list.

Let us have a space after ',' and a new line in:
{code}
-  .append(StringUtils.arrayToString(errorFileNames)).append(". ");
+  .append(StringUtils.join(", ", errorFileNames)).append(".\n");
{code}
Fix the test code accordingly.

The method verifyTailErrorLogOnContainerExit can/should be private. Same for 
the ContainerExitHandler class.

Assume.assumeTrue(Shell.LINUX);
should be 
Assume.assumeFalse(Shell.WINDOWS || Shell.OTHER);
but actually, why do we need this? The test seems to be platform-independent.

Assert.assertNotNull(exitEvent.getDiagnosticInfo());

seems redundant because the other asserts imply it already. I suggest 
LOG.info-ing the diagnostics instead, to make the test log more useful.



> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, 
> YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, 
> YARN-2934.v2.001.patch, YARN-2934.v2.002.patch, YARN-2934.v2.003.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.





[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-12-17 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061768#comment-15061768
 ] 

Gera Shegalov commented on YARN-2934:
-

-1 on manual regexes in favor of code reuse. 99.9% of YARN users will never 
change this conf. The simple globs I was suggesting already cover even 
AppMaster.stderr. 

In ContainerLaunch#getErrorLogTail, get rid of 
{code}
if (containerLogDir == null) {
  return null;
}
{code}

since containerLogDir cannot be null.


> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, 
> YARN-2934.v1.006.patch, YARN-2934.v1.007.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.





[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-12-17 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061847#comment-15061847
 ] 

Gera Shegalov commented on YARN-2934:
-

Use RawLocalFileSystem; we don't need the checksumming version: 
{{FileSystem fileSystem = FileSystem.getLocal(conf).getRaw()}} 


> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, 
> YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.





[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-12-17 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063138#comment-15063138
 ] 

Gera Shegalov commented on YARN-2934:
-

Thanks for the latest patch. Good to see the patch lose 3 KB; most of all, there 
are no more changes to the common Configuration class.

One checkstyle issue, the 80-column warning, is from the patch around:
{code}
long tailSizeInBytes = conf.getLong(
    YarnConfiguration.NM_CONTAINER_ERROR_FILE_TAIL_SIZE_IN_BYTES,
    YarnConfiguration.DEFAULT_NM_CONTAINER_ERROR_FILE_TAIL_SIZE_IN_BYTES);
{code}

Those are pretty long names. Can we do 
container.stderr.tail.bytes 
NM_CONTAINER_STDERR_BYTES
and the corresponding default? Having stderr in the name is also great for users 
to understand which error file is meant in 99% of the cases. The same goes for 
container.stderr.pattern.

Still don't see any value in this, please drop:
{code}
if (listStatus.length > 1) {
  LOG.error("Multiple files in " + containerLogDir
      + ", seems to match the error file name pattern configured. "
      + Arrays.toString(listStatus));
}
{code}

Let us not guard the tail read with
{code}
if (fileSize != 0) {
{code}
There is value in seeing already on the client side that the file is empty.

Instead of 
{code}
IOUtils.closeStream(errorFileIS) 
{code}
call cleanup so we can pass the logger:
{code}
IOUtils.cleanup(LOG, errorFileIS)
{code}

Since trunk is on JDK 7 minimum, we can drop the constant UTF_8 and use
{code}
new String(tailBytes, StandardCharsets.UTF_8)
{code}

listStatus is not an intuitive name for that variable. Maybe use errFileStatus 
for it. 

Obviously I meant tailSizeInBytes, thanks for paying attention. Agree that the 
RLFS file status toString might look too ugly.

We can use FileUtil.stat2Paths or add a loop here to extract just the last path 
component. 

Also realizing that we should have a low cap on the tail size, to prevent a 
misconfiguration from knocking out the NM with an OOM on container failures, 
since we do:
{code}
byte[] tailBytes = new byte[bufferSize];
{code}

One can easily see why I initially confused tailBytes for an int. It should be 
called something along the lines of {{tailBuffer}}.
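Pulling the suggestions in this comment together, a hedged sketch of the tail read; the method and variable names are illustrative, not the patch's actual code:
{code}
import java.nio.charset.StandardCharsets;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ErrorTailSketch {
  private static final Log LOG = LogFactory.getLog(ErrorTailSketch.class);

  static String readErrorTail(Configuration conf, Path errFile, long tailSizeInBytes)
      throws Exception {
    // Raw local FS: no checksum files, per the earlier review comment.
    FileSystem fs = FileSystem.getLocal(conf).getRaw();
    FileStatus status = fs.getFileStatus(errFile);
    long fileSize = status.getLen();
    // Cap the buffer so a misconfigured tail size cannot OOM the NM.
    int bufferSize = (int) Math.min(fileSize, tailSizeInBytes);
    byte[] tailBuffer = new byte[bufferSize];
    FSDataInputStream errorFileIS = null;
    try {
      errorFileIS = fs.open(errFile);
      errorFileIS.readFully(fileSize - bufferSize, tailBuffer);
      return new String(tailBuffer, StandardCharsets.UTF_8);
    } finally {
      IOUtils.cleanup(LOG, errorFileIS); // close and log via the passed logger
    }
  }

  public static void main(String[] args) throws Exception {
    java.io.File f = java.io.File.createTempFile("stderr", ".log");
    java.nio.file.Files.write(f.toPath(),
        "bash: /no/jvm/here/bin/java: No such file or directory\n"
            .getBytes(StandardCharsets.UTF_8));
    System.out.print(readErrorTail(new Configuration(), new Path(f.toURI()), 4096));
  }
}
{code}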






> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, 
> YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, 
> YARN-2934.v2.001.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.





[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-12-17 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061908#comment-15061908
 ] 

Gera Shegalov commented on YARN-2934:
-

That should go into the exception message:
{code}
} else if (listStatus.length > 1) {
  LOG.warn("Multiple files in " + containerLogDir
      + ", seems to match the error file name pattern configured ");
}
{code}

Don't do branching; pass the StringBuilder diagnosticInfo and do something like:
{code}
diagnosticInfo
    .append("Error files: ")
    .append(Arrays.toString(listStatus))
    .append("\n")
    .append("Last ").append(tailBytes).append(" bytes of ").append(listStatus[0])
    .append(new String(tailBytes, UTF_8));
{code}

> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, 
> YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.





[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-12-17 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063217#comment-15063217
 ] 

Gera Shegalov commented on YARN-2934:
-

the message looks good to me.

> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, 
> YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, 
> YARN-2934.v2.001.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.





[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-12-17 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063227#comment-15063227
 ] 

Gera Shegalov commented on YARN-2934:
-

Regarding my comment about the user, I meant the YARN app user. The users don't 
look at the NM logs; they look at the exceptions in the web UI and on the client 
side. If the exception says 

{code}
Container exited with a non-zero exit code 127. Error file(s): [error.log, 
stderr.1, stderr.2]
Last 4096 bytes of error.log :
/bin/bash: /no/jvm/here/bin/java: No such file or directory
{code}

the user will know to also check those other files.

> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, 
> YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, 
> YARN-2934.v2.001.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.





[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-12-17 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063486#comment-15063486
 ] 

Gera Shegalov commented on YARN-2934:
-

Minor repetition is not a big deal, IMO. 

The reason I thought of printing file statuses is that you see the file size, 
which brings us to the following point in the fanciness area: right now we are 
blindly grabbing file 0. It would, however, make much more sense to grab the 
most recent (highest mtime) non-empty file.

> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, 
> YARN-2934.v1.006.patch, YARN-2934.v1.007.patch, YARN-2934.v1.008.patch, 
> YARN-2934.v2.001.patch, YARN-2934.v2.002.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.







[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-12-16 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059943#comment-15059943
 ] 

Gera Shegalov commented on YARN-2934:
-

Hi [~Naganarasimha],
Please make sure that the patch does not introduce new problems. Both 
checkstyle and findbugs report problems related to the patch. Check the Hadoop 
QA comment above. Keep addressing newly introduced issues without waiting for 
a review; it simplifies the review process. 

I suggest using globs instead of regexes, so you can simply call 
FileSystem#globStatus. The path pattern could be something like 
{code}{*stderr*,*STDERR*}{code} or maybe {code}{*err,*ERR,*out,*OUT}{code}. I'd 
rather have a longer config value than add more code to make patterns 
case-insensitive. In practice we mostly need stderr.
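
For illustration, a minimal glob-based sketch ({{containerLogDir}} and {{fs}} 
are assumed to be in scope):
{code}
Path errFilePattern = new Path(containerLogDir, "{*stderr*,*STDERR*}");
// one globStatus call replaces the regex scan; the result is
// null or empty when nothing qualifies
FileStatus[] matches = fs.globStatus(errFilePattern);
{code}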

Not sure how fancy we need to be with the case where multiple log files qualify 
for the pattern, but maybe at least mention to the user that there are more 
files to look at. 

In general, don't try to optimize for the failure case. Things like
{code}
private static long tailSizeInBytes = -1;
{code}
look like a bug. Simply get it from conf exactly when it's needed.


> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch, 
> YARN-2934.v1.006.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-12-09 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048280#comment-15048280
 ] 

Gera Shegalov commented on YARN-2934:
-

Thanks [~Naganarasimha]! I skimmed the patch, it is in pretty good shape. 
Aiming to give you more detailed feedback over the next few days.

> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-683) Class MiniYARNCluster not found when starting the minicluster

2015-08-15 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov resolved YARN-683.

Resolution: Duplicate

Closing as a dup because HADOOP-9891 now documents this workaround.

 Class MiniYARNCluster not found when starting the minicluster
 -

 Key: YARN-683
 URL: https://issues.apache.org/jira/browse/YARN-683
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.0.4-alpha
 Environment: MacOSX 10.8.3 - Java 1.6.0_45
Reporter: Rémy SAISSY

 Starting the minicluster with the following command line:
 bin/hadoop jar 
 share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.4-alpha-tests.jar
  minicluster -format
 Fails for MiniYARNCluster with the following error:
 13/05/14 17:06:58 INFO hdfs.MiniDFSCluster: Cluster is active
 13/05/14 17:06:58 INFO mapreduce.MiniHadoopClusterManager: Started 
 MiniDFSCluster -- namenode on port 55205
 java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/server/MiniYARNCluster
   at 
 org.apache.hadoop.mapreduce.MiniHadoopClusterManager.start(MiniHadoopClusterManager.java:170)
   at 
 org.apache.hadoop.mapreduce.MiniHadoopClusterManager.run(MiniHadoopClusterManager.java:129)
   at 
 org.apache.hadoop.mapreduce.MiniHadoopClusterManager.main(MiniHadoopClusterManager.java:314)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
   at 
 org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:115)
   at 
 org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:123)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.hadoop.yarn.server.MiniYARNCluster
   at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
   ... 16 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-07-13 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625060#comment-14625060
 ] 

Gera Shegalov commented on YARN-2934:
-

Hi [~Naganarasimha], yes I was thinking the same, we should try to do it in the 
Java land. I'd prefer using RawLocalFileSystem#read(buf, off, len) in order not 
to mix in the java.io API. Since the NM webUI can read logs, we should have no 
problems accessing them from the NM JVM.
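
A rough sketch of such a tail through the FileSystem API ({{fs}}, 
{{stderrPath}}, and {{tailSizeBytes}} are illustrative names, not from the 
patch):
{code}
long fileLen = fs.getFileStatus(stderrPath).getLen();
int toRead = (int) Math.min(tailSizeBytes, fileLen);
byte[] tailBuf = new byte[toRead];
try (FSDataInputStream in = fs.open(stderrPath)) {
  // positioned read of the last toRead bytes, no java.io mixing
  in.readFully(fileLen - toRead, tailBuf);
}
String tail = new String(tailBuf, StandardCharsets.UTF_8);
{code}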

 Improve handling of container's stderr 
 ---

 Key: YARN-2934
 URL: https://issues.apache.org/jira/browse/YARN-2934
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Gera Shegalov
Assignee: Naganarasimha G R
Priority: Critical

 Most YARN applications redirect stderr to some file. That's why when 
 container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Moved] (YARN-3917) getResourceCalculatorPlugin for the default should intercept all excpetions

2015-07-11 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov moved HADOOP-1 to YARN-3917:
--

Affects Version/s: (was: 2.8.0)
   2.8.0
 Target Version/s: 2.8.0  (was: 2.8.0)
  Key: YARN-3917  (was: HADOOP-1)
  Project: Hadoop YARN  (was: Hadoop Common)

 getResourceCalculatorPlugin for the default should intercept all excpetions
 ---

 Key: YARN-3917
 URL: https://issues.apache.org/jira/browse/YARN-3917
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: HADOOP-1.001.patch


 Since the user has not configured a specific plugin, any problems with the 
 default resource calculator instantiation should be ignored.
 {code}
 2015-07-10 08:16:18,445 INFO org.apache.hadoop.service.AbstractService: 
 Service containers-monitor failed in state INITED; cause: 
 java.lang.UnsupportedOperationException: Could not determine OS
 java.lang.UnsupportedOperationException: Could not determine OS
 at org.apache.hadoop.util.SysInfo.newInstance(SysInfo.java:43)
 at 
 org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.init(ResourceCalculatorPlugin.java:37)
 at 
 org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getResourceCalculatorPlugin(ResourceCalculatorPlugin.java:160)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.serviceInit(ContainersMonitorImpl.java:108)
 at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:249)
 at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:312)
 at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:547)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:595)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3917) getResourceCalculatorPlugin for the default should intercept all excpetions

2015-07-11 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623663#comment-14623663
 ] 

Gera Shegalov commented on YARN-3917:
-

Thanks [~chris.douglas] for the review. Moved the JIRA to YARN because 
{{ResourceCalculatorPlugin.java}} is in hadoop-yarn-common.

 getResourceCalculatorPlugin for the default should intercept all excpetions
 ---

 Key: YARN-3917
 URL: https://issues.apache.org/jira/browse/YARN-3917
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: HADOOP-1.001.patch


 Since the user has not configured a specific plugin, any problems with the 
 default resource calculator instantiation should be ignored.
 {code}
 2015-07-10 08:16:18,445 INFO org.apache.hadoop.service.AbstractService: 
 Service containers-monitor failed in state INITED; cause: 
 java.lang.UnsupportedOperationException: Could not determine OS
 java.lang.UnsupportedOperationException: Could not determine OS
 at org.apache.hadoop.util.SysInfo.newInstance(SysInfo.java:43)
 at 
 org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.init(ResourceCalculatorPlugin.java:37)
 at 
 org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getResourceCalculatorPlugin(ResourceCalculatorPlugin.java:160)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.serviceInit(ContainersMonitorImpl.java:108)
 at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:249)
 at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:312)
 at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:547)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:595)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3917) getResourceCalculatorPlugin for the default should intercept all exceptions

2015-07-11 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-3917:

Summary: getResourceCalculatorPlugin for the default should intercept all 
exceptions  (was: getResourceCalculatorPlugin for the default should intercept 
all excpetions)

 getResourceCalculatorPlugin for the default should intercept all exceptions
 ---

 Key: YARN-3917
 URL: https://issues.apache.org/jira/browse/YARN-3917
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: HADOOP-1.001.patch


 Since the user has not configured a specific plugin, any problems with the 
 default resource calculator instantiation should be ignored.
 {code}
 2015-07-10 08:16:18,445 INFO org.apache.hadoop.service.AbstractService: 
 Service containers-monitor failed in state INITED; cause: 
 java.lang.UnsupportedOperationException: Could not determine OS
 java.lang.UnsupportedOperationException: Could not determine OS
 at org.apache.hadoop.util.SysInfo.newInstance(SysInfo.java:43)
 at 
 org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.init(ResourceCalculatorPlugin.java:37)
 at 
 org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getResourceCalculatorPlugin(ResourceCalculatorPlugin.java:160)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.serviceInit(ContainersMonitorImpl.java:108)
 at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:249)
 at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:312)
 at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:547)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:595)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3768) ArrayIndexOutOfBoundsException with empty environment variables

2015-06-30 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-3768:

Description: 
Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range 
exception occurs if an environment variable is encountered without a value.

{code}
java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.hadoop.yarn.util.Apps.setEnvFromInputString(Apps.java:80)
{code}

I believe this occurs because java will not return empty strings from the split 
method. Similar to this 
http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values

  was:
Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range 
exception occurs if an environment variable is encountered without a value.

I believe this occurs because java will not return empty strings from the split 
method. Similar to this 
http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values


 ArrayIndexOutOfBoundsException with empty environment variables
 ---

 Key: YARN-3768
 URL: https://issues.apache.org/jira/browse/YARN-3768
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.5.0
Reporter: Joe Ferner
Assignee: zhihai xu
 Attachments: YARN-3768.000.patch, YARN-3768.001.patch, 
 YARN-3768.002.patch, YARN-3768.003.patch, YARN-3768.004.patch


 Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range 
 exception occurs if an environment variable is encountered without a value.
 {code}
 java.lang.ArrayIndexOutOfBoundsException: 1
   at org.apache.hadoop.yarn.util.Apps.setEnvFromInputString(Apps.java:80)
 {code}
 I believe this occurs because java will not return empty strings from the 
 split method. Similar to this 
 http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3768) ArrayIndexOutOfBoundsException with empty environment variables

2015-06-30 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-3768:

Summary: ArrayIndexOutOfBoundsException with empty environment variables  
(was: Index out of range exception with environment variables without values)

 ArrayIndexOutOfBoundsException with empty environment variables
 ---

 Key: YARN-3768
 URL: https://issues.apache.org/jira/browse/YARN-3768
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.5.0
Reporter: Joe Ferner
Assignee: zhihai xu
 Attachments: YARN-3768.000.patch, YARN-3768.001.patch, 
 YARN-3768.002.patch, YARN-3768.003.patch, YARN-3768.004.patch


 Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range 
 exception occurs if an environment variable is encountered without a value.
 I believe this occurs because java will not return empty strings from the 
 split method. Similar to this 
 http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3768) Index out of range exception with environment variables without values

2015-06-29 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-3768:

Attachment: YARN-3768.003.patch

Thanks for reviewing the patch, [~zxu]! 

bq. If the input is a=b=c, it saves Env variable a with value b. Is it 
correct? 
Correct, and I agree it does not look like the behavior we want. I think the 
right behavior is to accept any value between the first {{=}} and the next 
{{,}}. The value should be {{b=c}} in your example.

bq. I also noticed the patch will discard Env Variable with empty string value. 
I am ok with it.
I think it might sometimes be desirable to clear a variable that is set 
globally. Let us allow it.

003 patch attached!

 Index out of range exception with environment variables without values
 --

 Key: YARN-3768
 URL: https://issues.apache.org/jira/browse/YARN-3768
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.5.0
Reporter: Joe Ferner
Assignee: zhihai xu
 Attachments: YARN-3768.000.patch, YARN-3768.001.patch, 
 YARN-3768.002.patch, YARN-3768.003.patch


 Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range 
 exception occurs if an environment variable is encountered without a value.
 I believe this occurs because java will not return empty strings from the 
 split method. Similar to this 
 http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3768) Index out of range exception with environment variables without values

2015-06-29 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606960#comment-14606960
 ] 

Gera Shegalov commented on YARN-3768:
-

Good catch, [~zxu]. I rushed on the way home and forgot to regenerate the patch 
with the {{*}} change after making it locally. 

+1 pending Jenkins

 Index out of range exception with environment variables without values
 --

 Key: YARN-3768
 URL: https://issues.apache.org/jira/browse/YARN-3768
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.5.0
Reporter: Joe Ferner
Assignee: zhihai xu
 Attachments: YARN-3768.000.patch, YARN-3768.001.patch, 
 YARN-3768.002.patch, YARN-3768.003.patch, YARN-3768.004.patch


 Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range 
 exception occurs if an environment variable is encountered without a value.
 I believe this occurs because java will not return empty strings from the 
 split method. Similar to this 
 http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3768) Index out of range exception with environment variables without values

2015-06-27 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-3768:

Attachment: YARN-3768.002.patch

You are right [~zxu], and I actually meant to combine matching k=v pairs and 
capturing k and v in one shot.

 Index out of range exception with environment variables without values
 --

 Key: YARN-3768
 URL: https://issues.apache.org/jira/browse/YARN-3768
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.5.0
Reporter: Joe Ferner
Assignee: zhihai xu
 Attachments: YARN-3768.000.patch, YARN-3768.001.patch, 
 YARN-3768.002.patch


 Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range 
 exception occurs if an environment variable is encountered without a value.
 I believe this occurs because java will not return empty strings from the 
 split method. Similar to this 
 http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3768) Index out of range exception with environment variables without values

2015-06-27 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604372#comment-14604372
 ] 

Gera Shegalov commented on YARN-3768:
-

002 attached, with this idea and proper name validation.

 Index out of range exception with environment variables without values
 --

 Key: YARN-3768
 URL: https://issues.apache.org/jira/browse/YARN-3768
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.5.0
Reporter: Joe Ferner
Assignee: zhihai xu
 Attachments: YARN-3768.000.patch, YARN-3768.001.patch, 
 YARN-3768.002.patch


 Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range 
 exception occurs if an environment variable is encountered without a value.
 I believe this occurs because java will not return empty strings from the 
 split method. Similar to this 
 http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3768) Index out of range exception with environment variables without values

2015-06-22 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596443#comment-14596443
 ] 

Gera Shegalov commented on YARN-3768:
-

Instead of executing two regexes (first directly via {{Pattern p = 
Pattern.compile(Shell.getEnvironmentVariableRegex())}} and then via split), can 
we simply match via a single regex? We can use a capture group to get the 
value.
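
An illustrative sketch of the one-regex idea (not the committed patch; the 
pattern here is made up for the example):
{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// capture name and value in a single pass so empty values cannot
// trip up a separate split() step
Pattern kvPattern = Pattern.compile("([A-Za-z_][A-Za-z0-9_]*)=([^,]*)");
Matcher m = kvPattern.matcher("A=1,B=,C=x=y");
while (m.find()) {
  // prints A='1', then B='', then C='x=y'
  System.out.println(m.group(1) + "='" + m.group(2) + "'");
}
{code}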

 Index out of range exception with environment variables without values
 --

 Key: YARN-3768
 URL: https://issues.apache.org/jira/browse/YARN-3768
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.5.0
Reporter: Joe Ferner
Assignee: zhihai xu
 Attachments: YARN-3768.000.patch, YARN-3768.001.patch


 Looking at line 80 of org.apache.hadoop.yarn.util.Apps an index out of range 
 exception occurs if an environment variable is encountered without a value.
 I believe this occurs because java will not return empty strings from the 
 split method. Similar to this 
 http://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream

2015-05-01 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524159#comment-14524159
 ] 

Gera Shegalov commented on YARN-2893:
-

Thanks for updating the patch [~zxu]. I verified with HADOOP-11889, which 
ignores imports, that the long import line is the only remaining complaint, and 
thus a non-issue. +1 for 005

 AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
 --

 Key: YARN-2893
 URL: https://issues.apache.org/jira/browse/YARN-2893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: zhihai xu
 Attachments: YARN-2893.000.patch, YARN-2893.001.patch, 
 YARN-2893.002.patch, YARN-2893.003.patch, YARN-2893.004.patch, 
 YARN-2893.005.patch


 MapReduce jobs on our clusters experience sporadic failures due to corrupt 
 tokens in the AM launch context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3568) TestAMRMTokens should use some random port

2015-05-01 Thread Gera Shegalov (JIRA)
Gera Shegalov created YARN-3568:
---

 Summary: TestAMRMTokens should use some random port
 Key: YARN-3568
 URL: https://issues.apache.org/jira/browse/YARN-3568
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Gera Shegalov


Since the default port is used for yarn.resourcemanager.scheduler.address, if 
we already run a pseudo-distributed cluster on the same development machine, 
the test fails like this:
{code}
testMasterKeyRollOver[0](org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens)
  Time elapsed: 1.511 sec  <<< ERROR!
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: 
Problem binding to [0.0.0.0:8030] java.net.BindException: Address already in 
use; For more details see:  http://wiki.apache.org/hadoop/BindException
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.apache.hadoop.ipc.Server.bind(Server.java:413)
at org.apache.hadoop.ipc.Server$Listener.init(Server.java:590)
at org.apache.hadoop.ipc.Server.init(Server.java:2340)
at org.apache.hadoop.ipc.RPC$Server.init(RPC.java:945)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.init(ProtobufRpcEngine.java:534)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:509)
at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:787)
at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
at 
org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.serviceStart(ApplicationMasterService.java:140)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:586)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:996)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1037)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1033)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1033)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1073)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens.testMasterKeyRollOver(TestAMRMTokens.java:235)
{code}
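
One possible fix direction, sketched for illustration only (not an attached 
patch): let the OS pick a free ephemeral port for the scheduler address.
{code}
Configuration conf = new YarnConfiguration();
// port 0 asks the kernel for any free port, so a pseudo-distributed
// RM on the same machine cannot collide with the test
conf.set(YarnConfiguration.RM_SCHEDULER_ADDRESS, "0.0.0.0:0");
{code}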



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream

2015-04-29 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520879#comment-14520879
 ] 

Gera Shegalov commented on YARN-2893:
-

Hi [~zxu], thanks for updating the patch. I believe the remaining checkstyle 
violation comes from the double indentation in the catch block:
{code}
+} catch (Exception e) {
+    LOG.warn("Unable to parse credentials.", e);
+    // Sending APP_REJECTED is fine, since we assume that the
+    // RMApp is in NEW state and thus we haven't yet informed the
+    // scheduler about the existence of the application
+    assert application.getState() == RMAppState.NEW;
{code}
It will go away once you use 2-space instead of the 4-space indentation that 
arose because you moved code around.


 AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
 --

 Key: YARN-2893
 URL: https://issues.apache.org/jira/browse/YARN-2893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: zhihai xu
 Attachments: YARN-2893.000.patch, YARN-2893.001.patch, 
 YARN-2893.002.patch, YARN-2893.003.patch, YARN-2893.004.patch


 MapReduce jobs on our clusters experience sporadic failures due to corrupt 
 tokens in the AM launch context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream

2015-04-27 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513583#comment-14513583
 ] 

Gera Shegalov commented on YARN-2893:
-

Thanks for the 003 patch, [~zxu]! I agree that validating credentials in either 
case is a good idea. LGTM. Nit: can you take care of the 80-column violations 
in your test methods?

 AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
 --

 Key: YARN-2893
 URL: https://issues.apache.org/jira/browse/YARN-2893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: zhihai xu
 Attachments: YARN-2893.000.patch, YARN-2893.001.patch, 
 YARN-2893.002.patch, YARN-2893.003.patch


 MapReduce jobs on our clusters experience sporadic failures due to corrupt 
 tokens in the AM launch context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.

2015-04-27 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514536#comment-14514536
 ] 

Gera Shegalov commented on YARN-3491:
-

Agreed, reducing the number of system calls is a good idea. Using JNI instead 
of ls can be handled with a separate JIRA.

 PublicLocalizer#addResource is too slow.
 

 Key: YARN-3491
 URL: https://issues.apache.org/jira/browse/YARN-3491
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Attachments: YARN-3491.000.patch, YARN-3491.001.patch, 
 YARN-3491.002.patch


 Based on the profiling, The bottleneck in PublicLocalizer#addResource is 
 getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir.
 checkLocalDir is very slow which takes about 10+ ms.
 The total delay will be approximately number of local dirs * 10+ ms.
 This delay will be added for each public resource localization.
 Because PublicLocalizer#addResource is slow, the thread pool can't be fully 
 utilized. Instead of doing public resource localization in 
 parallel(multithreading), public resource localization is serialized most of 
 the time.
 And also PublicLocalizer#addResource is running in Dispatcher thread, 
 So the Dispatcher thread will be blocked by PublicLocalizer#addResource for 
 long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner kills localizer before localizing all resources

2015-04-26 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513247#comment-14513247
 ] 

Gera Shegalov commented on YARN-3464:
-

We might need to tweak the checkstyle rules. There are a bunch of 
80-column-limit violations that seem to come from the import statements.

 Race condition in LocalizerRunner kills localizer before localizing all 
 resources
 -

 Key: YARN-3464
 URL: https://issues.apache.org/jira/browse/YARN-3464
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Fix For: 2.8.0

 Attachments: YARN-3464.000.patch, YARN-3464.001.patch


 Race condition in LocalizerRunner causes container localization timeout.
 Currently LocalizerRunner will kill the ContainerLocalizer when pending list 
 for LocalizerResourceRequestEvent is empty.
 {code}
   } else if (pending.isEmpty()) {
 action = LocalizerAction.DIE;
   }
 {code}
 If a LocalizerResourceRequestEvent is added after LocalizerRunner kill the 
 ContainerLocalizer due to empty pending list, this 
 LocalizerResourceRequestEvent will never be handled.
 Without ContainerLocalizer, LocalizerRunner#update will never be called.
 The container will stay at LOCALIZING state, until the container is killed by 
 AM due to TASK_TIMEOUT.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.

2015-04-26 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513509#comment-14513509
 ] 

Gera Shegalov commented on YARN-3491:
-

We should switch to {{io.nativeio.NativeIO.POSIX#getFstat}} as the 
implementation in {{RawLocalFileSystem}} to get rid of the shell-based 
implementation of FileStatus.
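
A rough sketch of what that looks like (assuming the file is already open; 
{{localFile}} is an illustrative name):
{code}
FileInputStream fis = new FileInputStream(localFile);
try {
  // getFstat goes through JNI instead of forking a shell to stat the file
  NativeIO.POSIX.Stat stat = NativeIO.POSIX.getFstat(fis.getFD());
  // stat.getOwner(), stat.getGroup(), stat.getMode() replace the ls output
} finally {
  fis.close();
}
{code}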

 PublicLocalizer#addResource is too slow.
 

 Key: YARN-3491
 URL: https://issues.apache.org/jira/browse/YARN-3491
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Attachments: YARN-3491.000.patch, YARN-3491.001.patch, 
 YARN-3491.002.patch


 Based on the profiling, The bottleneck in PublicLocalizer#addResource is 
 getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir.
 checkLocalDir is very slow which takes about 10+ ms.
 The total delay will be approximately number of local dirs * 10+ ms.
 This delay will be added for each public resource localization.
 Because PublicLocalizer#addResource is slow, the thread pool can't be fully 
 utilized. Instead of doing public resource localization in 
 parallel(multithreading), public resource localization is serialized most of 
 the time.
 And also PublicLocalizer#addResource is running in Dispatcher thread, 
 So the Dispatcher thread will be blocked by PublicLocalizer#addResource for 
 long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream

2015-04-23 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510163#comment-14510163
 ] 

Gera Shegalov commented on YARN-2893:
-

Hi [~zxu], for me personally it's easier to review if you simply make the 
change and upload a new patch. The additional benefit is that we'll hopefully 
see whether our assumptions are validated by unit tests.

 AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
 --

 Key: YARN-2893
 URL: https://issues.apache.org/jira/browse/YARN-2893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: zhihai xu
 Attachments: YARN-2893.000.patch, YARN-2893.001.patch, 
 YARN-2893.002.patch


 MapReduce jobs on our clusters experience sporadic failures due to corrupt 
 tokens in the AM launch context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream

2015-04-03 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395255#comment-14395255
 ] 

Gera Shegalov commented on YARN-2893:
-

Thanks [~zxu] for the patch, and apologies for the delay. I skimmed over the 
patch, and it looks good overall.

Can you keep your logic in {{RMAppManager#submitApplication}} with 
parseCredentials, but put it back under {{if 
(UserGroupInformation.isSecurityEnabled())}}?
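
Something along these lines (a sketch of the suggested shape only; the 
rejection path is a placeholder, not the attached patch):
{code}
if (UserGroupInformation.isSecurityEnabled()) {
  try {
    credentials = parseCredentials(submissionContext);
  } catch (Exception e) {
    // reject the app early instead of letting the AMLauncher
    // hit the EOFException later
    throw RPCUtil.getRemoteException(e);
  }
}
{code}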

 AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
 --

 Key: YARN-2893
 URL: https://issues.apache.org/jira/browse/YARN-2893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: zhihai xu
 Attachments: YARN-2893.000.patch, YARN-2893.001.patch, 
 YARN-2893.002.patch


 MapReduce jobs on our clusters experience sporadic failures due to corrupt 
 tokens in the AM launch context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream

2015-03-03 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345950#comment-14345950
 ] 

Gera Shegalov commented on YARN-2893:
-

Hi [~zxu], it's great that you're making progress on this JIRA. Any chance you 
can capture the failure scenarios in some unit test, so we can relate them 
better to the real failures we are seeing?

 AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
 --

 Key: YARN-2893
 URL: https://issues.apache.org/jira/browse/YARN-2893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: zhihai xu
 Attachments: YARN-2893.000.patch


 MapReduce jobs on our clusters experience sporadic failures due to corrupt 
 tokens in the AM launch context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream

2015-03-03 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346226#comment-14346226
 ] 

Gera Shegalov commented on YARN-2893:
-

bq. Also I find out a cascading patch to fix the credentials corruption at the 
jobClient. 
https://github.com/Cascading/cascading/commit/45b33bb864172486ac43782a4d13329312d01c0e
I scanned all reports collected over the last months, and the current cluster 
logs. I can confirm all affected jobs were the ones that still had a Cascading 
2.5.4-based dependency. Thanks a lot for pointing it out, [~zxu]!

 AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
 --

 Key: YARN-2893
 URL: https://issues.apache.org/jira/browse/YARN-2893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: zhihai xu
 Attachments: YARN-2893.000.patch


 MapReduce jobs on our clusters experience sporadic failures due to corrupt 
 tokens in the AM launch context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-01-07 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268767#comment-14268767
 ] 

Gera Shegalov commented on YARN-2934:
-

bq. Given this, even the tailed stderr is not useful in such a situation. If 
the app-page ages out, where will the user see this additional diagnostic 
message that we tail out of logs?

It will be in the client output that I showed in the above comments. In our 
infrastructure, a failed job will generate an alert email containing the client 
log (or link to it).


 Improve handling of container's stderr 
 ---

 Key: YARN-2934
 URL: https://issues.apache.org/jira/browse/YARN-2934
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Gera Shegalov
Assignee: Naganarasimha G R
Priority: Critical

 Most YARN applications redirect stderr to some file. That's why when 
 container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream

2015-01-07 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268356#comment-14268356
 ] 

Gera Shegalov commented on YARN-2893:
-

Is there a significant fraction of other types of jobs on your clusters?

 AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
 --

 Key: YARN-2893
 URL: https://issues.apache.org/jira/browse/YARN-2893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov

 MapReduce jobs on our clusters experience sporadic failures due to corrupt 
 tokens in the AM launch context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream

2015-01-07 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268334#comment-14268334
 ] 

Gera Shegalov commented on YARN-2893:
-

Hi [~ajsquared], what type of jobs are you seeing this with? I think almost all 
failures for us are Scalding/Cascading jobs, which made me think that it has to 
do with their multithreaded job submission code.

 AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
 --

 Key: YARN-2893
 URL: https://issues.apache.org/jira/browse/YARN-2893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov

 MapReduce jobs on our clusters experience sporadic failures due to corrupt 
 tokens in the AM launch context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-01-07 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268712#comment-14268712
 ] 

Gera Shegalov commented on YARN-2934:
-

Yes it's related, but not exclusive to the AM (try 
-Dmapreduce.map.env=JAVA_HOME=/no/jvm/here). It's just more severe with the AM. 
cat is not the point; getting the real diagnostics with something is. +1 for 
using tail. The pointer to the tracking page can be of little value on a busy 
cluster: the RMApp is likely to age out by the time the user gets to look at 
it, and there is no JHS entry because the AM crashed. It would be better to 
mention the nodeAddress as well, in addition to the containerId, to be used 
with 'yarn logs'.

 Improve handling of container's stderr 
 ---

 Key: YARN-2934
 URL: https://issues.apache.org/jira/browse/YARN-2934
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Gera Shegalov
Assignee: Naganarasimha G R
Priority: Critical

 Most YARN applications redirect stderr to some file. That's why when 
 container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2934) Improve handling of container's stderr

2015-01-07 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-2934:

Priority: Critical  (was: Major)

 Improve handling of container's stderr 
 ---

 Key: YARN-2934
 URL: https://issues.apache.org/jira/browse/YARN-2934
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Gera Shegalov
Assignee: Naganarasimha G R
Priority: Critical

 Most YARN applications redirect stderr to some file. That's why when 
 container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-01-06 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267377#comment-14267377
 ] 

Gera Shegalov commented on YARN-2934:
-

ContainerLaunchContext is meant for a ContainerExecutor in general. In the LCE 
case the logs may only be readable by the app user. To make it robust we can 
simply append the catting to the supplied command that runs in the executor. 
This is the hacky version of it, disregarding OS diversity. It presumes that 
the stderr log has stderr in the file name.

{code}
$ git diff
diff --git 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
 
b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
index a87238d..8ea2560 100644
--- 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
+++ 
b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
@@ -190,6 +190,11 @@ public Integer call() {
 // TODO: Should we instead work via symlinks without this grammar?
 newCmds.add(expandEnvironment(str, containerLogDir));
   }
+  newCmds.add("||");
+  newCmds.add("(cat " + containerLogDir + "/*stderr* 1>&2");
+  newCmds.add(";");
+  newCmds.add("exit -1)");
+
   launchContext.setCommands(newCmds);
 
    Map<String, String> environment = launchContext.getEnvironment();
{code}

Then we get the desired effect:
{code}
]$ hadoop org.apache.hadoop.mapreduce.SleepJob 
-Dyarn.app.mapreduce.am.env=JAVA_HOME=/no/jvm/here -m 10 
15/01/06 23:36:13 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/01/06 23:36:14 INFO client.RMProxy: Connecting to ResourceManager at 
localhost/127.0.0.1:8032
15/01/06 23:36:15 INFO mapreduce.JobSubmitter: number of splits:10
15/01/06 23:36:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
job_1420533216615_0013
15/01/06 23:36:16 INFO impl.YarnClientImpl: Submitted application 
application_1420533216615_0013
15/01/06 23:36:16 INFO mapreduce.Job: The url to track the job: 
http://localhost:8088/proxy/application_1420533216615_0013/
15/01/06 23:36:16 INFO mapreduce.Job: Running job: job_1420533216615_0013
15/01/06 23:36:21 INFO mapreduce.Job: Job job_1420533216615_0013 running in 
uber mode : false
15/01/06 23:36:21 INFO mapreduce.Job:  map 0% reduce 0%
15/01/06 23:36:21 INFO mapreduce.Job: Job job_1420533216615_0013 failed with 
state FAILED due to: Application application_1420533216615_0013 failed 2 times 
due to AM Container for appattempt_1420533216615_0013_02 exited with  
exitCode: 255
For more detailed output, check application tracking 
page:http://localhost:8088/proxy/application_1420533216615_0013/Then, click on 
links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1420533216615_0013_02_01
Exit code: 255
Exception message: /bin/bash: /no/jvm/here/bin/java: No such file or directory

Stack trace: ExitCodeException exitCode=255: /bin/bash: /no/jvm/here/bin/java: 
No such file or directory

at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:307)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

 Improve handling of container's stderr 
 ---

 Key: YARN-2934
 URL: https://issues.apache.org/jira/browse/YARN-2934
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Gera Shegalov
Assignee: Naganarasimha G R

 Most YARN applications redirect stderr to some file. That's why when 
 container launch fails with {{ExitCodeException}} the message is empty.



--
This message 

[jira] [Commented] (YARN-2745) Extend YARN to support multi-resource packing of tasks

2014-12-15 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247515#comment-14247515
 ] 

Gera Shegalov commented on YARN-2745:
-

Thanks for filing this JIRA, [~rgrandl]! We have a number of use cases where we 
need to schedule by NW bandwidth instead of memory/cores.

 Extend YARN to support multi-resource packing of tasks
 --

 Key: YARN-2745
 URL: https://issues.apache.org/jira/browse/YARN-2745
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager, scheduler
Reporter: Robert Grandl
 Attachments: sigcomm_14_tetris_talk.pptx, tetris_design_doc.docx, 
 tetris_paper.pdf


 In this umbrella JIRA we propose an extension to existing scheduling 
 techniques, which accounts for all resources used by a task (CPU, memory, 
 disk, network) and it is able to achieve three competing objectives: 
 fairness, improve cluster utilization and reduces average job completion time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2934) Improve handling of container's stderr

2014-12-09 Thread Gera Shegalov (JIRA)
Gera Shegalov created YARN-2934:
---

 Summary: Improve handling of container's stderr 
 Key: YARN-2934
 URL: https://issues.apache.org/jira/browse/YARN-2934
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Gera Shegalov


Most YARN applications redirect stderr to some file. That's why when container 
launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2014-12-09 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239112#comment-14239112
 ] 

Gera Shegalov commented on YARN-2934:
-

We need to make sure that the stderr location is made known in the container 
launch context, such that the wrapper script can cat it to its stderr and it 
can be consumed by {{Shell}}.

 Improve handling of container's stderr 
 ---

 Key: YARN-2934
 URL: https://issues.apache.org/jira/browse/YARN-2934
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Gera Shegalov

 Most YARN applications redirect stderr to some file. That's why when 
 container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2014-12-09 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239779#comment-14239779
 ] 

Gera Shegalov commented on YARN-2934:
-

Hi [~Naganarasimha], yes that's what I meant. Maybe this is specific to the 
{{DefaultContainerExecutor}}.
When testing on my macbook:
{code}
$ hadoop org.apache.hadoop.mapreduce.SleepJob 
-Dyarn.app.mapreduce.am.env=JAVA_HOME=/no/jvm/here -m 1
{code}

All you get: 
{code}
2014-12-09 09:15:00,252 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception 
from container-launch with container ID: container_1418144997824_0001_01_01 
and exit code: 127
ExitCodeException exitCode=127:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:544)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:721)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

In the stderr log of the container, you can see the real deal:
{code}
Log Type: stderr
Log Upload Time: Tue Dec 09 09:15:05 -0800 2014
Log Length: 60
/bin/bash: /no/jvm/here/bin/java: No such file or directory
{code}

 Improve handling of container's stderr 
 ---

 Key: YARN-2934
 URL: https://issues.apache.org/jira/browse/YARN-2934
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Gera Shegalov
Assignee: Naganarasimha G R

 Most YARN applications redirect stderr to some file. That's why when 
 container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2014-12-09 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239785#comment-14239785
 ] 

Gera Shegalov commented on YARN-2934:
-

Hi [~xgong], 
bq. we already have a Environment.LOG_DIRS env which returns a comma separated 
list of log-dirs.

I am aware of this. However, each app is free to pump its standard error (fd 2) 
into any file under this dir. 

 Improve handling of container's stderr 
 ---

 Key: YARN-2934
 URL: https://issues.apache.org/jira/browse/YARN-2934
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Gera Shegalov
Assignee: Naganarasimha G R

 Most YARN applications redirect stderr to some file. That's why when 
 container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2899) Run TestDockerContainerExecutorWithMocks on Linux only

2014-11-24 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224095#comment-14224095
 ] 

Gera Shegalov commented on YARN-2899:
-

Thanks for the patch, [~mingma]. Explicit listing of supported OS is a better 
approach. +1 (non-binding)

 Run TestDockerContainerExecutorWithMocks on Linux only
 --

 Key: YARN-2899
 URL: https://issues.apache.org/jira/browse/YARN-2899
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ming Ma
Assignee: Ming Ma
Priority: Minor
 Attachments: YARN-2899.patch


 It seems the test should strictly check for Linux; otherwise it will fail 
 when the OS isn't Linux.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2893) AMLauncher: sporadic job failures due to EOFException in readTokenStorageStream

2014-11-21 Thread Gera Shegalov (JIRA)
Gera Shegalov created YARN-2893:
---

 Summary: AMLauncher: sporadic job failures due to EOFException in 
readTokenStorageStream
 Key: YARN-2893
 URL: https://issues.apache.org/jira/browse/YARN-2893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov


MapReduce jobs on our clusters experience sporadic failures due to corrupt 
tokens in the AM launch context.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2893) AMLauncher: sporadic job failures due to EOFException in readTokenStorageStream

2014-11-21 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221662#comment-14221662
 ] 

Gera Shegalov commented on YARN-2893:
-

Here is the stack trace:
{code}
 Got exception: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at 
org.apache.hadoop.security.Credentials.readTokenStorageStream(Credentials.java:189)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.setupTokens(AMLauncher.java:225)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.createAMContainerLaunchContext(AMLauncher.java:196)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:107)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:250)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

Since the launch context is corrupt, all subsequent app attempts, up to the 
max, fail as well. This is a non-deterministic Heisenbug that does not 
reproduce on job re-submission.
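
For reference, the token round trip around the AM launch context looks roughly 
like this (a sketch assuming the standard {{Credentials}} API; variable names 
are illustrative). The Heisenbug suggests the buffer is corrupted, or its 
position consumed, somewhere between the two halves:
{code}
import java.nio.ByteBuffer;

import org.apache.hadoop.io.DataInputByteBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.security.Credentials;

// Submission side: serialize the credentials into the launch context.
Credentials credentials = new Credentials();
DataOutputBuffer dob = new DataOutputBuffer();
credentials.writeTokenStorageToStream(dob);
ByteBuffer tokens = ByteBuffer.wrap(dob.getData(), 0, dob.getLength());

// RM side (cf. AMLauncher#setupTokens): read them back. Reading through a
// duplicate() leaves the shared buffer's position intact; a consumed or
// partially written buffer surfaces as exactly this kind of EOFException.
DataInputByteBuffer dibb = new DataInputByteBuffer();
dibb.reset(tokens.duplicate());
Credentials recovered = new Credentials();
recovered.readTokenStorageStream(dibb);
{code}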

 AMLauncher: sporadic job failures due to EOFException in readTokenStorageStream
 --

 Key: YARN-2893
 URL: https://issues.apache.org/jira/browse/YARN-2893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov

 MapReduce jobs on our clusters experience sporadic failures due to corrupt 
 tokens in the AM launch context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used

2014-11-14 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212671#comment-14212671
 ] 

Gera Shegalov commented on YARN-2862:
-

[~mingma], It's potentially already fixed by YARN-2010. We can try it for our 
scenario.

 RM might not start if the machine was hard shutdown and 
 FileSystemRMStateStore was used
 ---

 Key: YARN-2862
 URL: https://issues.apache.org/jira/browse/YARN-2862
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ming Ma

 This might be a known issue. Given FileSystemRMStateStore isn't used for HA 
 scenario, it might not be that important, unless there is something we need 
 to fix at RM layer to make it more tolerant to RMStore issue.
 When RM was hard shutdown, OS might not get a chance to persist blocks. Some 
 of the stored application data end up with size zero after reboot. And RM 
 didn't like that.
 {noformat}
 ls -al 
 /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351
 total 156
 drwxr-xr-x.    2 x y   4096 Nov 13 16:45 .
 drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 ..
 -rw-r--r--.    1 x y      0 Nov 13 16:45 appattempt_1412702189634_324351_01
 -rw-r--r--.    1 x y      0 Nov 13 16:45 .appattempt_1412702189634_324351_01.crc
 -rw-r--r--.    1 x y      0 Nov 13 16:45 application_1412702189634_324351
 -rw-r--r--.    1 x y      0 Nov 13 16:45 .application_1412702189634_324351.crc
 {noformat}
 When RM starts up
 {noformat}
 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem 
 opening checksum file: 
 file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351.
   Ignoring exception:
 java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:197)
 at java.io.DataInputStream.readFully(DataInputStream.java:169)
 at 
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
 at 
 org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501)
 ...
 2014-11-13 17:40:48,876 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to 
 load/recover state
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
 at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used

2014-11-14 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212935#comment-14212935
 ] 

Gera Shegalov commented on YARN-2862:
-

[~jianhe], to add more details: we use 2.4+patches, YARN-1185 is in 2.3.

 RM might not start if the machine was hard shutdown and 
 FileSystemRMStateStore was used
 ---

 Key: YARN-2862
 URL: https://issues.apache.org/jira/browse/YARN-2862
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ming Ma

 This might be a known issue. Given FileSystemRMStateStore isn't used for HA 
 scenario, it might not be that important, unless there is something we need 
 to fix at RM layer to make it more tolerant to RMStore issue.
 When RM was hard shutdown, OS might not get a chance to persist blocks. Some 
 of the stored application data end up with size zero after reboot. And RM 
 didn't like that.
 {noformat}
 ls -al 
 /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351
 total 156
 drwxr-xr-x.    2 x y   4096 Nov 13 16:45 .
 drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 ..
 -rw-r--r--.    1 x y      0 Nov 13 16:45 appattempt_1412702189634_324351_01
 -rw-r--r--.    1 x y      0 Nov 13 16:45 .appattempt_1412702189634_324351_01.crc
 -rw-r--r--.    1 x y      0 Nov 13 16:45 application_1412702189634_324351
 -rw-r--r--.    1 x y      0 Nov 13 16:45 .application_1412702189634_324351.crc
 {noformat}
 When RM starts up
 {noformat}
 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem 
 opening checksum file: 
 file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351.
   Ignoring exception:
 java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:197)
 at java.io.DataInputStream.readFully(DataInputStream.java:169)
 at 
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
 at 
 org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501)
 ...
 2014-11-13 17:40:48,876 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to 
 load/recover state
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
 at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2857) ConcurrentModificationException in ContainerLogAppender

2014-11-13 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211519#comment-14211519
 ] 

Gera Shegalov commented on YARN-2857:
-

A clean Jenkins run in the last build demonstrates that the patch fixes the 
reproducer from the previous build:
{code}
testAppendInClose(org.apache.hadoop.yarn.TestContainerLogAppender)  Time 
elapsed: 0.066 sec   ERROR!
java.util.ConcurrentModificationException: null
at 
java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:761)
at java.util.LinkedList$ListItr.next(LinkedList.java:696)
at 
org.apache.hadoop.yarn.ContainerLogAppender.close(ContainerLogAppender.java:81)
at 
org.apache.hadoop.yarn.TestContainerLogAppender.testAppendInClose(TestContainerLogAppender.java:44)
{code}

+1 (non-binding)
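
The usual fix pattern for this class of bug is to serialize {{append()}} and 
{{close()}} on the same lock, so the iterator in {{close()}} can never observe 
a concurrent structural change. A sketch under the assumption of a log4j 1.x 
{{FileAppender}} subclass (class and field names are illustrative, not the 
committed patch):
{code}
import java.util.LinkedList;
import java.util.Queue;

import org.apache.log4j.FileAppender;
import org.apache.log4j.spi.LoggingEvent;

// Illustrative appender that buffers the tail of the log in memory.
public class TailBufferAppender extends FileAppender {
  private static final int MAX_EVENTS = 1000; // illustrative cap
  private final Queue<LoggingEvent> tail = new LinkedList<LoggingEvent>();

  @Override
  public synchronized void append(LoggingEvent event) {
    if (tail.size() >= MAX_EVENTS) {
      tail.remove(); // drop the oldest buffered event
    }
    tail.add(event);
  }

  @Override
  public synchronized void close() {
    // Safe to iterate: append() cannot run concurrently under this lock.
    for (LoggingEvent event : tail) {
      super.append(event); // flush each buffered event to the file
    }
    super.close();
  }
}
{code}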

 ConcurrentModificationException in ContainerLogAppender
 ---

 Key: YARN-2857
 URL: https://issues.apache.org/jira/browse/YARN-2857
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Mohammad Kamrul Islam
Assignee: Mohammad Kamrul Islam
Priority: Critical
 Attachments: ContainerLogAppender.java, MAPREDUCE-6139-test.01.patch, 
 MAPREDUCE-6139.1.patch, MAPREDUCE-6139.2.patch, MAPREDUCE-6139.3.patch, 
 YARN-2857.3.patch


 Context:
 * Hadoop-2.3.0
 * Using Oozie 4.0.1
 * Pig version 0.11.x
 The job is submitted by Oozie to launch Pig script.
 The following exception traces were found on MR task log:
 In syslog:
 {noformat}
 2014-10-24 20:37:29,317 WARN [Thread-5] 
 org.apache.hadoop.util.ShutdownHookManager: ShutdownHook '' failed, 
 java.util.ConcurrentModificationException
 java.util.ConcurrentModificationException
   at 
 java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966)
   at java.util.LinkedList$ListItr.next(LinkedList.java:888)
   at 
 org.apache.hadoop.yarn.ContainerLogAppender.close(ContainerLogAppender.java:94)
   at 
 org.apache.log4j.helpers.AppenderAttachableImpl.removeAllAppenders(AppenderAttachableImpl.java:141)
   at org.apache.log4j.Category.removeAllAppenders(Category.java:891)
   at org.apache.log4j.Hierarchy.shutdown(Hierarchy.java:471)
   at org.apache.log4j.LogManager.shutdown(LogManager.java:267)
   at org.apache.hadoop.mapred.TaskLog.syncLogsShutdown(TaskLog.java:286)
   at org.apache.hadoop.mapred.TaskLog$2.run(TaskLog.java:339)
   at 
 org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 2014-10-24 20:37:29,395 INFO [main] 
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics 
 system...
 {noformat}
 in stderr:
 {noformat}
 java.util.ConcurrentModificationException
   at 
 java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966)
   at java.util.LinkedList$ListItr.next(LinkedList.java:888)
   at 
 org.apache.hadoop.yarn.ContainerLogAppender.close(ContainerLogAppender.java:94)
   at 
 org.apache.log4j.helpers.AppenderAttachableImpl.removeAllAppenders(AppenderAttachableImpl.java:141)
   at org.apache.log4j.Category.removeAllAppenders(Category.java:891)
   at 
 org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:759)
   at 
 org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:648)
   at 
 org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:514)
   at 
 org.apache.log4j.PropertyConfigurator.configure(PropertyConfigurator.java:440)
   at org.apache.pig.Main.configureLog4J(Main.java:740)
   at org.apache.pig.Main.run(Main.java:384)
   at org.apache.pig.PigRunner.run(PigRunner.java:49)
   at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:283)
   at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:223)
   at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:37)
   at org.apache.oozie.action.hadoop.PigMain.main(PigMain.java:76)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:483)
   at 
 org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:226)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at 
 

[jira] [Assigned] (YARN-2707) Potential null dereference in FSDownload

2014-10-20 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov reassigned YARN-2707:
---

Assignee: Gera Shegalov

 Potential null dereference in FSDownload
 

 Key: YARN-2707
 URL: https://issues.apache.org/jira/browse/YARN-2707
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ted Yu
Assignee: Gera Shegalov
Priority: Minor

 Here is related code in call():
 {code}
   Pattern pattern = null;
   String p = resource.getPattern();
   if (p != null) {
 pattern = Pattern.compile(p);
   }
   unpack(new File(dTmp.toUri()), new File(dFinal.toUri()), pattern);
 {code}
 In unpack():
 {code}
 RunJar.unJar(localrsrc, dst, pattern);
 {code}
 unJar() would dereference the pattern without checking whether it is null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2707) Potential null dereference in FSDownload

2014-10-20 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-2707:

Attachment: YARN-2707.v01.patch

Thanks for reporting this bug, [~yuzhih...@gmail.com]. It turns out that there 
is a test for this, but it was not checking {{Future#get}}. {{Future#isDone}} 
returns true for failed Callables as well.
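
To illustrate the pitfall (a hypothetical JUnit 4 test, not code from the 
patch):
{code}
import static org.junit.Assert.assertTrue;
import static org.junit.Assert.fail;

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.junit.Test;

public class TestFutureIsDonePitfall {
  @Test
  public void isDoneDoesNotSurfaceFailures() throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Callable<Void> failing = new Callable<Void>() {
      @Override
      public Void call() {
        throw new NullPointerException("pattern was null"); // simulated bug
      }
    };
    Future<Void> future = pool.submit(failing);
    try {
      while (!future.isDone()) { // becomes true even though call() threw
        Thread.sleep(10);
      }
      try {
        future.get(); // only get() rethrows the failure
        fail("expected an ExecutionException");
      } catch (ExecutionException e) {
        assertTrue(e.getCause() instanceof NullPointerException);
      }
    } finally {
      pool.shutdown();
    }
  }
}
{code}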

 Potential null dereference in FSDownload
 

 Key: YARN-2707
 URL: https://issues.apache.org/jira/browse/YARN-2707
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ted Yu
Assignee: Gera Shegalov
Priority: Minor
 Attachments: YARN-2707.v01.patch


 Here is related code in call():
 {code}
   Pattern pattern = null;
   String p = resource.getPattern();
   if (p != null) {
 pattern = Pattern.compile(p);
   }
   unpack(new File(dTmp.toUri()), new File(dFinal.toUri()), pattern);
 {code}
 In unpack():
 {code}
 RunJar.unJar(localrsrc, dst, pattern);
 {code}
 unJar() would dereference the pattern without checking whether it is null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2688) Better diagnostics on Container Launch failures

2014-10-14 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171426#comment-14171426
 ] 

Gera Shegalov commented on YARN-2688:
-

Localizer diagnostics were improved by YARN-2377.

 Better diagnostics on Container Launch failures
 ---

 Key: YARN-2688
 URL: https://issues.apache.org/jira/browse/YARN-2688
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Arun C Murthy

 We need better diagnostics on container launch failures due to errors like 
 localization issues, a wrong command for container launch, etc. Currently, if 
 the container doesn't launch, we get nothing - not even container logs, since 
 there are no logs to aggregate either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1542) Add unit test for public resource on viewfs

2014-10-14 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-1542:

Attachment: YARN-1542.v05.patch

v05: rebasing the patch again.

 Add unit test for public resource on viewfs
 ---

 Key: YARN-1542
 URL: https://issues.apache.org/jira/browse/YARN-1542
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1542.v01.patch, YARN-1542.v02.patch, 
 YARN-1542.v03.patch, YARN-1542.v04.patch, YARN-1542.v05.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2377) Localization exception stack traces are not passed as diagnostic info

2014-10-08 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-2377:

Attachment: YARN-2377.v02.patch

Thanks for the review, [~jlowe]! Your points are valid; uploading v02 to 
accommodate them.

 Localization exception stack traces are not passed as diagnostic info
 -

 Key: YARN-2377
 URL: https://issues.apache.org/jira/browse/YARN-2377
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-2377.v01.patch, YARN-2377.v02.patch


 In the Localizer log one can only see this kind of message
 {code}
 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
 hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
  1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos 
 tException: ha-nn-uri-0
 {code}
 And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is 
 propagated as diagnostics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info

2014-09-13 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132994#comment-14132994
 ] 

Gera Shegalov commented on YARN-2377:
-

Hi [~Naganarasimha], there is no deserialization in the sense of converting 
bytes back to the original exception class. These fields are already strings 
in yarn_protos.proto:
{code}
message SerializedExceptionProto {
  optional string message = 1;
  optional string trace = 2;
  optional string class_name = 3;
  optional SerializedExceptionProto cause = 4;
}
{code}

 Localization exception stack traces are not passed as diagnostic info
 -

 Key: YARN-2377
 URL: https://issues.apache.org/jira/browse/YARN-2377
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-2377.v01.patch


 In the Localizer log one can only see this kind of message
 {code}
 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
 hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
  1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos 
 tException: ha-nn-uri-0
 {code}
 And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is 
 propagated as diagnostics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info

2014-09-08 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125846#comment-14125846
 ] 

Gera Shegalov commented on YARN-2377:
-

[~kasha], do you agree with the points above?

 Localization exception stack traces are not passed as diagnostic info
 -

 Key: YARN-2377
 URL: https://issues.apache.org/jira/browse/YARN-2377
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-2377.v01.patch


 In the Localizer log one can only see this kind of message
 {code}
 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
 hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
  1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos 
 tException: ha-nn-uri-0
 {code}
 And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is 
 propagated as diagnostics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)

2014-08-28 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113398#comment-14113398
 ] 

Gera Shegalov commented on YARN-2405:
-

YARN-2405.2.patch LGTM. We should let JUnit catch the original exception; it 
will properly fail the test.
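
In other words, something along these lines (a hypothetical sketch, not code 
from the patch):
{code}
import static org.junit.Assert.fail;

import org.junit.Test;

public class TestRenderStyle {
  // Anti-pattern: catching and calling fail() loses the original stack trace.
  @Test
  public void testRenderWrapped() {
    try {
      render(); // hypothetical stand-in for the render call under test
    } catch (Exception e) {
      fail("render threw: " + e);
    }
  }

  // Preferred: let JUnit catch the exception; the test fails with the full
  // original stack trace.
  @Test
  public void testRenderPropagates() throws Exception {
    render();
  }

  private void render() throws Exception {
    // hypothetical stand-in for exercising FairSchedulerAppsBlock#render
  }
}
{code}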

 NPE in FairSchedulerAppsBlock (scheduler page)
 --

 Key: YARN-2405
 URL: https://issues.apache.org/jira/browse/YARN-2405
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Maysam Yabandeh
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2405.1.patch, YARN-2405.2.patch, YARN-2405.3.patch


 FairSchedulerAppsBlock#render throws NPE at this line
 {code}
   int fairShare = fsinfo.getAppFairShare(attemptId);
 {code}
 This causes the scheduler page to not show the app, since it lacks the 
 definition of appsTableData
 {code}
  Uncaught ReferenceError: appsTableData is not defined 
 {code}
 The problem is temporary, meaning it usually resolves itself either 
 after a retry or after a few hours.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)

2014-08-28 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114738#comment-14114738
 ] 

Gera Shegalov commented on YARN-2405:
-

bq. ... I think both designs are acceptable in this case. 

True, but one has more code for no reason.


 NPE in FairSchedulerAppsBlock (scheduler page)
 --

 Key: YARN-2405
 URL: https://issues.apache.org/jira/browse/YARN-2405
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Maysam Yabandeh
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2405.1.patch, YARN-2405.2.patch, YARN-2405.3.patch, 
 YARN-2405.4.patch


 FairSchedulerAppsBlock#render throws NPE at this line
 {code}
   int fairShare = fsinfo.getAppFairShare(attemptId);
 {code}
 This causes the scheduler page to not show the app, since it lacks the 
 definition of appsTableData
 {code}
  Uncaught ReferenceError: appsTableData is not defined 
 {code}
 The problem is temporary, meaning it usually resolves itself either 
 after a retry or after a few hours.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)

2014-08-27 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112028#comment-14112028
 ] 

Gera Shegalov commented on YARN-2405:
-

+1 for not masking NPE bugs. Catching NPE wholesale is a performance problem, 
and it would catch irrelevant NPEs as well.

 NPE in FairSchedulerAppsBlock (scheduler page)
 --

 Key: YARN-2405
 URL: https://issues.apache.org/jira/browse/YARN-2405
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Maysam Yabandeh
Assignee: Tsuyoshi OZAWA

 FairSchedulerAppsBlock#render throws NPE at this line
 {code}
   int fairShare = fsinfo.getAppFairShare(attemptId);
 {code}
 This causes the scheduler page to not show the app, since it lacks the 
 definition of appsTableData
 {code}
  Uncaught ReferenceError: appsTableData is not defined 
 {code}
 The problem is temporary, meaning it usually resolves itself either 
 after a retry or after a few hours.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info

2014-08-25 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110221#comment-14110221
 ] 

Gera Shegalov commented on YARN-2377:
-

Hi [~kasha], I considered {{StringUtils#stringifyException}} but discarded it 
due to the following disadvantages: 
# redundant deserialization of the exception object just for the sake of 
serializing it right away
# as a consequence, hypothetically, when the localization service runs as a 
separate process with a dedicated classpath, we can encounter a 
{{ClassNotFoundException}} during deserialization
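
A sketch of the string-only alternative (assuming the protobuf-generated 
accessors for {{SerializedExceptionProto}}; the helper name is made up):
{code}
// Build the diagnostics text purely from the proto's string fields, never
// instantiating the original exception class, so there is no
// ClassNotFoundException risk on a classpath that lacks that class.
static String toDiagnostics(SerializedExceptionProto p) {
  StringBuilder sb = new StringBuilder();
  sb.append(p.getClassName()).append(": ").append(p.getMessage()).append('\n');
  sb.append(p.getTrace());
  if (p.hasCause()) {
    sb.append("Caused by: ").append(toDiagnostics(p.getCause()));
  }
  return sb.toString();
}
{code}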

 Localization exception stack traces are not passed as diagnostic info
 -

 Key: YARN-2377
 URL: https://issues.apache.org/jira/browse/YARN-2377
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-2377.v01.patch


 In the Localizer log one can only see this kind of message
 {code}
 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
 hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
  1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos 
 tException: ha-nn-uri-0
 {code}
 And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is 
 propagated as diagnostics.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2377) Localization exception stack traces are not passed as diagnostic info

2014-07-31 Thread Gera Shegalov (JIRA)
Gera Shegalov created YARN-2377:
---

 Summary: Localization exception stack traces are not passed as 
diagnostic info
 Key: YARN-2377
 URL: https://issues.apache.org/jira/browse/YARN-2377
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov


In the Localizer log one can only see this kind of message
{code}
14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: 
ha-nn-uri-0
{code}

And then onlt {{ java.net.UnknownHos tException: ha-nn-uri-0}} message is 
propagated as diagnostics.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2377) Localization exception stack traces are not passed as diagnostic info

2014-07-31 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-2377:


Description: 
In the Localizer log one can only see this kind of message
{code}
14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: 
ha-nn-uri-0
{code}

And then only {{ java.net.UnknownHos tException: ha-nn-uri-0}} message is 
propagated as diagnostics.

  was:
In the Localizer log one can only see this kind of message
{code}
14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: 
ha-nn-uri-0
{code}

And then onlt {{ java.net.UnknownHos tException: ha-nn-uri-0}} message is 
propagated as diagnostics.


 Localization exception stack traces are not passed as diagnostic info
 -

 Key: YARN-2377
 URL: https://issues.apache.org/jira/browse/YARN-2377
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov

 In the Localizer log one can only see this kind of message
 {code}
 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
 hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
  1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos 
 tException: ha-nn-uri-0
 {code}
 And then only {{ java.net.UnknownHos tException: ha-nn-uri-0}} message is 
 propagated as diagnostics.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2377) Localization exception stack traces are not passed as diagnostic info

2014-07-31 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-2377:


Description: 
In the Localizer log one can only see this kind of message
{code}
14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: 
ha-nn-uri-0
{code}

And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is 
propagated as diagnostics.

  was:
In the Localizer log one can only see this kind of message
{code}
14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: 
ha-nn-uri-0
{code}

And then only {{ java.net.UnknownHos tException: ha-nn-uri-0}} message is 
propagated as diagnostics.


 Localization exception stack traces are not passed as diagnostic info
 -

 Key: YARN-2377
 URL: https://issues.apache.org/jira/browse/YARN-2377
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov

 In the Localizer log one can only see this kind of message
 {code}
 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { 
 hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar,
  1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos 
 tException: ha-nn-uri-0
 {code}
 And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is 
 propagated as diagnostics.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2377) Localization exception stack traces are not passed as diagnostic info

2014-07-31 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-2377:


Attachment: YARN-2377.v01.patch

v01 for review. With this you get a more actionable stack trace:

{code}
14/07/31 17:46:39 INFO mapreduce.Job: Job job_1406853387336_0001 failed with 
state FAILED due to: Application application_1406853387336_0001 failed 2 times 
due to AM Container for appattempt_1406853387336_0001_02 exited with  
exitCode: -1000
For more detailed output, check application tracking 
page:http://tw-mbp-gshegalov:8088/proxy/application_1406853387336_0001/Then, 
click on links to logs of each attempt.
Diagnostics: java.net.UnknownHostException: ha-nn-uri-0
java.lang.IllegalArgumentException: java.net.UnknownHostException: ha-nn-uri-0
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:260)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:607)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:552)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:139)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2590)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2624)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2606)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:248)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:60)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:354)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:394)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1626)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:353)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
Caused by: java.net.UnknownHostException: ha-nn-uri-0
... 29 more
Caused by: ha-nn-uri-0
java.lang.IllegalArgumentException: java.net.UnknownHostException: ha-nn-uri-0
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:260)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:607)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:552)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:139)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2590)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2624)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2606)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:248)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:60)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:354)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:394)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1626)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:353)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
at 

[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-07-29 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077453#comment-14077453
 ] 

Gera Shegalov commented on YARN-796:


Hi [~yufeldman], thanks for posting the patch. Please rebase it since it no 
longer applies.

 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: LabelBasedScheduling.pdf, 
 Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture, etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1741) XInclude support broken for YARN ResourceManager

2014-07-04 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052744#comment-14052744
 ] 

Gera Shegalov commented on YARN-1741:
-

I made a modification to my patch (v04) for HADOOP-10623 to highlight how the 
xi:include issue can easily be resolved by using the existing 
FsUrlStreamHandler that I stumbled upon.
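
For illustration, the registration is essentially a one-liner (a sketch; it 
must run once per JVM before any FileSystem-scheme URL is constructed, and the 
path below is made up):
{code}
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

public class FsUrlDemo {
  public static void main(String[] args) throws Exception {
    // Register once per JVM so java.net.URL understands FileSystem schemes.
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory(new Configuration()));
    // With resolvable URLs, the XML parser gets a usable base (systemId)
    // for resolving relative xi:include hrefs.
    URL confUrl = new URL("hdfs://nn-host:8020/conf/yarn-site.xml"); // made-up path
    System.out.println(confUrl);
  }
}
{code}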

 XInclude support broken for YARN ResourceManager
 

 Key: YARN-1741
 URL: https://issues.apache.org/jira/browse/YARN-1741
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Eric Sirianni
Assignee: Xuan Gong
Priority: Critical
  Labels: regression

 The XInclude support in Hadoop configuration files (introduced via 
 HADOOP-4944) was broken by the recent {{ConfigurationProvider}} changes to 
 YARN ResourceManager.  Specifically, YARN-1459 and, more generally, the 
 YARN-1611 family of JIRAs for ResourceManager HA.
 The issue is that {{ConfigurationProvider}} provides a raw {{InputStream}} as 
 a {{Configuration}} resource for what was previously a {{Path}}-based 
 resource.  
 For {{Path}} resources, the absolute file path is used as the {{systemId}} 
 for the {{DocumentBuilder.parse()}} call:
 {code}
   } else if (resource instanceof Path) {  // a file resource
 ...
   doc = parse(builder, new BufferedInputStream(
   new FileInputStream(file)), ((Path)resource).toString());
 }
 {code}
 The {{systemId}} is used to resolve XIncludes (among other things):
 {code}
 /**
  * Parse the content of the given <code>InputStream</code> as an
  * XML document and return a new DOM Document object.
 ...
  * @param systemId Provide a base for resolving relative URIs.
 ...
  */
 public Document parse(InputStream is, String systemId)
 {code}
 However, for loading raw {{InputStream}} resources, the {{systemId}} is set 
 to {{null}}:
 {code}
   } else if (resource instanceof InputStream) {
 doc = parse(builder, (InputStream) resource, null);
 {code}
 causing XInclude resolution to fail.
 In our particular environment, we make extensive use of XIncludes to 
 standardize common configuration parameters across multiple Hadoop clusters.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1741) XInclude support broken for YARN ResourceManager

2014-06-25 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043758#comment-14043758
 ] 

Gera Shegalov commented on YARN-1741:
-

Since there is a general problem of loading conf via an InputStream, to support 
these cases we need to enable users to pass a custom EntityResolver.

We should implement this kind of method:
{code}
Configuration#addResource(InputStream is, EntityResolver er)
{code}
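
Under the hood, such an overload would presumably wire the resolver into the 
DOM parser. A sketch of that mechanism using the standard JAXP API (the class 
name, method name, and base path are illustrative; this is not existing 
Configuration code):
{code}
import java.io.InputStream;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;

public class ResolverSketch {
  static Document parseWithResolver(InputStream is, EntityResolver er)
      throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setXIncludeAware(true); // honor xi:include directives
    DocumentBuilder builder = factory.newDocumentBuilder();
    builder.setEntityResolver(er); // caller-supplied resolution policy
    // An explicit base systemId (illustrative path) keeps relative hrefs
    // resolvable even though the resource is a raw InputStream.
    return builder.parse(is, "file:///etc/hadoop/conf/");
  }
}
{code}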

 XInclude support broken for YARN ResourceManager
 

 Key: YARN-1741
 URL: https://issues.apache.org/jira/browse/YARN-1741
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Eric Sirianni
Priority: Critical
  Labels: regression

 The XInclude support in Hadoop configuration files (introduced via 
 HADOOP-4944) was broken by the recent {{ConfigurationProvider}} changes to 
 YARN ResourceManager.  Specifically, YARN-1459 and, more generally, the 
 YARN-1611 family of JIRAs for ResourceManager HA.
 The issue is that {{ConfigurationProvider}} provides a raw {{InputStream}} as 
 a {{Configuration}} resource for what was previously a {{Path}}-based 
 resource.  
 For {{Path}} resources, the absolute file path is used as the {{systemId}} 
 for the {{DocumentBuilder.parse()}} call:
 {code}
   } else if (resource instanceof Path) {  // a file resource
 ...
   doc = parse(builder, new BufferedInputStream(
   new FileInputStream(file)), ((Path)resource).toString());
 }
 {code}
 The {{systemId}} is used to resolve XIncludes (among other things):
 {code}
 /**
  * Parse the content of the given <code>InputStream</code> as an
  * XML document and return a new DOM Document object.
 ...
  * @param systemId Provide a base for resolving relative URIs.
 ...
  */
 public Document parse(InputStream is, String systemId)
 {code}
 However, for loading raw {{InputStream}} resources, the {{systemId}} is set 
 to {{null}}:
 {code}
   } else if (resource instanceof InputStream) {
 doc = parse(builder, (InputStream) resource, null);
 {code}
 causing XInclude resolution to fail.
 In our particular environment, we make extensive use of XIncludes to 
 standardize common configuration parameters across multiple Hadoop clusters.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1897) Define SignalContainerRequest and SignalContainerResponse

2014-05-20 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002875#comment-14002875
 ] 

Gera Shegalov commented on YARN-1897:
-

I am confused, [~mingma]. I thought we agreed to do it as YARN-1515.

 Define SignalContainerRequest and SignalContainerResponse
 -

 Key: YARN-1897
 URL: https://issues.apache.org/jira/browse/YARN-1897
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: Ming Ma
Assignee: Ming Ma
 Attachments: YARN-1897-2.patch, YARN-1897-3.patch, YARN-1897-4.patch, 
 YARN-1897.1.patch


 We need to define SignalContainerRequest and SignalContainerResponse first, as 
 they are needed by other subtasks. SignalContainerRequest should use 
 OS-independent commands and provide a way for the application to specify a 
 reason for diagnosis. SignalContainerResponse might be empty.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC

2014-05-14 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997266#comment-13997266
 ] 

Gera Shegalov commented on YARN-1515:
-

OK, I can work on CMP.signalContainer and replace stopContainers with 
signalContainer.

 Ability to dump the container threads and stop the containers in a single RPC
 -

 Key: YARN-1515
 URL: https://issues.apache.org/jira/browse/YARN-1515
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, 
 YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, 
 YARN-1515.v06.patch, YARN-1515.v07.patch


 This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for 
 timed-out task attempts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC

2014-05-10 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-1515:


Attachment: YARN-1515.v07.patch

v07 addressing Jason's review. Thanks!

 Ability to dump the container threads and stop the containers in a single RPC
 -

 Key: YARN-1515
 URL: https://issues.apache.org/jira/browse/YARN-1515
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, 
 YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, 
 YARN-1515.v06.patch, YARN-1515.v07.patch


 This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for 
 timed-out task attempts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

